Modular Predictor for Day-Ahead Load Forecasting and Feature Selection for Different Hours

To improve on the accuracy of day-ahead load forecasting with a single model, a novel modular parallel forecasting model with feature selection was proposed. First, load features were extracted from historical load data with lags from the previous 24 h to the previous 168 h, together with calendar features. Second, a feature selection combined with a predictor process was carried out to select the optimal feature subset for building a reliable predictor for each hour. The final modular model consisted of 24 predictors, each with its own optimal feature subset, for day-ahead load forecasting. New England and Singapore load data were used to evaluate the effectiveness of the proposed method. The results indicated that the accuracy of the proposed modular model was higher than that of the traditional method. Furthermore, conducting a feature selection step when building a predictor improved the accuracy of load forecasting.


Introduction
The main idea of short-term load forecasting (STLF) is to predict future loads with horizons of a few hours to several days. Accurate STLF predictions play a vital role in electrical department load dispatch, unit commitment, and electricity market trading [1]. With the permeation of renewable resources in grids and the technological innovation of electric vehicles, load components become more complex and make STLF difficult; therefore, strict requirements of stability and accuracy are needed [2][3][4][5][6].
STLF is an old but worthy theme for research. General forecasting methods can be divided into two branches: the statistical method and the artificial intelligence method. Statistical methods such as regression analysis, exponential smoothing, the Kalman filter, and the autoregressive integrated moving average (ARIMA) are easy to apply, but modeling is difficult for complex loads [7][8][9]. Artificial intelligence methods show better performance than statistical methods in load forecasting and include fuzzy logic, the artificial neural network (ANN), the support vector machine (SVM), Gaussian process regression (GPR), and random forest (RF) [10][11][12][13][14][15][16][17]. In fuzzy logic, the relationship between input and output is determined by a list of rules. However, the prior knowledge required to select the parameters of the membership function and the rules makes the modeling process complex [18]. The artificial neural network is applied to the STLF of power systems owing to its self-learning ability and robustness to data noise. However, shortcomings such as the difficulty in determining initial network parameters and over-fitting still exist [19]. By adopting a structural risk minimization principle, the complexity and the learning ability of an SVM can be balanced.

Filter Method for Feature Selection
The filter method is a feature ranking method that computes a numerical value for each feature to evaluate its importance. Therefore, the estimation of a feature is important to the feature selection result. MI, CMI, and RreliefF methods were used as filters in this study.

Mutual Information
The Mutual Information (MI) method measures the common information between two random variables. For two random variables X and Y, the MI between X and Y can be estimated as:

I(X, Y) = ∑_{x∈X} ∑_{y∈Y} P(x, y) log [P(x, y) / (P(x)P(y))]  (1)

where P(x) and P(y) are the marginal probability density functions corresponding to X and Y, respectively, and P(x, y) is the joint probability density function. In load forecasting, the feature is defined as X, the target variable is defined as Y, and I(X, Y) represents the strength of their relevance. The larger I(X, Y) is, the more dependent X and Y are. If I(X, Y) is zero, X and Y are independent. The MI method can measure the relevance between a feature and a target variable effectively; however, it does not account for the redundancy among features.
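As a concrete illustration (not the authors' implementation), the MI of Equation (1) between a candidate load feature and the target can be estimated from samples with a simple histogram-based plug-in estimator:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in MI estimate I(X;Y) in nats, from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities P(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)       # marginal P(y)
    mask = pxy > 0                            # avoid log(0)
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
load = rng.normal(size=5000)
lagged = load + 0.1 * rng.normal(size=5000)   # strongly related feature
noise = rng.normal(size=5000)                 # irrelevant feature
# A relevant lagged feature scores much higher than independent noise.
```

The toy arrays stand in for a lagged load feature and the target; for real load data, the same function would be applied to each candidate lag.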

Conditional Mutual Information
The Conditional Mutual Information (CMI) method measures the relevance of two variables when a third variable Z is known. In the feature selection of load forecasting, suppose the selected feature set is S; the CMI between feature X_i and target Y is defined as:

I(Y; X_i | S) = H(Y | S) − H(Y | X_i, S)  (2)

where I(Y; X_i | S) represents the new information that X_i supplies beyond S. The larger I(Y; X_i | S) is, the more information X_i can supply and the lower its redundancy with respect to S. Compared to the MI method, the redundancy among features can be reduced by CMI.
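For discretized variables, I(Y; X_i | S) can be computed through the entropy identity I(Y; X | S) = H(Y, S) + H(X, S) − H(S) − H(X, Y, S). The sketch below (a generic discrete estimator, not necessarily the paper's) shows how a candidate redundant with S scores zero while a genuinely informative one scores high:

```python
import numpy as np

def entropy(*cols):
    """Empirical joint entropy (nats) of one or more discrete columns."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def conditional_mi(y, x, s):
    """I(Y; X | S) = H(Y,S) + H(X,S) - H(S) - H(X,Y,S) for discrete data."""
    return entropy(y, s) + entropy(x, s) - entropy(s) - entropy(x, y, s)

rng = np.random.default_rng(1)
s = rng.integers(0, 4, size=4000)            # already-selected feature
informative = rng.integers(0, 2, size=4000)  # supplies new information
y = s + 4 * informative                      # target depends on s AND informative
redundant = s.copy()                         # candidate identical to s
# conditional_mi(y, redundant, s) is ~0; conditional_mi(y, informative, s) is ~ln 2.
```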

RreliefF
RreliefF is the extended version of relief for regression [36]. It measures feature quality by evaluating a feature weight. Relief works by randomly selecting an instance and searching for the nearest neighbors from the same class and from a different class. The weight W[X_i] of feature X_i estimated by relief is an approximation of the difference of probabilities:

W[X_i] = P(different value of X_i | nearest instance from a different class) − P(different value of X_i | nearest instance from the same class)  (3)

In classification, the probability of two instances belonging to different classes can be evaluated by their relative distance. However, for STLF the predicted value is continuous; therefore, Equation (3) must be reformulated for RreliefF. Using Bayes' theorem, W[X_i] can be obtained as:

W[X_i] = P_diffC|diffX P_diffX / P_diffC − (1 − P_diffC|diffX) P_diffX / (1 − P_diffC)  (4)

where P_diffX is the probability that two nearest instances have different values of X_i, P_diffC is the probability that they have different prediction values, and P_diffC|diffX is the probability of different prediction values given different values of X_i.

Embedded Method for Feature Selection
In the embedded method, feature selection is performed during the training process, where the contribution of the feature combination is efficiently evaluated. The embedded method can be directly applied to STLF and can collaborate with other feature selection methods according to their estimated importance.

Classification and Regression Tree
The Classification and Regression Tree (CART) method uses a binary recursive partitioning algorithm [37]. By splitting the current samples into two sub-samples, a father node generates two child nodes. The final model of CART is a simple binary tree.
The generation of the CART can be divided into two steps:

Step one: first, the root node is split. The best feature X_best chosen from the feature set serves as the criterion for node splitting. To select the best feature, the minimum variance of the child nodes is the objective function. The variance of a child node q obtained by splitting on X_i is defined as:

var(q) = ∑_{y_i ∈ q} (y_i − ȳ_q)²  (5)

where ȳ_q is the average of the observation values y_i at node q. The importance of feature X_i according to the variance is defined as:

VI(X_i) = var(q) − [var(q_l) + var(q_r)]  (6)

where q_l and q_r are the child nodes generated by splitting node q on X_i.

Step two: for each child node, repeat Step one until the CART grows completely. The predictive model can be expressed as t(x, T), where T = {(x_i, y_i), i = 1, 2, . . ., n} is the training set. For STLF, the forecast value of the load ŷ is obtained when inputting the new x:

ŷ = t(x, T)  (7)
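The variance criterion of Equation (5) can be sketched as an exhaustive search over features and thresholds. This toy implementation (not the rpart code used in the paper) picks the feature whose split minimizes the summed child-node variance:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively pick the (feature, threshold) minimizing the summed
    child-node variance of Eq. (5)."""
    best_j, best_thr, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            score = ((left - left.mean()) ** 2).sum() + \
                    ((right - right.mean()) ** 2).sum()
            if score < best_score:
                best_j, best_thr, best_score = j, thr, score
    return best_j, best_thr, best_score

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 3))
# The target steps on feature 1 at 0.5, so the split should recover feature 1.
y = np.where(X[:, 1] > 0.5, 10.0, 0.0) + 0.1 * rng.normal(size=200)
j, thr, _ = best_split(X, y)
```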

Random Forest
Random Forest (RF) is a machine-learning algorithm that combines CARTs with bootstrap sampling for classification and regression [38]. For a training set T with n samples, bootstrap sampling means randomly selecting n samples from T with replacement. The probability of each sample being selected is 1/n, which means one sample may appear several times. After a complete bootstrap sample, the samples that were not sampled form the out-of-bag (OOB) dataset. Different from CART, the feature for node splitting in RF is selected from m features which are chosen at random from the original feature set. The basis of selecting the best feature for node splitting is Equation (5). The predictive output of RF is obtained by averaging the results of the trees:

ŷ = (1/N_t) ∑_{i=1}^{N_t} t_i(x, T_i)  (8)

where N_t is the number of trees.
In addition, the OOB error and the importance of each feature are computed in the process of modeling. Each tree has an OOB dataset, and the OOB error is evaluated by predicting the OOB dataset using the tree model corresponding to it. The OOB error of the ith tree is defined as:

e_i = (1/n_OOB) ∑_{(x_j, y_j) ∈ OOB_i} (y_j − t_i(x_j))²  (9)

A feature's importance is estimated by permuting the feature and averaging the difference of the OOB errors before and after the permutation over all trees. For instance, for the ith tree whose OOB data is OOB_i and OOB error is e_i, after permutation the new OOB data is OOB_i′ and the OOB error is e_i′. The feature's importance in this tree is computed as:

VI_i(X) = e_i′ − e_i  (10)
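The bootstrap/OOB mechanics and the permutation-importance idea of Equation (10) can be illustrated as follows; `predict` is a hypothetical stand-in for a fitted tree, not an actual RF model:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000
boot = rng.integers(0, n, size=n)        # bootstrap: n draws with replacement
oob = np.setdiff1d(np.arange(n), boot)   # indices never drawn -> OOB set
oob_fraction = oob.size / n              # theory: (1 - 1/n)^n -> 1/e ~ 0.368

def permutation_delta(predict, X_oob, y_oob, j, rng):
    """Increase in OOB squared error after permuting feature j (Eq. (10) idea)."""
    base = np.mean((predict(X_oob) - y_oob) ** 2)       # e_i
    Xp = X_oob.copy()
    rng.shuffle(Xp[:, j])                # destroy feature j's information
    return np.mean((predict(Xp) - y_oob) ** 2) - base   # e_i' - e_i

X = rng.normal(size=(500, 2))
y = 2.0 * X[:, 0]                        # target depends only on feature 0
predict = lambda A: 2.0 * A[:, 0]        # hypothetical fitted model
d0 = permutation_delta(predict, X, y, 0, rng)   # large: feature 0 matters
d1 = permutation_delta(predict, X, y, 1, rng)   # zero: feature 1 is irrelevant
```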

The Short-Term Load Forecasting (STLF) Predictor
Selecting an appropriate predictor is key to improving the accuracy of STLF. Five state-of-the-art predictors were applied in this study: support vector regression (SVR), back-propagation neural network (BPNN), CART, GPR, and RF. The SVR, BPNN, and GPR are introduced briefly in this section. The detailed mathematical theories of these algorithms are given in the references [39][40][41].

Support Vector Regression
By using the ε-insensitive loss function, the Support Vector Machine (SVM), which is used only for classification, is extended for regression so that it can be applied to load forecasting in power systems; the extension is called support vector regression (SVR).
Given a training set T, a model that decreases the difference between the predictive value f(x) and the true load y as much as possible is expected:

f(x) = w^T φ(x) + b  (11)

In SVR, the maximum difference that can be tolerated between f(x) and y is ε. The mathematical model can be expressed as the dual problem:

max_{α, α*}  −(1/2) ∑_{i,j=1}^{n} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) − ε ∑_{i=1}^{n} (α_i + α_i*) + ∑_{i=1}^{n} y_i (α_i − α_i*)
s.t.  ∑_{i=1}^{n} (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C  (12)

where C is the regularization parameter, K(x_i, x_j) = φ(x_i)·φ(x_j) is the kernel function, and α_i, α_i* are the Lagrange multipliers.
The radial basis function selected in this study is expressed as:

K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))  (13)

where σ² is the kernel width.
The SVR model is obtained by solving Equation (12):

f(x) = ∑_{i=1}^{n} (α_i − α_i*) K(x_i, x) + b  (14)

where b is the bias value.
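A minimal sketch of Equations (13) and (14), with hypothetical coefficients α_i − α_i* (i.e., not obtained by actually solving the dual of Equation (12)):

```python
import numpy as np

def rbf_kernel(A, B, sigma2=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2*sigma^2)) for all row pairs, Eq. (13)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def svr_predict(X_train, coef, b, X_new, sigma2=1.0):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b, the shape of Eq. (14)."""
    return rbf_kernel(X_new, X_train, sigma2) @ coef + b

rng = np.random.default_rng(4)
X_train = rng.normal(size=(5, 2))
coef = np.array([0.5, -0.2, 0.0, 0.1, -0.4])  # hypothetical (alpha_i - alpha_i*)
K = rbf_kernel(X_train, X_train)
# K is symmetric with unit diagonal, as an RBF Gram matrix must be.
```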

Back-Propagation Neural Network
The Back-Propagation Neural Network (BPNN) is a type of artificial neural network consisting of an input layer, a hidden layer, and an output layer, trained by a back-propagation algorithm with the mean squared error (MSE) as the objective function. The main idea of the BPNN is to deliver the output-layer error from back to front, by which the error of the hidden layer is computed. The learning process of the BPNN is divided into two steps:

Step 1: The output of each neural unit in the input and hidden layers is estimated.
Step 2: Using the output error, the error of each neural unit is computed and used for updating the weights of the preceding layer.
The objective function of the gradient minimization is based on:

e_f = (1/2) ∑_i (y_i − ŷ_i)²  (15)

where y_i is the actual value of neural unit i and ŷ_i is the predictive value. To compute the minimum value of e_f, a modification value is needed to correct the weights. The modification value is defined as:

Δw_ij(t + 1) = −η ∂e_f/∂w_ij + α Δw_ij(t)  (16)

where η is the learning rate, α is the momentum factor, and the gradient ∂e_f/∂w_ij is computed through net_i, the input of neuron i, and O_i, the output of neuron i. The modified weight is:

w_ij(t + 1) = w_ij(t) + Δw_ij(t + 1)  (17)

The final output ŷ_i of neuron i is obtained by iterating the weights w_ij until the precision requirement is met.
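The two-step learning process with the momentum update of Equations (15)–(17) can be sketched for a one-hidden-layer network on a toy regression task (a from-scratch illustration; the layer sizes, data, and hyper-parameters here are assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]              # toy regression target

n_hidden = 2 * X.shape[1] + 1                    # 2*N_feature + 1 rule from the text
W1 = rng.normal(0.0, 0.5, size=(3, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1));  b2 = np.zeros(1)
vW1 = np.zeros_like(W1); vW2 = np.zeros_like(W2)
eta, alpha = 0.1, 0.5                            # learning rate, momentum factor

losses = []
for _ in range(1000):
    # Step 1: forward pass -- outputs of the hidden and output layers.
    H = np.tanh(X @ W1 + b1)
    out = (H @ W2 + b2).ravel()
    err = out - y
    losses.append(0.5 * np.mean(err ** 2))       # e_f, Eq. (15)
    # Step 2: back-propagate the output error to get layer gradients.
    g2 = H.T @ err[:, None] / len(y)
    dH = (err[:, None] @ W2.T) * (1.0 - H ** 2)  # hidden-layer error
    g1 = X.T @ dH / len(y)
    # Momentum update, Eqs. (16)-(17): v = -eta*grad + alpha*v; w += v.
    vW2 = -eta * g2 + alpha * vW2; W2 += vW2; b2 -= eta * err.mean()
    vW1 = -eta * g1 + alpha * vW1; W1 += vW1; b1 -= eta * dH.mean(axis=0)
```

The training loss should fall steadily as the iterations proceed, which is the behavior Equations (15)–(17) are designed to produce.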

Gaussian Process Regression (GPR)
Gaussian Process Regression (GPR) is based on a random process in which the random variables obey a joint Gaussian distribution, and it is used to establish the map between inputs and outputs. For STLF, the collected load data are polluted by noise. Assuming that the noise follows a normal distribution ε ~ N(0, σ_n²), the joint prior distribution of the observations y and the predictive value f* is defined as:

[y; f*] ~ N( 0, [ K(X, X) + σ_n² I_n, K(X, x*); K(x*, X), k(x*, x*) ] )  (18)

where n is the number of training samples, K(X, X) is the covariance matrix, and I_n is the identity matrix.
The posterior distribution of f* is defined as:

f* | X, y, x* ~ N( f̄*, cov(f*) )  (19)

where f̄* is the mean value of f* and cov(f*) is the variance of f*.
Then, f̄* and cov(f*) can be computed as:

f̄* = K(x*, X) [K(X, X) + σ_n² I_n]⁻¹ y  (20)

cov(f*) = k(x*, x*) − K(x*, X) [K(X, X) + σ_n² I_n]⁻¹ K(X, x*)  (21)

The covariance function of GPR is the squared exponential function:

k(x_i, x_j) = σ_f² exp( −(1/2) (x_i − x_j)^T M⁻¹ (x_i − x_j) )  (22)

where θ = (M, σ_f², σ_n²) is the set of hyper-parameters that can be solved by the maximum likelihood method [41].
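Equations (20) and (21) translate directly into a few lines of linear algebra. This sketch uses a one-dimensional input and fixed hyper-parameters rather than maximum-likelihood tuning:

```python
import numpy as np

def sq_exp(a, b, sf2=1.0, ell=1.0):
    """Squared-exponential covariance, Eq. (22), for 1-D inputs."""
    return sf2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gpr_posterior(X, y, x_star, sn2=1e-4):
    """Posterior mean (Eq. (20)) and covariance (Eq. (21))."""
    K = sq_exp(X, X) + sn2 * np.eye(len(X))
    Ks = sq_exp(x_star, X)
    alpha = np.linalg.solve(K, y)        # [K + sn2*I]^{-1} y, without explicit inverse
    mean = Ks @ alpha
    cov = sq_exp(x_star, x_star) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.linspace(0.0, 3.0, 10)
y = np.sin(X)
mean, cov = gpr_posterior(X, y, np.array([X[5]]))
# At a training input, the posterior mean nearly reproduces the observed target
# and the posterior variance collapses toward the noise level.
```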

Load Analysis
Affected by different factors, the load sequence appears as a complicated non-linear time series. To construct a reasonable original feature set and achieve better forecasting for a region, the load characteristics and other factors should be analyzed.
Figure 1 shows the power load of New England over different time lengths. By observing Figure 1a,b, the load patterns from 2011 to 2013 are similar. Influenced by climate, load patterns differ by season. In Figure 1c, the load curves of two continuous weeks in the four seasons are presented (the first day is Monday). It is easy to see that the weekday and weekend load demands differ, and the load demand presents a cycling mode with a period of seven days. The Tuesday load curves of the different seasons shown in Figure 1d show that the Tuesday load pattern of different weeks is similar. The load increased rapidly from 6:00 to 11:00, which corresponds to the beginning of work, and reached the first peak load. The second peak load occurred from 19:00 to 20:00. As analyzed above, the load characteristics can be summarized as:
(1) The same-day load patterns are similar and represent the week-cycle of the load.
(2) The weekday and weekend load patterns are similar, respectively, and represent the day-cycle of the load.

Candidate Feature Set
An appropriate feature set plays a significant role in modeling an uncomplicated but outstanding predictor. However, a candidate feature set that contains sufficient information must be found to ensure that effective features can be selected by the feature selection method. The two main feature types are the endogenous predictors (load features) and the exogenous predictors (calendar features).
The time interval between the predictive moment and the submission of the dispatch department's forecasting result should be considered when extracting features. To ensure the universality of the original feature set, we used the interval time p = 24. A feature set consisting of 145 historic load features (from lag 24 to lag 168) from a one-week data window was chosen as a part of the candidate feature set. The maximum, minimum, and mean loads of the window were also included. In addition to the load features, calendar features such as hour of day, day type, working day, and non-working day were also considered. The candidate feature set with 173 features was formed as shown in Table 1.
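The construction of the candidate set can be sketched as follows. This simplified version produces the 145 lag features plus the window statistics and a few raw calendar columns (152 columns in total; the paper's encoding of the calendar features brings the count to 173):

```python
import numpy as np

def build_candidate_features(load, hour_of_day, day_of_week):
    """One sample per hour t: lagged loads L(t-24)..L(t-168), the max/min/mean
    of that one-week window, and simple calendar columns."""
    lags = np.arange(24, 169)                        # 145 lag features
    idx = np.arange(lags.max(), len(load))           # hours with a full window
    lagged = load[idx[:, None] - lags[None, :]]      # shape (n_samples, 145)
    stats = np.column_stack([lagged.max(1), lagged.min(1), lagged.mean(1)])
    weekend = (day_of_week[idx] >= 5).astype(float)  # non-working-day flag
    calendar = np.column_stack([hour_of_day[idx], day_of_week[idx],
                                1.0 - weekend, weekend])
    return np.hstack([lagged, stats, calendar]), load[idx]

hours = 24 * 30                                      # one month of hourly data
load = 100.0 + 10.0 * np.sin(2 * np.pi * np.arange(hours) / 24)
hour_of_day = np.arange(hours) % 24
day_of_week = (np.arange(hours) // 24) % 7
X, y = build_candidate_features(load, hour_of_day, day_of_week)
# X has 145 + 3 + 4 = 152 columns in this sketch.
```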

Proposed STLF Process with Feature Selection
Figure 2 provides an overview of the proposed method, which covers the construction of the feature set, the dataset separation, the feature selection for the load with respect to the different hours, and the modeling for the different hours. Figure 2a shows the one-day structure of a sample. The inputs include 173 features, and the output is the predicted load.
The diagram of the proposed method is displayed in Figure 2b. The training set was separated into 24 training subsets corresponding to each hour. The features in each training subset were ranked in descending order according to their feature scores as computed by the feature selection method. Then, the optimal feature subset was selected using the predictor and the MAPE-based criterion. Finally, the modular predictor was constructed based on 24 predictors with the obtained optimal subsets.
The process of selecting the optimal feature subset in modeling is shown in Figure 2c. According to the ranked feature order, the predictor was used to test the feature subset consisting of the top i features, and the criterion based on the MAPE was used to select the optimal feature subset.
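The subset-growth loop of Figure 2c can be sketched as follows; the ordinary-least-squares `fit_predict` is a hypothetical placeholder for the SVR/BPNN/CART/GPR/RF predictors used in the paper:

```python
import numpy as np

def mape(y, yhat):
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def fit_predict(X_tr, y_tr, X_va):
    """Placeholder predictor: ordinary least squares with a bias column."""
    A = np.column_stack([X_tr, np.ones(len(X_tr))])
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([X_va, np.ones(len(X_va))]) @ w

def select_optimal_subset(ranked, X_tr, y_tr, X_va, y_va):
    """Add features one-by-one in ranked order; keep the subset whose
    validation MAPE is smallest (the criterion of Fig. 2c)."""
    best_err, best_k = np.inf, 0
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        err = mape(y_va, fit_predict(X_tr[:, cols], y_tr, X_va[:, cols]))
        if err < best_err:
            best_err, best_k = err, k
    return ranked[:best_k], best_err

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
y = 50 + 3 * X[:, 2] + 2 * X[:, 4] + 0.1 * rng.normal(size=300)
ranked = [2, 4, 0, 1, 3, 5]          # e.g. produced by an MI/CMI/RreliefF filter
subset, err = select_optimal_subset(ranked, X[:200], y[:200], X[200:], y[200:])
# The selected subset retains both truly relevant features (2 and 4).
```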

Dataset Split
The data used in this study were from New England [42] and Singapore [43]. The New England data were recorded every hour from 2011 to 2013 for a total of 26,304 data points. The Singapore data were recorded every half hour from 2014 to 2015 for a total of 35,040 data points. To apply the proposed method, the hourly load of Singapore was extracted to form a new load time series. The data used for training and testing the predictor consisted of the feature set (173 features) and the predictive object (the load corresponding to different hours), as shown in Figure 2.
Each dataset was split into three parts: a training set (14,616 New England samples, 11,712 Singapore samples), a validation set (2928 New England samples, 2094 Singapore samples), and a test set (8760 New England samples, 2904 Singapore samples). The training and the validation sets were used to build the predictor and to select an optimal feature subset. The test set was used to examine the performance of the feature subset and the predictor. Detailed information about the datasets is shown in Table 2.

Evaluation Criterion
To evaluate the performance of the proposed method, three criteria, the MAPE, the mean absolute error (MAE), and the root mean square error (RMSE), were used as follows:

MAPE = (100%/n) ∑_{i=1}^{n} |y_i − ŷ_i| / y_i  (23)

MAE = (1/n) ∑_{i=1}^{n} |y_i − ŷ_i|  (24)

RMSE = √( (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)² )  (25)

where y_i is the actual load, ŷ_i is the predictive load, and n is the number of predictive loads.
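The three criteria translate directly into code:

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def mae(y, yhat):
    """Mean absolute error, in the units of the load (MW)."""
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    """Root mean square error, in MW."""
    return np.sqrt(np.mean((y - yhat) ** 2))

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])
# mape = 5.0 %, mae = 20/3 MW, rmse = sqrt(200/3) MW for this toy pair.
```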


Results
The software used were MATLAB 2016b (Version 9.1.0.441655, MathWorks Inc., Natick, MA, USA) and R x64 3.3.2 (Version 3.3.2, GNU Project, developed at Bell Laboratories). It is noted that the CART algorithm in the rpart package in R identifies the part of the features whose total importance value is 100. The parameters of each predictor were set as follows. BPNN: the number of neurons in the hidden layer was N_neu = 2 × N_feature + 1, iterations T = 1000 [44]. SVR: the regularization parameter C = 1, the non-sensitive loss function ε = 0.1, the kernel width [15]. RF: m = N_feature/3 and N_Tree = 500 [16,23]. CART: no pruning parameter was set because the tree grows completely. GPR: the parameters of GPR were tuned by learning the training data.

Feature selection methods rate the importance of a feature by assigning a numerical value to represent the relation between the feature and the target. For example, the value of a feature computed by MI is called an MI value, while that computed by RF and CART is called its permutation importance. The term feature score is used for easy description. Parts of the normalized feature score curves computed by the different feature selection methods are shown in Figure 3. The feature score curves of typical hours (hour 5, hour 6, hour 10, and hour 11, when the valley and peak loads appear) were chosen for analysis. The feature score curves that used the same feature score calculation method were different at various hours. For example, the MI curves differed considerably for hour 5, hour 6, hour 10, and hour 11, and the features with the highest scores were different from each other (marked by a red circle).
The feature score reflects the importance of the feature to the target variable. When selecting a feature subset, the features with the highest scores should be retained and those with the lowest should be eliminated.
The top 10 features after ranking are shown in Table 3, where it is clear that the top 10 features for the same hour were similar. For example, for hour 5, the same top 10 features were selected by the various methods, such as F_L(t−24), F_L(t−25), F_L(t−26), and F_L(t−27), and similar features such as F_L(t−28), F_L(t−29), F_L(t−30), and F_L(t−31). However, there was an obvious difference between the features of hour 5 and hour 6, which may have been caused by the different load characteristics shown in Figure 1d. Therefore, a feature analysis for each hour is required to choose the best features for improving the accuracy of STLF.

Optimal Feature Subset Selection Process
According to the trend of the feature score curves of diverse feature selection methods, the first 36 to 50 features were chosen as the optimal features for modeling in [30]. By analyzing the autocorrelation of the lag variables, 50 features were selected for very-short-term load forecasting in [41]. However, most studies did not give a specific threshold for selecting the optimal feature subset. In this study, features ranked in descending order of feature score were added one-by-one to the feature subset, and the MAPE of the resulting subsets was used as the criterion for selecting the optimal feature subset.
Figure 4 shows the MAPE curves of the different feature selection methods and predictor-based feature selection processes. As shown in each subplot of Figure 4, the MAPE was reduced and reached a minimum value as the number of features increased. For example, for hour 5 with the MI ranking, the MAPEs of BPNN, CART, GPR, RF, and SVR when using the top feature were 4.587%, 4.743%, 4.618%, 5.196%, and 4.718%, respectively. When 20 features were used, the MAPEs were reduced to 3.901%, 4.555%, 4.008%, 4.160%, and 3.831%, respectively. The MAPEs of the different predictors decreased to different degrees, indicating that the 20 features made a positive contribution to building a better prediction model. A similar conclusion can be drawn by analyzing the other curves. The dimension of each optimal feature subset and its MAPE are marked by colored circles corresponding to the different predictors.
The following conclusions can be drawn from Table 3 and Figure 4: (1) The feature permutation estimated by the different feature selection methods varies.
(2) The dimension of the optimal feature subset and its MAPE depends on the predictor-based feature selection method.
(3) The optimal feature subset selected by the same predictor-based feature selection method for the predictive target of different hours is different.
Table 4 shows the MAPE and the dimension of the optimal feature subset when using MI as the feature selection method and RF as the prediction model. The table shows that from 1:00 to 6:00 the dimension of the optimal feature subset is smaller than from 7:00 to 19:00, and the same holds for the forecasting error. This is because people are less active at night, and there are fewer factors affecting the load than during the day.
The MAPE and the dimension of the optimal feature subset at 1:00 obtained by the different feature selection and forecasting methods are shown in Table 5. The MAPEs ranged from 3% to 4%, which means the performance of the forecasters was similar after feature selection. Analyzing the feature dimension, we found a large difference in the number of features in the optimal subsets selected by the different feature selection methods, caused by their different evaluation criteria. The details of the dimension of the optimal feature subset and its MAPE are shown in Appendix A, Tables A1 and A2. Based on a longitudinal comparison, the dimensions of the optimal feature subsets selected by the different feature selection methods with the same-hour predictors differed. For instance, the hour-2 MAPE calculated by the various methods ranged from 3.107% to 4.050%. The RreliefF + SVR combination had the smallest MAPE and a lower feature subset dimension.
By horizontal comparison, the dimension of the optimal feature subsets selected by the same feature selection method with the same-hour predictor varied. For example, the dimension of the feature subsets for different hours selected by the RreliefF + SVR method ranged from 13 to 109, and the MAPE ranged from 3.043% to 4.558%. In addition, the number of features for a night hour was smaller than for a day hour, indicating that the day load components were more complex and more difficult to forecast.
In conclusion, the characteristics of the load to be predicted vary from hour to hour; therefore, a dedicated feature set is needed to build the predictor for each hour. This verifies the necessity of the modular prediction structure with feature selection for the load of different hours.

Forecasting Result of Method Combinations with Optimal Feature Subsets for New England Load Data
To test the performance of the diverse method combinations with the optimal feature subsets, we used specific weeks for our experiment.
The effect of temperature on the loads in summer and winter is large, and severe fluctuations make accurate forecasting difficult. Therefore, two weeks were chosen randomly from the summer and winter of 2013 for testing. The summer period was from 28 July to 3 August, and the winter period was from 22 to 28 December. As shown in Figure 5, the predictive load of each combined method fit the true summer load well. The average errors of the various methods are shown in Table 6. The top-three combined methods were CART + SVR, RreliefF + RF, and RreliefF + SVR, and their MAPEs were 3.634%, 3.710%, and 4.204%, respectively. The predictive load of each combined method in winter is shown in Figure 6; each of the predicted loads matched the actual load except for Tuesday and Wednesday, which corresponded to Christmas day and the day before. As shown in Table 7, the top-three combined methods were RreliefF + SVR, CART + GPR, and CART + SVR, and their MAPEs were 4.207%, 4.754%, and 4.770%, respectively.

Load Data
To test the performance of the different method combinations with their optimal feature subsets, special weeks were used for the experiments.
The effect of temperature on loads in summer and winter is large, and severe fluctuations make accurate forecasting difficult. Therefore, two weeks were chosen randomly from the summer and winter of 2013 for testing: the summer week ran from 28 July to 3 August, and the winter week from 22 to 28 December. As shown in Figure 5, the predictive load of each combined method fit the true summer load well. The average errors of the various methods are shown in Table 6. The top three combined methods were CART + SVR, RreliefF + RF, and RreliefF + SVR, with MAPEs of 3.634%, 3.710%, and 4.204%, respectively. The predictive load of each combined method in winter is shown in Figure 6; each predicted load matched the actual load except on Tuesday and Wednesday, which corresponded to Christmas Day and the day before. As shown in Table 7, the top three combined methods were RreliefF + SVR, CART + GPR, and CART + SVR, with MAPEs of 4.207%, 4.754%, and 4.770%, respectively.
For the full verification of the different method combinations, the entire test set was used in a contrast experiment. The results of the different evaluation criteria for the proposed forecasting approach applied to the 25 method combinations are presented for day-ahead load forecasting in Table 8. The forecast errors of the different methods varied. For example, the error of the MI-based SVR was close to that of the GPR: the MAPEs of the MI-based SVR and GPR were 4.872% and 4.785%, the RMSEs were 1196.775 MW and 1141.372 MW, and the MAEs were 773.447 MW and 755.325 MW, respectively. Based on these observations, the forecast error of the SVR with any feature selection method was below 5% (marked in bold) except with RF. In addition, the MAPEs of GPR with CMI and RF were below 5% as well. By comparison of the results, the RreliefF + SVR method showed the best performance with the lowest MAPE.
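The three evaluation criteria reported throughout (MAPE, MAE, and RMSE) can be computed as in the following minimal NumPy sketch; the function names are of our choosing.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the load (e.g., MW)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the load (e.g., MW)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, actual loads of 10000 MW and 12000 MW forecast as 10500 MW and 11400 MW give a MAPE of 5.0% and an MAE of 550 MW.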

Load Forecasting for Singapore
To further verify the applicability of the proposed approach, load data from Singapore were used to perform the load forecasting experiments.

Feature Selection for Hour Loads
First, using the same method as in Section 6.1.1, the score of each feature with respect to the predictive target at different hours was computed by the different feature selection methods. Then, the optimal feature subset was obtained based on the MAPE of the different subsets forecast by a predictor.
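The feature-scoring step can be illustrated with the MI criterion. The paper does not specify its MI estimator, so the sketch below uses a simple histogram-based estimate (the `bins` value and all variable names are our assumptions); CMI and RreliefF scores would slot into the same ranking step.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of the mutual information I(X;Y), in nats,
    between a candidate feature x and the hourly load target y."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Score and rank hypothetical candidate features against the load at one hour.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # 5 synthetic features
y = 2.0 * X[:, 2] + 0.1 * rng.normal(size=1000)  # load driven by feature 2
scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]               # descending relevance
```

In this synthetic example the feature that actually drives the target receives by far the largest score and heads the ranking.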
Table 9 shows the MAPE and the dimension of the optimal feature subset obtained using MI as the feature selection method and RF as the prediction model. The table shows that from 1:00 to 7:00 the dimension of the optimal feature subset is smaller than from 8:00 to 19:00, and the forecasting error follows the same pattern. Similar to the analysis of Table 4, this is because people are less active at night, so fewer factors affect the load than during the day. The MAPE and the dimension of the optimal feature subset at 1:00 obtained by the different feature selection and forecasting methods are shown in Table 10. The MAPEs lie between 1.0% and 1.6%, which means the performances of the forecasters were similar after feature selection. Analyzing the feature dimension, however, reveals a large difference in the number of features selected by the different feature selection methods, which is caused by their different evaluation criteria. As shown in Appendix A, Tables A3 and A4, considering both the MAPEs and the dimensions, the optimal feature subsets were used for the load forecasting of the Singapore data. Similar to the conclusion summarized in Table 4, the optimal feature subsets differed across the feature selection methods and forecasters.
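The subset search described above (grow the subset along the filter ranking and keep the dimension with the lowest validation MAPE) can be sketched as follows. A plain least-squares model stands in for the paper's RF predictor; the function names and the train/validation split are our assumptions.

```python
import numpy as np

def subset_mape(X_tr, y_tr, X_va, y_va, cols):
    """Fit a simple least-squares predictor (a stand-in for the paper's RF)
    on the chosen feature columns and return the validation MAPE (%)."""
    A_tr = np.c_[X_tr[:, cols], np.ones(len(X_tr))]  # add a bias column
    A_va = np.c_[X_va[:, cols], np.ones(len(X_va))]
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    pred = A_va @ w
    return 100.0 * np.mean(np.abs((y_va - pred) / y_va))

def optimal_subset(X_tr, y_tr, X_va, y_va, ranking):
    """Grow nested subsets along the filter ranking and keep the
    dimension with the lowest validation MAPE."""
    best_cols, best_err = None, np.inf
    for d in range(1, len(ranking) + 1):
        cols = list(ranking[:d])
        err = subset_mape(X_tr, y_tr, X_va, y_va, cols)
        if err < best_err:
            best_cols, best_err = cols, err
    return best_cols, best_err
```

Run once per hour (and per filter/predictor pair), this yields the hour-specific optimal subsets and MAPEs reported in Tables 9 and 10.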

Forecasting Results of Method Combinations with Optimal Feature Subsets for Singapore Load Data
To test the performance of the diverse combined methods with their optimal feature subsets, the data of special weeks were used for the experiment.
Two weeks were chosen randomly from the summer and winter of 2015 for testing, as shown in Figures 7 and 8: the summer week included the days from 21 to 27 June, and the winter week the days from 8 to 14 November. The summer results are shown in Figure 7 and Table 11. It was found that the GPR, RF, and SVR methods showed a better ability to forecast the summer loads. The MAPEs of the combinations MI + GPR, CMI + GPR, RF + GPR, RreliefF + GPR, CMI + SVR, and RreliefF + SVR were less than 1.5%. The outstanding combined method was RreliefF + GPR, whose MAPE was 1.402%, MAE was 74.400 MW, and RMSE was 93.092 MW. Observing Figure 8 and Table 12 for the winter week, the RreliefF + GPR method again showed the best performance, with an MAPE of 3.567%, an MAE of 200.711 MW, and an RMSE of 224.017 MW; the predictive results of GPR and SVR with the different feature selection methods were better than those of the CART, BPNN, and RF methods.
To further verify the superiority of the proposed method based on the feature subsets of different hours, the entire test data set from Singapore was used for validation. Detailed information about the test data is given in Table 2 in Section 5.2. Table 13 shows the average predictive errors of the different combined methods. It indicates that SVR combined with MI, CMI, RF, and RreliefF achieved the minimum errors, with MAPEs of 1.471%, 1.440%, 1.387%, and 1.373%, respectively. Of all the combined methods, the RreliefF + SVR method worked best, with an MAPE of 1.373%, an MAE of 75.118 MW, and an RMSE of 147.585 MW. By analyzing the load forecasting results for Singapore, the combination of RreliefF and SVR was the most accurate method.

Comparison of Forecasting Methods without Feature Selection for New England and Singapore
In this section, a comparison of the proposed method and the traditional method (which builds only a single predictor for forecasting, without feature selection) was carried out on the New England and Singapore data to verify the necessity of forecasting with a modular predictor.
The histograms of the error and of the training time of the different forecasting methods on the New England data are displayed in Figure 9. As shown in Figure 9a, the MAPE of the SVR that adopted the proposed method was almost half that of the SVR using the traditional method. The MAPEs of the other predictors employing the proposed method without the feature selection step also decreased, to different degrees, compared with the same predictors employing the traditional method. Analyzing the MAE in Figure 9b and the RMSE in Figure 9c leads to a similar conclusion. In addition, the model training time of the proposed method decreased because of the smaller training set per model. Therefore, the decreased error and training time reflect the advantages of the proposed method and confirm the necessity of employing a modular predictor. The errors with the original feature set are compared in Table 14. The results for New England indicate that the MAPE values of CART, RF, SVR, BPNN, and GPR with the proposed method were reduced by 0.182%, 2.253%, 4.294%, 1.953%, and 3.775%, respectively, compared with CART, RF, SVR, BPNN, and GPR with the traditional approach. Similarly, the results for Singapore also verified the superior performance of the proposed method.

Comparison of Forecasting Approaches with Feature Selection for New England and Singapore
A comparison between the proposed method and the traditional method, both with feature selection, was performed on the New England and Singapore datasets. The results of the proposed method with feature selection are shown in Table 8 (New England) and Table 13 (Singapore), and the results of the traditional method with feature selection are shown in Table 15. The results indicate that the error was reduced to different degrees by adopting the proposed method. The largest reductions in MAPE resulted from the CMI + SVR and CART + BPNN methods, with MAPEs of 2.799% and 3.072%, respectively. The minimum error was achieved by the RreliefF + SVR combination, with MAPEs of 4.746% (New England) and 1.373% (Singapore). In conclusion, the forecast results obtained by the proposed method were better than those of the traditional method regardless of the predictor used, and the most accurate combined method was RreliefF + SVR.
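The modular structure compared above can be sketched as one independent model per hour, each restricted to that hour's optimal feature subset. A least-squares model again stands in for the paper's forecasters (CART/RF/SVR/BPNN/GPR), and the class and its interface are our invention for illustration.

```python
import numpy as np

class ModularPredictor:
    """24 independent hourly predictors, each trained only on samples of
    its own hour and on its own optimal feature columns."""

    def __init__(self, feature_subsets):
        # feature_subsets[h] = column indices selected for hour h (0..23)
        self.feature_subsets = feature_subsets
        self.weights = {}

    def fit(self, X, y, hour):
        # Train each hourly model on the subset of rows for that hour.
        for h in range(24):
            rows = hour == h
            A = np.c_[X[rows][:, self.feature_subsets[h]], np.ones(rows.sum())]
            self.weights[h], *_ = np.linalg.lstsq(A, y[rows], rcond=None)

    def predict(self, X, hour):
        # Route each sample to the predictor for its hour.
        out = np.empty(len(X))
        for h in range(24):
            rows = hour == h
            A = np.c_[X[rows][:, self.feature_subsets[h]], np.ones(rows.sum())]
            out[rows] = A @ self.weights[h]
        return out
```

Because each hourly model sees only about 1/24 of the training rows, training is also cheaper per model, which is consistent with the reduced training times in Figure 9.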

Conclusions
Accurate day-ahead load forecasting enhances the stability of grid operations and improves the social benefits of power systems. To improve the accuracy of day-ahead load forecasting, a novel modular parallel forecasting model with feature selection was proposed. Load data from New England and Singapore were used to test the proposed method. The experimental results show the advantages of the proposed method as follows: (1) A modular predictor consisting of 24 independent predictors can efficiently capture the load characteristics of different hours and thereby avoid the inaccurate analysis of a single predictor.
(2) The feature selection adopted for the load at each hour analyzes the relevance between each feature and that specific load. Each optimal feature subset, of a different dimension, benefits the building of a more accurate predictor.
(3) To serve the demand of dispatch departments in different regions, the interval time p = 24 was chosen for structuring a general candidate feature set that meets the requirements of the power system.

Figure 1. The power load of New England.

Figure 2. Overview of the proposed method.

6.1. Load Forecasting for New England
6.1.1. Feature Selection for Different-Hour Loads
Feature Score for Feature Analysis

Figure 3. Normalized feature score of the features evaluated by the different feature selection methods.

Figure 4. Mean absolute percentage error (MAPE) curves of combinations of feature selection methods and forecasting methods for selecting the feature subset.

Figure 9. Comparison of the error and training time of models with the traditional and proposed approaches.

Table 1. The feature information.

Table 2. Experimental data description.

Table 3. Top 10 features ranked by the different feature selection methods, corresponding to Figure 3.

Table 4. Optimal feature subset construction of different hours with mutual information (MI) + random forest (RF) for New England. Remark: FD means the feature dimension.

Table 5. Optimal feature subset construction of 1:00 with different methods for New England.

Table 6. Comparison of different combined methods.

Table 7. Comparison of different combined methods.

Table 8. Error of load forecasting of different methods with the proposed forecasting approach for the whole test set.

Table 9. Optimal feature subset construction of different hours with MI + RF for Singapore.

Table 10. Optimal feature subset construction of 1:00 with different methods for Singapore.

Table 11. Comparison of different combined methods. Prediction from 8 to 14 November 2015.

Table 12. Comparison of different combined methods.

Table 13. Error of load forecasting of different methods with the proposed forecasting strategy for the whole test set.

Table 14. Comparison of the error of different forecasting approaches with the original feature set.

Table 15. Error of load forecasting of different methods with the traditional forecasting approach for the whole test set.

Table A1. Optimal feature subset construction of different hours from 1:00 to 12:00 with different methods for New England.

Table A2. Optimal feature subset construction of different hours from 13:00 to 24:00 with different methods for New England.

Table A3. Optimal feature subset construction of different hours from 1:00 to 12:00 with different methods for Singapore.

Table A4. Optimal feature subset construction of different hours from 13:00 to 24:00 with different methods for Singapore.