Short-Term Load Forecasting Using EMD-LSTM Neural Networks with an Xgboost Algorithm for Feature Importance Evaluation

Abstract: Accurate load forecasting is an important issue for the reliable and efficient operation of a power system. This study presents a hybrid algorithm that combines similar days (SD) selection, empirical mode decomposition (EMD), and long short-term memory (LSTM) neural networks to construct a prediction model (i.e., SD-EMD-LSTM) for short-term load forecasting. The extreme gradient boosting-based weighted k-means algorithm is used to evaluate the similarity between the forecasting and historical days. The EMD method is employed to decompose the SD load into several intrinsic mode functions (IMFs) and a residual. Separate LSTM neural networks are employed to forecast each IMF and the residual. Lastly, the forecasting values from each LSTM model are reconstructed. Numerical testing demonstrates that the SD-EMD-LSTM method can accurately forecast the electric load.


Introduction
Short-term load forecasting (STLF), which ranges from one hour to one week ahead, plays an important role in the control, power security, market operation, and scheduling of reasonable dispatching plans for smart grids. However, achieving high accuracy is difficult because of the complicated effects of a variety of attributes on the load.
Over the past few decades, scholars have developed many approaches to improve the accuracy of STLF, which can mainly be divided into three categories, namely, traditional, similar day (SD), and artificial intelligence (AI)-based methods. Traditional methods are based on mathematical models, including multiple linear regression [1], stochastic time series [2], exponential smoothing [3], and knowledge-based methods [4]. Traditional methods often perform poorly at nonlinear forecasting, and STLF is a nonlinear problem. Accordingly, the prediction accuracy of traditional methods is insufficient for STLF.
The SD method is based on the selection of historical days that have features similar to those of the forecasted days [5][6][7][8][9]. Mandal et al. [7] selected SDs based on the Euclidean norm of the factors between historical and forecasted days. Chen et al. [8] required SDs to have the same weekday index and similar weather to the forecasted days. Mu [9] applied a weighted average model to the historical days to determine the influence of the most similar days on the forecasted day. However, this method alone cannot achieve sufficiently high prediction accuracy. The selection of input variables plays a crucial role when modelling time series and thus should be treated as a generalization problem. Arahal [10] proposed a method that calculates a difference index for all variables.
The main contributions of this study are summarized as follows:
1. Although temperature, humidity, and day type have been extensively used as input features in STLF, we also recognize that STLF is sensitive to the day-ahead peak load, which should therefore be a supplemental input feature to the SD selection and LSTM training processes.
2. Extending our previous work on data analysis, we independently learn the feature candidate weights for the SD selection framework based on the Xgboost algorithm to overcome the dimensionality limitation in clustering. Thus, the proposed Xgboost-based k-means framework can handle SD selection tasks beyond pure clustering.
3. Numerical testing demonstrates that data decomposition-based LSTM neural networks can outperform most well-established forecasting methods on the longer-horizon load forecasting problem.
The rest of this paper is organized as follows. Section 2 discusses the factors that affect electricity forecasting, including temperature, day-type, and day-ahead peak load factors. Section 3 presents a generic SD selection framework that combines the Xgboost and k-means algorithms. Section 4 presents the forecasting framework, which combines the EMD and LSTM neural networks. Section 5 presents the experimental design and numerical test results. Lastly, Section 6 provides the conclusions of this study.

Data Analysis
The analysis of the relationship between the load data and the external variables that affect the electric load is necessary to achieve high forecasting accuracy. This analysis is based on the electricity load data (provided by ISO New England) measured at one-hour intervals from 2003 to 2016. This section describes the major load-affecting factors, including the temperature and day-type index. We also analyze the relationship between the daily and day-ahead peak loads.
Evidently, temperature changes are the primary cause of electricity load changes. In particular, the temperature variation range often determines the variation range of the electricity load. The variation in the interval-valued load with respect to the interval-valued temperature is shown in Figure 1. In the summer season, the higher the temperature is, the larger the electricity load value becomes (see Figure 1a). That is, a positive correlation exists between the load and temperature. By contrast, this correlation becomes negative in the winter season (Figure 1b). The preceding analysis indicates the necessity of discussing the effect of temperature on electricity load from one season to another.
Different day-types have different daily load curves, and the loads of different day-types, such as weekends, holidays, and working days, also differ. The load on a working day is often higher than that on a weekend due to the decrease in industrial load on weekends. Accordingly, the load on Saturdays is lower compared with those on other days (see Figure 2). Mondays and Tuesdays typically have the largest energy consumption over the week, whereas non-working days have considerably low energy consumption. Therefore, the day-type is an important feature that cannot be ignored.
Although we have already identified several features that affect load forecasting, prediction errors may still be large during peak hours in the STLF process. Thus, we suppose that the day-ahead peak load is an important feature for forecasting. Figure 3 is a scatter plot of the day-ahead peak load against the daily load from 1 March 2003 to 31 October 2016. These two variables are closely related, with a correlation coefficient of 0.8754. This result confirms the necessity of the day-ahead peak load as a supplemental input feature to the SD selection and LSTM training processes. Precipitation and wind speed also have a bearing on the electricity load; the load on a sunny day is significantly higher than that on a rainy day. Therefore, the prediction accuracy can be improved by selecting SDs and making full use of the historical and feature data.
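To make the peak-load relationship concrete, the sketch below computes a Pearson correlation coefficient in the same spirit on synthetic numbers; the variable names and values are illustrative stand-ins, not the ISO New England data:

```python
import numpy as np

# Synthetic stand-ins for the two series compared in Figure 3:
# day-ahead peak load and daily load (illustrative values only).
rng = np.random.default_rng(0)
peak = rng.uniform(15, 25, size=365)                  # day-ahead peak load
daily = 0.8 * peak + rng.normal(scale=1.0, size=365)  # strongly related daily load
r = np.corrcoef(peak, daily)[0, 1]                    # Pearson correlation
```

A coefficient near the paper's reported 0.8754 would similarly justify adding the day-ahead peak load as a supplemental input feature.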

Similar Day Selection: Improved K-Means with Extreme Gradient Boosting
If exogenous features, such as temperature, are included, then the traditional load forecasting model could lead to slow convergence and poor prediction accuracy. Thus, we select the SD load as the input data to improve the prediction power.
Clustering based on the feature values of the data, with similar samples gathered in the same cluster, can substantially improve the selection of SDs for the forecasting day. The performance of the clustering algorithm depends on the distance between records. "It is misleading to calculate the distance by measuring all attributes equally. The distance between neighbors will be dominated by the large number of irrelevant attributes, which sometimes leads to the dimensionality curse" [30]. An effective method to overcome this problem is to add a weight for each feature. Hence, the more relevant the feature is, the larger the impact of this feature becomes on the clustering results.
This section presents an alternative to SD selection that calculates the weights of the features using the Xgboost algorithm and integrates the weighted features using the k-means clustering.

Feature-Weight Learning Algorithm: Extreme Gradient Boosting
Xgboost [31] is an improved algorithm based on the gradient boosting decision tree; it can construct boosted trees efficiently and operate in parallel. The boosted trees in Xgboost are divided into regression and classification trees. The core of the algorithm is the optimization of the value of the objective function.
Unlike methods that use feature vectors directly to calculate the similarity between the forecasting and historical days, gradient boosting constructs the boosted trees to obtain feature scores that indicate the importance of each feature to the training model. The more a feature is used to make key decisions within the boosted trees, the higher its score becomes. The algorithm computes importance by "gain", "frequency", and "cover" [32]. Gain is the main reference factor for the importance of a feature in the tree branches. Frequency, a simplified version of gain, is the number of times a feature occurs in all constructed trees. Cover is the relative value of a feature's observations. In this study, the feature importance is measured by "gain".
For a single decision tree T, Breiman et al. [33] proposed a score of importance for each predictor feature $X_\ell$. The decision tree has J - 1 internal nodes, and at every node t it partitions the region into two subregions using a prediction feature $X_{v(t)}$. The selected feature is the one that provides the maximal estimated improvement $\hat{\tau}_t^2$ in the squared-error risk over that of a constant fit on the entire region. The squared importance of feature $X_\ell$ is the sum of such squared improvements over the J - 1 nodes at which it was selected as the splitting feature:

$I_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\tau}_t^2 \, \mathbb{1}(v(t) = \ell)$

Averaging over the additive expansion of M trees yields the overall importance:

$I_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} I_\ell^2(T_m)$
The importance of a feature depends on whether the prediction performance changes considerably when that feature is replaced with random noise. Given the data analysis in the previous section, we take several features as input to the Xgboost algorithm to calculate the feature importance with respect to the electricity load. We can thus obtain how much each feature contributes to the prediction performance during the training of the Xgboost algorithm. Evidently, the electricity load is sensitive to the temperature variables (see Figure 4). Moreover, the supplemental feature (i.e., day-ahead peak load) is important for load forecasting. This conclusion is consistent with the results of the data analysis. We have now derived the importance values of all features, which will be used as a priori knowledge in the subsequent clustering algorithm.
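The gain notion can be illustrated with a minimal pure-Python sketch rather than the Xgboost library itself: for each feature, we take the best single-split reduction in squared error (the improvement at a node) and normalize across features. The function names and toy data here are hypothetical, and this is a simplified stand-in for Xgboost's gain importance, not its implementation:

```python
import numpy as np

def best_split_gain(x, y):
    """Largest reduction in squared error obtainable by one split on feature x."""
    ys = y[np.argsort(x)]                       # responses ordered by feature value
    sse_root = np.sum((ys - ys.mean()) ** 2)    # error of a constant fit
    best = 0.0
    for i in range(1, len(ys)):
        left, right = ys[:i], ys[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        best = max(best, sse_root - sse)        # improvement from this split
    return best

def gain_importance(X, y):
    """Normalized per-feature gain, a toy analogue of gain-based importance."""
    gains = np.array([best_split_gain(X[:, j], y) for j in range(X.shape[1])])
    return gains / gains.sum()

# Toy data: the first feature drives the target, the second is pure noise.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, size=200)
noise = rng.normal(size=200)
load = 2.0 * temp + rng.normal(scale=1.0, size=200)
w = gain_importance(np.column_stack([temp, noise]), load)
```

The relevant feature receives nearly all of the normalized gain, mirroring how Figure 4 ranks temperature and the day-ahead peak load highly.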

K-Means Clustering Based on Feature-Weight
K-means, which was first proposed by MacQueen in 1967 [34], is extensively applied in many fields but is sensitive to the selection of the initial cluster centroids. We select the initial cluster centers with the maximum-distance method to diminish the probability of converging to a local optimum. This section improves the k-means clustering by computing the initial cluster centers and utilizing a new distance calculation method. The steps are presented as follows.
1. Given a data set X = {x_1, x_2, ..., x_n} and an integer value K, the data set is normalized as follows:

$x_i' = \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}}$

where $x_i^{\min}$ and $x_i^{\max}$ denote the minimum and maximum values, respectively, of each input factor.
2. The forecasting day is selected as the first center $u_0$.
3. The next center $u_j$ is selected, where $u_j$ is the farthest point from the previously selected cluster centers $\{u_0, u_1, ..., u_{j-1}\}$. Steps 2 and 3 are repeated until the K centers have been identified.
4. The feature weights are calculated using the Xgboost algorithm. Thereafter, the weights are attributed to each feature, providing them with different levels of importance. Let $w_p$ be the weight associated with feature p. The weighted norm is

$d(x, y) = \sqrt{\sum_p w_p (x_p - y_p)^2}$
(1) Each data point is assigned to the nearest cluster.
(2) The clusters are updated by recalculating the cluster centroids. The algorithm repeatedly executes (1) and (2) until convergence is reached.
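The steps above can be sketched in a few lines of numpy; the function names are ours, and the feature weights w stand in for the Xgboost-derived values:

```python
import numpy as np

def weighted_kmeans(X, k, w, n_iter=100):
    """K-means with per-feature weights and max-distance initialization."""
    def dist2(points, center):
        # weighted squared Euclidean distance (the norm with weights w_p)
        return ((points - center) ** 2 * w).sum(axis=-1)

    # Max-distance initialization: start from the first point, then
    # repeatedly pick the point farthest from all chosen centers.
    centers = [X[0]]
    while len(centers) < k:
        d = np.min([dist2(X, c) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # (1) assign each point to the nearest cluster
        labels = np.argmin([dist2(X, c) for c in centers], axis=0)
        # (2) update clusters by recalculating the centroids
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Two well-separated toy clusters with equal feature weights.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, size=(50, 2)),
               rng.normal(5, 0.1, size=(50, 2))])
labels, centers = weighted_kmeans(X, 2, np.ones(2))
```

In the paper's setting, X would hold the normalized daily feature vectors and X[0] the forecasting day, so that SDs gather in the forecasting day's cluster.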
The key idea in selecting SDs is to determine the attribute weights using the Xgboost algorithm and to calculate the distance between the selected day and the forecasting day by measuring different attributes with different weights. In Figure 5, the horizontal axis presents the time (hour), whereas the vertical axis presents the load curves. A color change from light to dark means that the electric load values change from large to small. Figure 5a shows the heat map for the original load data set, where every curve is evidently different in shape. Figure 5b,c show the heat maps for the original load data after simple k-means clustering and weighted k-means clustering, respectively. Our proposed Xgboost-k-means method can merge SDs into one cluster more effectively than the simple k-means algorithm does. Therefore, the SDs can serve as the input data for the subsequent load forecasting.

LSTM with Empirical Mode Decomposition
Neural networks are extensively employed in time-series forecasting. However, determining their structure is difficult, and training often falls into a local minimum. The EMD method can facilitate determining the characteristics of complex non-linear or non-stationary time series; i.e., it can divide the singular values into separated IMFs and determine the general trend of the real time series. This can effectively reduce the unnecessary interactions among singular values and improve the performance when a single kernel function is used in forecasting. Thus, this section proposes a model that combines the EMD and LSTM neural networks for STLF.

Empirical Mode Decomposition
EMD is a signal processing method proposed by Huang et al. in 1998 [26]. The original signal is decomposed by EMD, according to the data's own characteristics, into intrinsic mode functions (IMFs). Thus, EMD can effectively decompose the singular values and avoid trapping into a local optimum, thereby improving the performance and robustness of the model.
All IMFs must meet the following conditions:
a. For a set of data sequences, the number of extremal points must be equal to the number of zero crossings or, at most, differ by one.
b. For any point, the mean value of the envelopes of the local maxima and local minima must be zero.
For the original signal x(t), EMD decomposes x(t) through the "sifting" process, which is described as follows.
1. Identify all the maxima and minima of signal x(t).
2. Fit the upper envelope u(t) and lower envelope l(t) of signal x(t) through cubic spline interpolation. The mean of the two envelopes is the average envelope curve:

$m_1(t) = \frac{u(t) + l(t)}{2}$

3. Subtract $m_1(t)$ from x(t) to obtain an IMF candidate:

$h_1(t) = x(t) - m_1(t)$

4. If $h_1(t)$ does not satisfy the two conditions of an IMF, take $h_1(t)$ as the original signal and repeat the above calculation k times, so that

$h_{1k}(t) = h_{1(k-1)}(t) - m_{1k}(t)$

where $h_{1(k-1)}(t)$ and $h_{1k}(t)$ denote the signal after sifting k - 1 times and k times, respectively, and $m_{1k}(t)$ is the average envelope of $h_{1(k-1)}(t)$.
5. If $h_{1k}(t)$ satisfies the conditions of an IMF, define it as $c_1(t)$. The stopping criterion uses the standard deviation between consecutive sifting results:

$SD = \sum_{t} \frac{|h_{1(k-1)}(t) - h_{1k}(t)|^2}{h_{1(k-1)}^2(t)}$

6. Subtract $c_1(t)$ from x(t) to obtain the new signal $r_1(t)$:

$r_1(t) = x(t) - c_1(t)$

7. Repeat steps 1 to 6 until $r_n(t)$ can no longer be decomposed into an IMF; $r_n(t)$ is the residual of the original data x(t). Finally, the original signal x(t) can be represented as a collection of n components $c_i(t)$ (i = 1, 2, ..., n) and a residual $r_n(t)$:

$x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t)$

The preceding steps show that the EMD method decomposes the SD load into low- and high-frequency components. Figure 6 shows the decomposition into eight extracted IMFs and a residual. Furthermore, all graphs in Figure 6 are shown at the same scale, enabling assessment of the contribution of each extracted IMF.
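The sifting procedure can be sketched in numpy. For brevity, this sketch builds envelopes by linear interpolation of the extrema instead of the cubic splines used in the text, and it stops on a simple envelope-mean threshold rather than the SD criterion, so it illustrates the idea rather than being a faithful EMD implementation:

```python
import numpy as np

def sift(x, max_iter=50):
    """Extract one IMF candidate from x; return None if x is a residual."""
    t = np.arange(len(x))
    h = x.copy()
    for _ in range(max_iter):
        # step 1: locate interior maxima and minima
        maxima = np.where((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:]))[0] + 1
        minima = np.where((h[1:-1] < h[:-2]) & (h[1:-1] < h[2:]))[0] + 1
        if len(maxima) < 2 or len(minima) < 2:
            return None                           # too few extrema: residual
        upper = np.interp(t, maxima, h[maxima])   # step 2: envelopes
        lower = np.interp(t, minima, h[minima])
        m = (upper + lower) / 2.0                 # mean envelope
        if np.max(np.abs(m)) < 1e-8 * np.max(np.abs(h)):
            break                                 # simplified stopping rule
        h = h - m                                 # step 3: subtract the mean
    return h

def emd(x, max_imfs=8):
    """Decompose x into IMFs plus a residual; the parts sum back to x."""
    imfs, r = [], x.copy()
    for _ in range(max_imfs):
        c = sift(r)
        if c is None:
            break
        imfs.append(c)
        r = r - c
    return imfs, r

# A fast oscillation riding on a slow trend separates cleanly.
t = np.linspace(0.0, 1.0, 512)
x = np.sin(2 * np.pi * 20 * t) + t
imfs, residual = emd(x)
```

By construction the IMFs and residual reconstruct the original series exactly, which is what later allows the per-IMF LSTM forecasts to be recombined.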

Lstm-Based Rnn for Electric Load Forecasting
LSTM was proposed by Hochreiter et al. in 1997 [35] as an efficient RNN architecture and has been extensively applied in various fields. Moreover, LSTM is a popular time-series forecasting model and can expertly handle data with long-term dependencies.
A. Recurrent Neural Networks (RNNs)
RNNs are designed to operate on non-linear time-varying problems [24]. The internal connections of an RNN enable signals to travel forward and backward, making RNNs substantially suitable for time-series prediction.
RNNs can mine rules from time sequences to predict data that have yet to occur [36,37]. This characteristic lies in the feedback connections, which facilitate updating the weights based on the residual in each forward step (Figure 7). The forecasting-day load in STLF is bound up with the SD load; therefore, given the SD time sequences, obtaining high accuracy on the forecasting day becomes possible, and the RNN proves suitable for this problem [38]. However, RNNs tend to suffer heavily from vanishing gradients, and gradients may also grow indefinitely and eventually cause the network to break down. Therefore, simple RNNs may not be the ideal option for forecasting problems with long-term dependencies.

B. LSTM-Based RNN Forecasting Scheme
LSTM was mainly motivated and designed to overcome the vanishing gradient problem of the standard RNN when dealing with long-term dependencies. The LSTM model adds an input gate, an output gate, and a forget gate to the neurons of an RNN. Such a structure can effectively mitigate the vanishing gradient problem [39], making LSTM an architecture suitable for problems with long-term dependencies.
The major innovation of LSTM is its memory cell, which essentially acts as an accumulator of state information. First, as shown in Figure 8, the forget gate decides what information to discard from the cell state. A sigmoid function is used to calculate the activation of the forget gate $f_t$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (11)

The second step is to determine what new information should be stored in the cell state. A sigmoid layer, named the "input gate layer," decides which values should be updated, and a tanh layer creates a vector $\tilde{c}_t$ of new candidate values for the next state:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (12)
$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ (13)

Next, the old cell state $c_{t-1}$ is updated to the new cell state $c_t$. We multiply $c_{t-1}$ by $f_t$ to discard the old information and then add $i_t \ast \tilde{c}_t$, the new candidate values scaled by how much each state value should be updated:

$c_t = f_t \ast c_{t-1} + i_t \ast \tilde{c}_t$ (14)

Lastly, we need to decide the output, which has two parts: a sigmoid layer acting as the output gate first filters the cell state; then the cell state is passed through $\tanh(\cdot)$ and multiplied by the output gate $o_t$ to produce the desired information:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (15)
$h_t = o_t \ast \tanh(c_t)$ (16)
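A single LSTM cell update following the gate descriptions above can be written directly in numpy; the dictionary keys, toy dimensions, and random weights are our own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One memory-cell update: forget, input, state, and output gates."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate activation
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate activation
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate activation
    h_t = o_t * np.tanh(c_t)                 # hidden state
    return h_t, c_t

# Toy sizes: 3 input features, 4 hidden units; small random weights.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):       # unroll five time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the output gate lies in (0, 1) and tanh of the cell state lies in (-1, 1), each hidden activation stays bounded, which is part of why the gated cell trains stably over long sequences.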
In Equations (11)-(16), each gate is parameterized by its own weight matrix and bias vector.
This study applies separate LSTM neural networks to forecast each IMF and the residual of the SD load. The training inputs include the temperature, day-ahead peak load, humidity, day-type index, precipitation, wind speed, and the corresponding IMF component of the SD load. The model framework is shown in Figure 9. To further improve the accuracy and practicality of the prediction model, we build a sequence-to-sequence (S2S) architecture based on LSTM. The S2S structure can flexibly adjust the lengths of the input and output sequences, which makes it appropriate for load forecasting at different time scales.
Standard backpropagation can be applied to train the network using a gradient-based method called stochastic gradient descent (SGD). Table 1 shows the mean absolute percentage error (MAPE) on the training and testing datasets for different numbers of layers and units using the S2S architecture for one-day-ahead forecasting.
The proposed architecture produces very low errors on the training dataset. Increasing the capacity of the network by adding layers and units, however, only improves the error on the training dataset. The model performs well on the training dataset using a 2-layer network with 50 units in each layer, but increasing the capacity does not improve performance on the testing data. To improve accuracy on the testing data, dropout is used as a regularization method.

Numerical Experiments
This section presents the forecasting performance of the proposed SD-EMD-LSTM model. The hourly electric load data of ISO New England (NE-ISO) from 2003 to 2016 are employed for the models. The forecasting is conducted at two time scales, namely, one day ahead (24 h) and one week ahead (168 h).
First, we present experiments applying the weighted k-means-based SD selection algorithm for load forecasting and analyze the optimal number of clusters k. Second, we verify the clustering effect of the proposed SD selection method and the need for the supplemental feature. Third, experiments at the two time scales are conducted to compare the proposed model with the standalone LSTM, SD-LSTM, and EMD-LSTM models to show the fitting effect of the hybrid model. Lastly, we compare the forecasting performance with three other models (i.e., ARIMA, BPNN, and SVR) to illustrate the forecasting accuracy and stability of the SD-EMD-LSTM model. The BPNN model comprises three layers, viz., input, hidden, and output layers (6-20-1), where the transfer functions of the hidden and output layers are tansig and purelin, respectively; the training function is traingdm, and the learning function for thresholds and weights is learngdm. SVR is implemented with the LIBSVM package with C = 8.4065, γ = 0.0869335, and ε = 0.000118.

Evaluation Indices for the Forecasting Performance
The mean absolute percentage error (MAPE) is employed as a criterion of error evaluation to analyze the forecasting performance.
$\mathrm{MAPE} = \frac{100\%}{m} \sum_{j=1}^{m} \left| \frac{X_j - \hat{X}_j}{X_j} \right|$

where $\hat{X}_j$ is the forecasting value, $X_j$ is the actual value, and m is the total number of forecasting points.
For the two forecasting time scales, m is set at 24 h and 168 h.
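The criterion is straightforward to implement; a plain-Python version with a four-point toy horizon (illustrative numbers, not the paper's data):

```python
def mape(actual, forecast):
    """Mean absolute percentage error over m forecasting points."""
    assert len(actual) == len(forecast) and len(actual) > 0
    return 100.0 / len(actual) * sum(
        abs((a - f) / a) for a, f in zip(actual, forecast)
    )

# Errors of 10%, 5%, 0%, and 5% average to a MAPE of 5%.
print(mape([100, 200, 400, 800], [110, 190, 400, 840]))  # 5.0
```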

Empirical Results and Analysis
We perform simulations of four examples to verify the predictive ability of the proposed method.
Example 1: Through the enumeration method, k ranges from 5 to 12, and the run is repeated several times for each k value using the Xgboost-k-means-based SD-EMD-LSTM model. Thereafter, the prediction for each k is calculated. Experiments on 24-h-ahead forecasting in different seasons are performed to identify the k value with the highest prediction accuracy. Figure 11 shows that when the number of clusters equals 9, the prediction curve most closely follows the raw curve on the four tested days. MAPE can also be used to determine the ideal number of clusters. Comparison results (see Table 2) show that the proposed model with 9 clusters outperformed all other cluster numbers with the smallest forecasting MAPE of 0.97%. That is, the proposed Xgboost-k-means method can effectively merge SDs into one cluster. Consequently, we define k = 9 as a priori knowledge in the proposed SD-EMD-LSTM model to select SDs for the subsequent load forecasting.
Example 2: This example includes two cases. Case 1 verifies the effect of the proposed SD selection method, with the simple k-means algorithm used for SD selection as a comparison with the proposed Xgboost-k-means model. Case 2 demonstrates the importance of using the supplemental feature, namely, the day-ahead peak load. The training period in this example is from 2003 to 2015, and the prediction period is 2016.
Case 1: EMD-LSTM is combined with the proposed SD selection method and with the simple k-means algorithm, respectively, to verify its performance. In the one-day-ahead load forecasting shown in Figure 12, the Xgboost-k-means hybrid with the EMD-LSTM model fits the raw data better than the simple k-means clustering algorithm. That is, the Xgboost-k-means algorithm merges SDs into one cluster more effectively, thereby improving the prediction accuracy.
Table 3 also verifies this scenario, showing that the SD-EMD-LSTM model achieved improved forecasting performance with a considerably small MAPE, which agrees with the conclusion presented in Section 3. The reason is that the Xgboost algorithm can assess each feature's weight, so the dimensionality limitation is generally reduced and the resulting models achieve higher forecasting accuracy.
Case 2: The SD-EMD-LSTM model is used with and without the supplemental feature (i.e., day-ahead peak load) to analyze the prediction accuracy for one-day-ahead load forecasting. Further details are shown in Figure 13.
The most significant forecasting errors often occur at the peak points of the forecast load curve. The proposed model with the supplemental feature (i.e., day-ahead peak load) achieves an improved forecasting performance at these peak points. The hourly mean absolute percentage errors listed in Table 4 indicate that the proposed model with the supplemental input feature obtained an average MAPE of 1.10%, lower than the 1.44% obtained by the model without it. Furthermore, SD-EMD-LSTM with the supplemental input feature has good prediction accuracy during peak hours (i.e., from 15:00 to 20:00). Therefore, the day-ahead peak load should be a supplemental input feature for load forecasting.
We can conclude from Figure 14 that the forecasting curve of the proposed SD-EMD-LSTM model follows the raw data better than the other alternative models for the two forecasting horizons in Example 3. Comparing the LSTM curve with those of SD-LSTM and EMD-LSTM shows that SD selection generally enhances the accuracy of both one-day-ahead and one-week-ahead forecasting. EMD can also effectively determine the general trend of the real time series.
Table 5 shows the MAPE values per month for all the models in Example 3. The last row of Table 5 lists the average MAPE values for the experiment over the 12 months. The LSTM neural networks combined with the Xgboost-k-means-based SD selection method perform better than the LSTM neural networks combined with the EMD model but are slightly inferior to the SD-EMD-LSTM model. The evaluation results of the MAPE indexes and the prediction curves of the four models are consistent.
Example 3 enables us to conclude the following points.
(1) The fitting effect of the hybrid model is evidently better than that of the single LSTM neural network model at both time scales.
(2) The Xgboost-k-means method can effectively merge SDs into one cluster and prevent the LSTM neural networks from being trapped in a local optimum, thereby substantially improving the prediction accuracy.
(3) The data decomposition method divides the singular values into separated IMFs and determines the general trend of the real time series, thereby effectively improving the performance and robustness of the model.
In general, the SD-EMD-LSTM model significantly outperforms the three other methods and achieves a good prediction effect in STLF. Figure 15 shows that the forecasting curve of the proposed SD-EMD-LSTM model is closer to the raw load curve than those of the other alternative models in Example 4. The performance of the three other methods is insufficient for STLF.
From the MAPE values in Table 6, the experimental results indicate that the proposed model is significantly superior to the SVR, ARIMA, and BPNN models. The MAPE of the SD-EMD-LSTM model is the lowest among all models; its prediction accuracy reaches 98.96% and 98.44% in the 24-h-ahead and 168-h-ahead forecasting, respectively. ARIMA has the largest MAPE value. Although the three other models captured the general trend of the raw data, their forecasting errors were extremely high.
The comparison between the two forecasting time scales demonstrates that the accuracy of the proposed hybrid model exhibits minimal change because the LSTM neural networks can exploit the long-term dependencies in the electric load time series for substantially accurate forecasting. That is, the SD-EMD-LSTM model can perform longer-horizon load forecasting. Overall, the proposed hybrid model provides a powerful method that can outperform many other forecasting methods on the challenging STLF problem.

Conclusions
This study presents an LSTM neural network model hybridized with the SD selection and EMD methods for STLF. The key idea in selecting SDs is to determine the attribute weights using the Xgboost algorithm and to calculate the distance between the selected day and the forecasting day by measuring different attributes with different weights. Thereafter, the k-means algorithm merges SDs into one cluster as input data for the subsequent forecasting based on the weighted distance. EMD then extracts the key features of the SD load at low and high frequencies. Lastly, the separate LSTM neural networks are used to forecast the future values of the low-frequency and high-frequency time series. The proposed method has been compared with the LSTM, SD-LSTM, EMD-LSTM, ARIMA, BPNN, and SVR models on real load data obtained from the NE-ISO for one-day-ahead and one-week-ahead load forecasting. Comparison results demonstrate that the proposed Xgboost-k-means method can effectively merge SDs into one cluster. Moreover, the EMD-LSTM model can accurately forecast the complex non-linear electric load time series over a long horizon. This analysis implies that the proposed SD-EMD-LSTM framework is a promising alternative approach to STLF.

Figure 3.
Figure 3. Correlations between daily load and one day-ahead peak load.

Figure 6.
Figure 6. The original data sequence of the similar daily load and the result of empirical mode decomposition.
In Equations (11)-(16), $W_i$, $W_f$, $W_c$, and $W_o$ represent the appropriate weight matrices, and $b_i$, $b_f$, $b_c$, and $b_o$ denote the corresponding bias vectors.

Figure 8.
Figure 8. The architecture of the LSTM memory block.

Figure 10.
Figure 10. The full flowchart of the SD-EMD-LSTM model.

Figure 15.
Figure 15. (a) One-day-ahead prediction for 9 January 2014, performed in Example 4; (b) one-week-ahead prediction from 12 October 2014 to 18 October 2014, performed in Example 4.

Table 2.
MAPE (%) for different numbers of clusters in one-day-ahead prediction.