Support Vector Regression Model Based on Empirical Mode Decomposition and Auto Regression for Electric Load Forecasting

Electric load forecasting is an important issue for a power utility, associated with the management of daily operations such as energy transfer scheduling, unit commitment, and load dispatch. Inspired by strong non-linear learning capability of support vector regression (SVR), this paper presents a SVR model hybridized with the empirical mode decomposition (EMD) method and auto regression (AR) for electric load forecasting. The electric load data of the New South Wales (Australia) market are employed for comparing the forecasting performances of different forecasting models. The results confirm the validity of the idea that the proposed model can simultaneously provide forecasting with good accuracy and interpretability.


Introduction
Electric energy is an unstored resource, thus, electric load forecasting plays a vital role in the management of the daily operations of a power utility, such as energy transfer scheduling, unit OPEN ACCESS commitment, and load dispatch.With the emergence of load management strategies, it is highly desirable to develop accurate, fast, simple, robust and interpretable load forecasting models for these electric utilities to achieve the purposes of higher reliability and better management [1].
In the past decades, researchers have proposed lots of methodologies to improve load forecasting accuracy.For example, Bianco et al. [2] proposed linear regression models for electricity consumption forecasting; Zhou et al. [3] applied a grey prediction model for energy consumption; Afshar and Bigdeli [4] proposed an improved singular spectral analysis method for short-term load forecasting (STLF) for the Iranian electricity market; and Kumar and Jain [5] applied three time series models-Grey-Markov model, Grey-Model with rolling mechanism, and singular spectrum analysis-to forecast the consumption of conventional energy in India.By employing artificial neural networks, references [6][7][8][9] proposed several useful short-term load forecasting models.By hybridizing the popular method and evolutionary algorithm, the authors of [10][11][12][13] demonstrated further performance improvements which could be made for energy forecasting.Though these methods can yield a significant proven forecasting accuracy improvement in some cases, they have usually focused on the improvement of the accuracy without paying special attention to the interpretability.Recently, expert systems, mainly developed by means of linguistic fuzzy rule-based systems, allow us to deal with the system modeling with good interpretability [14].However, these models have strong dependency on an expert and often cannot generate good accuracy.Therefore, combination models, based on the popular methods, expert systems and other techniques, are proposed to satisfy both high accurate level and interpretability.
Based on the advantages in statistical learning capacity to handle high dimensional data, the SVR (support vector regression) model, especially suitable for small sample size learning, has become a popular algorithm for many forecasting problems [15][16][17].As a disadvantage of an SVR method, it is easily trapped into a local optimum during the nonlinear optimization process of the three parameters, in the meanwhile, its robustness and sparsity are also lacking satisfactory levels.On the other hand, empirical mode decomposition (EMD) and auto regression (AR), a fast, easy and reliable unsupervised clustering algorithm, has been successfully applied to many fields, such as communication, society, economy, engineering, and has achieved good effects [18][19][20].Particularly, the EMD method can effectively extract the components of the basic mode from nonlinear or non-stationary time series [21], i.e., the original complex time series can be transferred into a series of single and apparent components.It can effectively reduce the interactions among lots of singular values and improve the forecasting performance of a single kernel function.Thus, it is useful to employ suitable kernel functions for forecasting the medium-and-long-term tendencies of the time series.
In this paper, we present a new hybrid model with clear human-understandable knowledge on training data to achieve a satisfactory level of forecasting accuracy.The principal idea is hybridizing EMD with SVR and AR, namely creating the EMDSVRAR model, to receive better solutions.The proposed EMDSVRAR model has the capability of smoothing and reducing the noise (inherited from EMD), the capability of filtering dataset and improving forecasting performance (inherited from SVR), and the in capability of effectively forecasting the future tendencies of data (inherited from AR).The forecasting outputs of an unseen example by using the hybrid method are described in the following section.
To show the applicability and superiority of the proposed algorithm, half-hourly electric load data (48 data points per day) from New South Wales (Australia) with two different sample sizes are 1889 employed to compare the forecasting performances among the proposed model and other four alternative models, namely the PSO-BP model (BP neural network trained by a particle swarm optimization algorithm), SVR model, PSO-SVR model (optimal combination of SVR parameters determined by a PSO algorithm), and the AFCM model (an adaptive fuzzy combination model based on a self-organizing map and support vector regression).This study also suggests that researchers and practitioners should carefully consider the nature and intention in using these electric load data while neural networks, statistical methods, and other hybrid models are being determined to be the critical management tools in electricity markets.The experimental results indicate that this proposed EMDSVRAR model has the following advantages: (1) simultaneously satisfies the need for high levels of accuracy and interpretability; (2) the proposed model can tolerate more redundant information than the original SVR model, thus, it has better generalization ability.
The rest of this paper is organized as follows: in Section 2, the EMDSVRAR forecasting model is introduced and the main steps of the model are given.In Section 3, the data description and the research design are outlined.The numerical results and comparisons are presented and discussed in Section 4. A brief conclusion of this paper and the future research are provided in Section 5.

Empirical Mode Decomposition (EMD)
The EMD method is based on the simple assumption that any signal consists of different simple intrinsic modes of oscillations.Each linear or non-linear mode will have the same number of extreme and zero-crossings.There is only one extreme between successive zero-crossings.Each mode should be independent of the others.In this way, each signal could be decomposed into a number of intrinsic mode functions (IMFs), each of which should satisfy the following two definitions [22]: a In the whole data set, the number of extreme and the number of zero-crossings should either equal or differ to each other at most by one.b At any point, the mean value of the envelope defined by local maxima and the envelope defined by the local minima is zero.
An IMF represents a simple oscillatory mode compared with the simple harmonic function.With the definition, any signal ( ) x t can be decomposed as following steps: 1 Identify all local extremes, and then connect all the local maxima by a cubic spline line as the upper envelope.2 Repeat the procedure for the local minima to produce the lower envelope.The upper and lower envelopes should cover all the data among them.3 The mean of upper and low envelope value is designated as 1 m , and the difference between the signal ( ) x t and 1 m is the first component, 1 h , as shown in Equation (1): Generally speaking, 1 h will not necessarily meet the requirements of the IMF, because 1 h is not a standard IMF.It needs to be determined for k times until the mean envelope tends to zero.Then, the first intrinsic mode function 1 c is introduced, which stands for the most high-frequency component of the original data sequence.At this point, the data could be represented as Equation ( 2): where 1k h is the datum after k times siftings.1( 1) k h − stands for the data after 1 k − times sifting.Standard deviation ( SD ) is used to determine whether the results of each filter component meet the IMF or not.SD is defined as Equation (3): where T is the length of the data.The value of standard deviation SD is limited in the range of 0.2 to 0.3, which means when 0.2 0.3 SD < < , the decomposition process can be finished.The consideration for this standard is that it should not only ensure ( ) k h t to meet the IMF requirements, but also control the decomposition times.Therefore, in this way, the IMF components could retain amplitude modulation information in the original signal.

When 1k
h has met the basic requirements of SD , based on the condition of 1 c = 1k h , the signal ( ) x t of the first IMF component 1 c can be obtained directly, and a new series 1 r could be achieved after deleting the high frequency components.This relationship could be expressed as Equation ( 4): The new sequence is treated as the original data and repeats the steps (1) to (3) processes.The second intrinsic mode function c 2 could be obtained.
5 Repeat previous steps (1) to (4) until the n r can not be decomposed into the IMF.The sequence n r is called the remainder of the original data ( ) x t .n r is a monotonic sequence, it can indicate the overall trend of the raw data ( ) x t or mean, and it is usually referred as the so-called trend items.It is of clear physical significance.The process is expressed as Equations ( 5) and ( 6): ) The original data can be expressed as the IMF component and remainder.

Support Vector Regression
The notions of SVMs for the case of regression are introduced briefly.Given a data set of N elements {( , ), 1, 2, , } , and i y ∈ℜ is the actual value corresponding to i X .A non-linear mapping (•): ℜ → ℜ is defined to map the training (input) data i X into the so-called high dimensional feature space (which may have infinite dimensions), h n ℜ (Figure 1a,b).Then, in the high dimensional feature space, there theoretically exists a linear function, f , to formulate the non-linear relationship between input data and output data.Such a linear function, namely SVR function, is shown as Equation ( 7): where ( ) f X denotes the forecasting values; the coefficients W ( h n W ∈ℜ ) and b ( b ∈ℜ ) are adjustable.As mentioned above, the SVM method aims at minimizing the empirical risk, shown as Equation ( 8): where ( , ( )) is the ε-insensitive loss function (indicated as a thick line in Figure 1c) and defined as Equation ( 9): (a) Input space , , , with the constraints: * ( ) , 1, 2,..., ( ) , 1, 2,..., 0, 1, 2,..., 0, 1, 2,..., The first term of Equation ( 10), employing the concept of maximizing the distance of two separated training data, is used to regularize weight sizes to penalize large weights, and to maintain regression function flatness.The second term penalizes training errors of ( ) f x and y by using the ε -insensitive loss function.C is the parameter to trade off these two terms.Training errors above ε are denoted as i ξ * , whereas training errors below − ε are denoted as i ξ (Figure 1b).
After the quadratic optimization problem with inequality constraints is solved, the parameter vector w in Equation ( 7) is obtained as Equation ( 12): where i ξ * , i ξ are obtained by solving a quadratic program and are the Lagrangian multipliers.Finally, the SVR regression function is obtained as Equation ( 13) in the dual space: 1 ( ) ( ) ( , ) where ( , )   i K X X is called the kernel function, and the value of the kernel equals the inner product of two vectors, i X and j X , in the feature space ( ) Any function that meets Mercer's condition [23] can be used as the kernel function.
There are several types of kernel function.The most used kernel functions are the Gaussian radial basis functions (RBF) with a width of : ( , ) exp( 0.5 / ) and the polynomial kernel with an order of d and constants 1 a and 2 a : . However, the Gaussian RBF kernel is not only easy to implement, but also capable of non-linearly mapping the training data into an infinite dimensional space, thus, it is suitable to deal with non-linear relationship problems.Therefore, the Gaussian RBF kernel function is specified in this study.The forecasting process of a SVR model is illustrated in Figure 2.   is named as the regression coefficients of the ( )

Numerical Examples
In the first experiment, the proposed model is trained by electric load from New South Wales (Australia) from 2 May 2007 to 7 May 2007, and testing electric load is from 8 May 2007.The employed electric load data is on a half-hourly basis (i.e., 48 data points per day).The data size contains only 7 days, to differ from the other example with more sample data, this example is so-called the small sample size data, and illustrated in Figure 3a.Too large training sets should avoid overtraining during the learning process of the SVR model.Therefore, the second experiment with 23 days (1104 data points from 2 May 2007 to 24 May 2007) is modeled by using part of all the training samples as training set.This example is so-called the large sample size data, and illustrated in Figure 3b.

Results after EMD
After being decomposed by EMD, the data can be divided into eight groups, which are shown in Figure 4a-h and the last group (Figure 4h) is a trend term (remainders).The so-called high frequency item is obtained by adding the preceding seven groups.From Figures 3a,b, the trend of the high frequency item is the same as original data, and its the structure is more regular, i.e., it is more stable.Then, the high frequency item (data-I) and the remainders (data-II) have good effect of regression by the SVR and AR, respectively, and will be described as follow.

Figure 4.
For ease of prevention, the graphs (a-h) show our section of plots at different IMFs for the small sample size.

Forecasting Using SVR for Data I (The High Frequency Item)
Firstly, for both small sample and large sample data, the high-frequency item is simultaneously employed for SVR modeling, and the better performances of the training and testing (forecasting) sets are shown in Figure 5a,b, respectively.The correlation coefficients of training effects are 0.9912 and 0.9901, respectively, of the forecast effects are 0.9875 and 0.9887, accordingly.This implies that the decomposition is helpful to improve the forecasting accuracy.The parameters of a SVR model for data I are shown in Table 1.

Forecasting Using AR for Data II (The Remainders)
Then, according to the geometric decay of the correlation coefficient and partial correlation coefficients fourth-order truncation for data II (the remainders), it can be regarded as AR (4) model.The parameters of a SVR model for data II are shown in Table 1.
As shown in Figure 6a,b, the remainders, for both small sample and large sample data, almost are in a straight line.The good forecasting results are shown in Table 2, and the errors have reached the level of 10 −7 for the small or large amount of data.It has demonstrated the superiority of the AR model.

Remainders MAE Eqution
The small sample size

Result and Analysis
This section focuses on the efficiency of the proposed model with respect to computational accuracy and interpretability.To consider the small sample size modeling ability of the SVR model and conduct fair comparisons, we perform a real case experiment with relatively small sample size in the first experiment.The next experiment with 1104 datapoints is focused on illustrating the relationship between sample size and accuracy.

Parameter Settings of the Employed Forecasting Models
As mentioned by Taylor [25], and to be based on the same comparison condition with Wang et al. [26], some parameter settings of the employed forecasting models are set as followings.For the PSO-BP model, we use 90 percent of all training samples as the training set, and the rest as the evaluation set.The parameters used in the PSO-BP are as follows: (i) The first set related to BP neural network: input layer dimension indim = 2, hidden layer dimension hiddennum = 3, output layer dimension outdim = 1; (ii) The second set related to PSO: maximum iteration number itmax = 300, number of particles N = 40, length of particle D = 3, weight c 1 = c 2 = 2.
Because the PSO-SVR model embeds the construction and prediction algorithm of SVR in the fitness value iteration step of PSO, it will take a long time to train the PSO-SVR using the full training dataset.
For the above reason, we draw a small part of all training samples as training set, and the rest as evaluation set.The parameters used in the PSO are as follows: For small sample size: maximum iteration number itmax = 50, number of particles N = 20, length of particle D = 3, weight c 1 = c 2 = 2.For large sample size: maximum iteration number itmax = 20, number of particles N = 5, length of particle D = 3, weight c 1 = c 2 = 2.

Forecasting Evaluation Methods
For the purpose of evaluating the forecasting capability, we examine the forecasting accuracy by calculating three different statistical metrics, the root mean square error (RMSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE).The definitions of RMSE, MAE and MAPE are expressed as Equations (15-17): ) Where P i and A i are the i-th predicted and actual values respectively, and n is the total number of predictions.

Empirical Results and Analysis
For the first experiment, the forecasting results (the electric load on 8 May 2007) of the original SVR model, the PSO-SVR model and the proposed EMDSVRAR model are shown in Figure 7a.Notice that the forecasting curve of the proposed EMDSVRAR model fits better than other alternative models.
The second experiment shows the one-week-ahead forecasting for the large sample size data.The peak load values of testing set are bigger than that of training set shown in Figure 3b.The detailed forecasted results of this experiment are shown in Figure 7b.It indicates that the results obtained from the EMDSVRAR model fits the peak load values exceptionally well.In other words, the EMDSVRAR model has better generalization ability than the three comparison models.
The forecasting results from these models are summarized in Table 3.The proposed EMDSVRAR model is compared with four alternative models.It is found that our hybrid model outperforms all other alternatives in terms of all the evaluation criteria.One of the general observations is that the proposed model tends to fit closer to the actual value with a smaller forecasting error.
The proposed model shows the higher forecasting accuracy in terms of three different statistical metrics.In view of the model effectiveness and efficiency on the whole, we can conclude that the proposed model is quite competitive against four comparison models, the PSO-BP, SVR, PSO-SVR, and AFCM models.In other words, the hybrid model leads to better accuracy and statistical interpretation.Several observations can also be noticed from the results.Firstly, from the comparisons among these models, we point out that the proposed model outperforms other alternative models.Secondly, the EMDSVRAR model has better generalization ability for different input patterns as shown in the second experiment.Thirdly, from the comparison between the different sample sizes of these two experiments, we conclude that the hybrid model can tolerate more redundant information and construct the model for the larger sample size data set.Finally, since the proposed model generates good results with good accuracy and interpretability, it is robust and effective as shown in Table 3. Overall, the proposed model provides a very powerful tool to implement easily for forecasting electric load.
Furthermore, to verify the significance of the accuracy improvement of the EMDSVRAR model, the forecasting accuracy comparison among original SVR, PSO-SVR, PSO-BP, AFCM, and EMDSVRAR models is conducted by a statistical test, namely a Wilcoxon signed-rank test, at the 0.025 and 0.05 significance levels in one-tail-tests.The test results are shown in Table 4. Clearly, the proposed EMDSVRAR model has statistical significance (under a significant level 0.05) among the other alternative models, particularly comparing with original SVR, PSO-SVR, PSO-BP, and AFCM models.

Conclusions
The proposed model achieves superiority and outperforms the original SVR model while forecasting based on the unbalanced data.In addition, the goal of the training model is not to learn an exact representation of the training set itself, but rather to set up a statistical model that generalizes better forecasting values for the new inputs.In practical applications of a SVR model, if the SVR model is overtrained to some sub-classes with overwhelming size, it memorizes the training data and gives poor generalization of other sub-classes with small size.The EMD term of the proposed EMDSVRAR model has been employed in the present research, details of which have discussed in the above section.
The interest in applying the EMD forecast systems arises from the fact that those systems consider both accuracy and comprehensibility of the forecast result simultaneously.To this end, a combined model has been proposed and its effectiveness in forecasting the electric load data has been compared with three other alternative models.In this study, various data characteristics of electric load are identified where the proposed model performs better than the other algorithms in terms of its forecasting capability.Based on the obtained experimental results, we conclude that the proposed EMDSVRAR model algorithm can generate not only human-understandable rules, but also better forecasting accuracy levels.Our proposed model also outperforms other alternative models in terms of interpretability, forecasting accuracy and generalization ability, which are especially true for forecasting with unbalanced data and very complex systems.
ε Θ is employed to find out an optimum hyperplane on the high dimensional feature space (Figure 1b) to maximize the distance separating the training data into two subsets.Thus, the SVR focuses on finding the optimum hyper plane and minimizing the training error between the training data and the ε -insensitive loss function.Then, the SVR minimizes the overall errors, shown as Equation (10):

Figure 2 .
Figure 2. The forecasting process of a SVR model.

2 (
Model Equation (14) expresses a p -step autoregressive model, referring as ( ) AR p model [24].Stationary time series{ } t X that meet the model ( ) AR p is called the ( ) AR p sequence.That 1

Figure 3 .
Figure 3. (a) Half-hourly electric load in New South Wales from 2 May 2007 to 8 May 2007; (b) Half-hourly electric load in New South Wales from 2 May 2007 to 24 May 2007.

Figure 5 .Table 1 .
Figure 5.Comparison of the data-I and the forecasted electric load of train and test by the SVR model for the small sample and large sample data: (a) One-day ahead prediction of 8 May 2007 are performed by the model; (b) One-week ahead prediction from 18 May 2007 to 24 May 2007 are performed by the model.

Figure 6 .
Figure 6.Comparison of the data-II and the forecasted electric load by the AR model for the two experiments: (a) One-day ahead prediction of 8 May 2007 performed by the model; (b) One-week ahead prediction from 18 May 2007 to 24 May 2007 performed by the model.

Figure 7 .
Figure 7.Comparison of the original data and the forecasted electric load by the EMDSVRAR Model, the SVR model and the PSO-SVR model for (a) the small sample size (One-day ahead prediction of 8 May 8, 2007 are performed by the models); (b) the large sample size (One-week ahead prediction from May 18, 2007 May 24, 2007 are performed by the models).

Table 2 .
Summary of results of the AR forecasting model for data-II.

Table 3 .
Summary of results of the forecasting models.
a denotes that the EMDSVRAR model significantly outperforms other alternative models.