A Feature-Weighted SVR Method Based on Kernel Space Feature

Abstract: Support Vector Regression (SVR), which converts the original low-dimensional problem into a linear problem in a high-dimensional kernel space by introducing kernel functions, has been successfully applied in system modeling. The classical SVR algorithm takes the values of the features into account but ignores their contribution to the model output, so the construction of the kernel space may not be reasonable. In this paper, a Feature-Weighted SVR (FW-SVR) is presented. The range of each feature is matched with its contribution by properly assigning weights to the input features during data pre-processing. FW-SVR optimizes the distribution of the sample points in the kernel space so that the minimization of the structural risk is more reasonable. Four synthetic datasets and seven real datasets are used for evaluation, and the proposed method obtains a superior generalization ability.


Introduction
SVR is a powerful kernel-based method for regression problems [1][2][3]. It converts the original low-dimensional problem into a linear problem in a high-dimensional kernel space by introducing kernel functions. For system modeling with limited training samples, it balances the empirical risk and the confidence interval based on the principle of structural risk minimization. It avoids the over-fitting caused by an overly complex model and preserves the generalization performance of the model while fitting the training data sufficiently closely [4][5][6][7][8].
The generalization ability of the SVR model is determined by the kernel space feature [9]. The value of a kernel element can be regarded as the similarity measure between two samples in the kernel space. The kernel function simplifies the calculation of the inner product in the kernel space, so the curse of dimensionality is avoided. However, the contribution of each feature to the output is ignored in classical SVR. In some cases, such as when the dynamic range of an unimportant feature is large, the similarity of samples in the kernel space may be dominated by that feature, so that the kernel matrix cannot deliver sufficient information about the training set to the model. The optimization based on structural risk minimization is then affected.
At present, research on SVR modeling focuses on the construction of the model and the optimization of the parameters [10][11][12][13], while the pre-processing of data is neglected. Data normalization methods, such as min-max normalization and Z-normalization, are the most widely used pre-processing methods [14][15][16]. Min-max normalization linearly rescales raw data to [0, 1] or [−1, 1]. Z-normalization transforms the raw dataset into a dataset with a mean of zero and a variance of one. Normalization can overcome the numerical difficulties caused by large differences in dynamic range among input features. However, there is no evidence that normalization will always improve the generalization performance, and whether to adopt it is still decided by the experience of engineers. In some literature, feature selection is used to remove unimportant features from the training dataset and thus avoid their dominant influence on the kernel space feature [17][18][19]. However, removing features discards part of the training information.
Weighting methods are also used to improve the generalization ability. Zhang Fan et al. developed a forecasting model using weighted SVR in which the weights were determined by the DE algorithm, and this model yielded high accuracy for building energy consumption forecasting [20]. Limei Liu combined a weighted support vector regression machine with feature selection to predict electricity load, and the algorithm gave good prediction results [21]. Han additionally added weights to the slack variables in the constraints to predict house prices [22]. These weighted SVR algorithms take the importance of sample points into account, and they can be used to minimize the influence of outliers or noise. However, the importance of the individual features is ignored. Recently, there have been some research works on feature weighting for the Support Vector Machine (SVM) classification problem [23][24][25][26][27]. Unfortunately, these methods cannot be applied to SVR because of the differences in the output.
This paper proposes a Feature-Weighted (FW)-SVR modeling method based on the kernel space feature. First, by analyzing the similarity of sample points in the kernel space, we conclude that the classical methods are not reasonable, because the values of the features are taken into account while their contribution to the model output is ignored. Then, after analyzing the limitations of the normalization algorithm and the feature selection SVR algorithm, we present the FW-SVR algorithm, which makes the values of the features match their contribution.
The contribution of this work is two-fold. First, by analyzing the similarity of sample points in the kernel space, we show that the importance of a feature should match its influence on the kernel space. Second, a feature-weighting data pre-processing method based on this conclusion is given. By adjusting the range of the feature values through properly assigned weights, the feature importance is matched with its influence on the kernel space, and the generalization ability of the model is improved, which in turn verifies the first conclusion. The proposed method can be used to guide the data pre-processing of SVR modeling.
The paper is organized as follows: In Section 2 "Basic Review of SVR", SVR theory is briefly described. In Section 3 "Feature-Weighted Support Vector Regression", the necessity of feature weighting is analyzed in theory, and then the realization process of FW-SVR is introduced in detail. Simulation examples are given in Section 4 "Simulation Examples". Section 5 "Conclusions" concludes the paper.

Basic Review of SVR
The training set is given as $T = \{(x_i, y_i),\ i = 1, \dots, l\}$, where each $x_i \in \mathbb{R}^n$ is the i-th input sample containing n features and $y_i \in \mathbb{R}$ is the corresponding output sample. The model function determined by the SVR method can be regarded as a hyperplane in the kernel space. It is expressed as follows:

$$f(x) = w^{T}\varphi(x) + b \qquad (1)$$

where $\varphi(x)$ maps the raw data of input features to a high-dimensional kernel space, $w$ is the weight vector of the hyperplane in that space and $b$ is a bias term. An ε-insensitive loss function with $\varepsilon > 0$ is introduced to avoid over-fitting, and additional nonnegative slack variables $\xi_i, \xi_i^*$ are adopted to relax the constraints for certain sample points. SVR modeling is formulated as a convex quadratic programming problem expressed as follows:

$$\min_{w,\,b,\,\xi,\,\xi^*}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \qquad (2)$$

subject to:

$$\begin{cases} y_i - w^{T}\varphi(x_i) - b \le \varepsilon + \xi_i \\ w^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \dots, l \end{cases} \qquad (3)$$

where $C > 0$ is a penalty parameter. The above convex quadratic programming problem can be reformulated by constructing the Lagrange function:

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) - \sum_{i=1}^{l}\alpha_i\left(\varepsilon + \xi_i - y_i + w^{T}\varphi(x_i) + b\right) - \sum_{i=1}^{l}\alpha_i^*\left(\varepsilon + \xi_i^* + y_i - w^{T}\varphi(x_i) - b\right) - \sum_{i=1}^{l}(\eta_i\xi_i + \eta_i^*\xi_i^*) \qquad (4)$$

where $\alpha_i, \alpha_i^* \ge 0$ and $\eta_i, \eta_i^* \ge 0$ are Lagrange multipliers. The kernel function $K(\cdot,\cdot)$, which satisfies the Mercer condition, is introduced to replace the inner product of the high-dimensional kernel space in Equation (4). The commonly used kernel functions are the Gaussian kernel, linear kernel, sigmoid kernel, polynomial kernel, and so on [28][29][30]. These kernel functions are listed in Table 1.
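As a minimal sketch of the kernel functions named above, the following code evaluates each kernel on a pair of samples. It assumes the LIBSVM-style parameterization (γ, r, d), which is not spelled out in the surviving text, and the function names are illustrative only.

```python
import math


def dot(u, v):
    """Inner product of two equally long feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(u, v, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def linear_kernel(u, v):
    """Linear kernel: u^T v."""
    return dot(u, v)


def sigmoid_kernel(u, v, gamma, r):
    """Sigmoid kernel: tanh(gamma * u^T v + r)."""
    return math.tanh(gamma * dot(u, v) + r)


def polynomial_kernel(u, v, gamma, r, d):
    """Polynomial kernel: (gamma * u^T v + r)^d."""
    return (gamma * dot(u, v) + r) ** d
```

Every kernel here depends on the raw feature values only through differences or products of coordinates, which is why the scale of each feature directly shapes the kernel matrix.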

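Table 1. Commonly used kernel functions.

Name | Definition | Parameters
Gaussian | $K(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^2)$ | $\gamma$
Linear | $K(x_i, x_j) = x_i^{T}x_j$ | none
Sigmoid | $K(x_i, x_j) = \tanh(\gamma x_i^{T}x_j + r)$ | $\gamma$, $r$
Polynomial | $K(x_i, x_j) = (\gamma x_i^{T}x_j + r)^d$ | $\gamma$, $r$, $d$

The optimization problem can be expressed as follows:

$$\max_{\alpha,\,\alpha^*}\ -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) - \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l}y_i(\alpha_i - \alpha_i^*) \qquad (5)$$

subject to:

$$\sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le C, \quad i = 1, \dots, l \qquad (6)$$

The optimal solution can be obtained as follows:

$$w = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)\varphi(x_i) \qquad (7)$$

The model function Equation (1) can be further developed as follows:

$$f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)K(x_i, x) + b \qquad (8)$$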

The Necessity of Feature Weighting
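The similarity between sample points $x_i$ and $x_j$ in the kernel space can be measured by calculating the distance $d_{ij}$ between $\varphi(x_i)$ and $\varphi(x_j)$:

$$d_{ij} = \|\varphi(x_i) - \varphi(x_j)\| = \sqrt{K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)} \qquad (9)$$

When the Gaussian kernel is adopted, $d_{ij}$ can be expressed as:

$$d_{ij} = \sqrt{2 - 2\exp\left(-\gamma\|x_i - x_j\|^2\right)} \qquad (10)$$

From Equation (10), we can deduce that the greater the similarity of the sample points, the smaller the distance between their mappings in the kernel space. When $x_i = x_j$, the similarity is greatest and the distance $d_{ij}$ is zero.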
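The kernel-space distance can be computed without ever forming φ(x), because it only needs three kernel evaluations. The sketch below illustrates this for the Gaussian kernel; the function names are hypothetical, and γ is assumed to be the usual kernel width parameter.

```python
import math


def gaussian_kernel(u, v, gamma):
    """Gaussian kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_distance(u, v, gamma):
    """Distance between phi(u) and phi(v), using only kernel evaluations.

    ||phi(u) - phi(v)||^2 = K(u, u) - 2 K(u, v) + K(v, v)
    """
    squared = (gaussian_kernel(u, u, gamma)
               - 2.0 * gaussian_kernel(u, v, gamma)
               + gaussian_kernel(v, v, gamma))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))
```

For the Gaussian kernel, K(x, x) = 1 for every x, so the distance is bounded above by the square root of 2 and grows monotonically with the input-space distance.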
A simple example is given to illustrate that the construction of the kernel space is not reasonable when the value of a feature is the only consideration and its contribution to the model output is neglected. Consider a set of sample points {x 1 , x 2 , x 3 }, in which the first component of each sample point is Feature 1 and the second is Feature 2. We can measure which of φ (x 1 ) and φ (x 2 ) is more similar to φ (x 3 ) by comparing the values of d 13 and d 23 .
The difference between the similarities of the two sample pairs is decided by ρ 1 and ρ 2 accordingly. However, in some situations, the contributions of the two features to the model output are quite different for the actual system. Assume that Feature 1 contributes greatly to the output, so that a small change in it leads to a great change in the output. In contrast, assume that the contribution of Feature 2 to the output is very small, so that even a large change in it leads to only a slight change in the output. When ρ 1 = ρ 2 , it is obvious that the impact of the sample point x 2 on the output is more similar to that of x 3 . When ρ 1 < ρ 2 , the influence of the sample point x 2 on the output may still be more similar to that of x 3 . Without considering the contribution of the features to the output, however, the kernel-space distances lead to the opposite conclusion. Therefore, the similarity of the sample points generated by the classical algorithm may be greatly influenced by unimportant features with a large value range, resulting in an inconsistency between the similarities in the kernel space and those in the actual system.
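The argument can be made concrete with a small numerical sketch. The sample coordinates and the toy output function below are assumptions chosen for illustration, not the values used in the paper: x 1 is assumed to differ from x 3 only in Feature 1 (by ρ 1 ), and x 2 only in Feature 2 (by ρ 2 ).

```python
import math


def gaussian_distance(u, v, gamma=1.0):
    """Kernel-space distance for the Gaussian kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(2.0 - 2.0 * math.exp(-gamma * sq))


def toy_output(x):
    """Hypothetical system: Feature 1 matters a lot, Feature 2 very little."""
    return 10.0 * x[0] + 0.1 * x[1]


rho1, rho2 = 0.5, 1.0          # assumed offsets, with rho1 < rho2
x3 = (0.0, 0.0)
x1 = (rho1, 0.0)               # differs from x3 in Feature 1 only
x2 = (0.0, rho2)               # differs from x3 in Feature 2 only

d13 = gaussian_distance(x1, x3)
d23 = gaussian_distance(x2, x3)
gap13 = abs(toy_output(x1) - toy_output(x3))
gap23 = abs(toy_output(x2) - toy_output(x3))

# The kernel space ranks x1 as closer to x3, yet x2 behaves more like x3.
print("kernel distances:", d13, d23)
print("output gaps:     ", gap13, gap23)
```

Under these assumptions, the kernel-space ranking and the output-space ranking disagree, which is the inconsistency the paragraph describes.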
From Formulas (4)-(5), we can see that the kernel element is used to simplify the computation of the inner product in the kernel space when solving the convex quadratic programming problem. If the similarity cannot reflect the actual behavior of the dataset and is dominated by unimportant features, the solution obtained by applying the structural risk minimization principle is unreasonable.
In this paper, the feature-weighting method is used to match the effect of each feature on the kernel space feature with its contribution to the model output. If a weight value $w_k \in [0, 1]$ is assigned to the k-th feature, so that $x_{ik}$ is replaced by $w_k x_{ik}$ in pre-processing, the Gaussian kernel element becomes:

$$K_w(x_i, x_j) = \exp\left(-\gamma\sum_{k=1}^{n} w_k^2 (x_{ik} - x_{jk})^2\right)$$

Likewise, the weighted elements of the linear kernel can be rewritten as follows:

$$K_w(x_i, x_j) = \sum_{k=1}^{n} w_k^2\, x_{ik}\, x_{jk}$$

The weighted sigmoid kernel can be expressed as follows:

$$K_w(x_i, x_j) = \tanh\left(\gamma\sum_{k=1}^{n} w_k^2\, x_{ik}\, x_{jk} + r\right)$$

If the contribution of each feature to the output can be determined before model training and an appropriate weight value is assigned, the role played by the feature in the kernel space matches its contribution. When all the feature weights are one, FW-SVR degrades to the classical SVR. When the weight of a feature tends to zero, the input feature has little influence on the output, which amounts to a dimension reduction. Moreover, the distance between the sample points is shortened, and the distribution of the samples becomes more compact.
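A minimal sketch of the weighted kernels follows. It assumes that weighting is applied as a pre-processing step that scales each feature by its weight before an ordinary kernel is evaluated, so the weights enter the Gaussian and linear kernels squared; the function names are illustrative only.

```python
import math


def apply_weights(x, w):
    """Pre-processing step: scale the k-th feature by its weight w_k."""
    return [wk * xk for wk, xk in zip(w, x)]


def weighted_gaussian_kernel(u, v, w, gamma):
    """Gaussian kernel evaluated on weighted features."""
    return math.exp(-gamma * sum((wk * (a - b)) ** 2
                                 for wk, a, b in zip(w, u, v)))


def weighted_linear_kernel(u, v, w):
    """Linear kernel evaluated on weighted features."""
    return sum((wk * a) * (wk * b) for wk, a, b in zip(w, u, v))


def weighted_sigmoid_kernel(u, v, w, gamma, r):
    """Sigmoid kernel evaluated on weighted features."""
    return math.tanh(gamma * weighted_linear_kernel(u, v, w) + r)
```

With all weights equal to one these functions reduce to the classical kernels, and a weight of zero removes the corresponding feature from the kernel element entirely.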

The Implementation of the FW-SVR
The optimal combination of weights is the premise of realizing FW-SVR. In order to verify the conclusion of Section 3.1, we use the grid search method to obtain the optimal weight combination. Grid search is an exhaustive search method. Each feature has a set of candidate weight values, and all combinations are enumerated to generate the "grid". Every combination is tested by SVR, and the optimal one is selected.
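The exhaustive search described above can be sketched as follows. To keep the example self-contained, the scoring function is a stand-in: it uses the leave-one-out error of a 1-nearest-neighbour regressor on the weighted features, not an SVR. In practice the score would be the cross-validated RMSE of an SVR trained on the weighted data. The toy data and all names are assumptions for illustration.

```python
import itertools
import math


def grid_search_weights(candidates, n_features, score):
    """Try every weight combination and keep the one with the lowest score."""
    best_w, best_score = None, math.inf
    for w in itertools.product(candidates, repeat=n_features):
        s = score(w)
        if s < best_score:
            best_w, best_score = w, s
    return best_w, best_score


def loo_nn_rmse(X, y, w):
    """Stand-in scorer (NOT SVR): leave-one-out RMSE of 1-NN regression."""
    sq_err = 0.0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((wk * (a - b)) ** 2
                              for wk, a, b in zip(w, xi, X[j])),
        )
        sq_err += (y[i] - y[nearest]) ** 2
    return math.sqrt(sq_err / len(X))


# Toy data: the output depends on Feature 1 only; Feature 2 is wide-range noise.
X = [(0.0, 40.0), (0.1, -30.0), (0.2, 10.0),
     (0.3, -50.0), (0.4, 20.0), (0.5, -10.0)]
y = [x1 for x1, _ in X]

candidates = [10.0 ** e for e in (-4, -2, 0)]
best_w, best_rmse = grid_search_weights(
    candidates, 2, lambda w: loo_nn_rmse(X, y, w))
print(best_w, best_rmse)
```

The cost grows as the number of candidate values raised to the number of features, which is why the search is practical only for a small number of features or a coarse grid.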
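The SVR training that introduces the weight values is shown as follows, where $K_w(\cdot,\cdot)$ denotes the kernel evaluated on the weighted features $w_k x_{ik}$:

$$\max_{\alpha,\,\alpha^*}\ -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K_w(x_i, x_j) - \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l}y_i(\alpha_i - \alpha_i^*)$$

subject to:

$$\sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le C, \quad i = 1, \dots, l$$

The model function of FW-SVR can be expressed as follows:

$$f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)K_w(x_i, x) + b$$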

Simulation Examples
In this section, four synthetic datasets and seven real datasets are employed to verify the feasibility of FW-SVR. All the simulations are implemented on a Windows 10 PC with an Intel Core i5-3740 CPU (3.2 GHz) and 4.0 GB RAM using MATLAB R2013a. The SVR training and test algorithms are implemented in LIBSVM 3.22 [31]. The parameters for each approach on each dataset are optimized by grid search with five-fold cross-validation on a sample of the training set [32,33].
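The Root Mean Square Error (RMSE) is employed to evaluate the feasibility of the FW-SVR method:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2}$$

where $N$ is the number of evaluated samples, $y_i$ is the actual output sample and $f(x_i)$ is its corresponding predicted value. The smaller the value of RMSE, the better the generalization ability of the model.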
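For reference, the RMSE can be computed with a few lines of code. This is a minimal sketch; the function name and the input validation are illustrative choices.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted outputs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((a - p) ** 2 for a, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))
```

Because the errors are squared before averaging, a few large prediction errors raise the RMSE more than many small ones.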

Synthetic Datasets
The definitions of these functions are listed in Table 2.
Table 2. Functions used to generate synthetic datasets.

Name Function Definition Domain of Definition
where σ is the added Gaussian noise with a mean of zero and a standard deviation of 0.01.
The synthetic dataset "F1" is chosen as an example. Feature 1 (x i1 ) and Feature 2 (x i2 ) in the training data are taken from sinusoidal signals of 0.01 Hz and 0.05 Hz, respectively. Their corresponding test data are extracted from a linear function and a sinusoidal signal of 0.125 Hz, respectively, as shown in Figure 1. The expression of F1 shows that the range of the first term, which is affected by Feature 1, is [−0.2172, 1], and that of the second term, which is affected by Feature 2, is restricted to [−0.03, 0.03]. It follows that Feature 1, which contributes greatly to the output, is an important feature, while Feature 2 is not. However, this difference in contribution is neglected when the kernel matrix is calculated. In order to observe the influence of Feature 1 and Feature 2 on the kernel matrix, a kernel width γ = 0.01 is used to compare the three kernel matrices, which are visualized as 2D heat-maps in Figure 2.
According to Equation (12), the kernel matrix can fully reflect the similarity between the sample points in the kernel space. The similarity pattern of the sample points in Figure 2c is clearly visible in Figure 2a, while the influence of Feature 1, which has a high contribution to the output, is obviously weakened. To verify the necessity of feature weighting, a grid search is applied to obtain the RMSE for each possible combination of w 1 and w 2 [34]. The values of w 1 and w 2 are searched from the set $\{10^{-4}, 10^{-3.5}, 10^{-3}, \dots, 10^{0}\}$. The resulting model performance is shown in Figure 3. According to Figure 3, a better generalization ability occurs when the value of w 1 /w 2 is around 100, and the best is acquired when $w_1 = 10^{-2}$ and $w_2 = 10^{-4}$. Overall, a better generalization ability is obtained when w 1 > w 2 than when w 1 < w 2 . FW-SVR degrades to the classical SVR when w 1 = w 2 = 1, and in that case the generalization ability of the model is poor.
We compare the feature weighting method with the feature selection and the normalization.In feature weighting, w 1 = 10 −2 and w 2 = 10 −4 .In feature selection, Feature 2 is deleted.In the normalization method, the min-max normalization and the Z-normalization are employed.
Min-max normalization linearly rescales raw data to [0, 1]. The k-th feature of the i-th sample, $x_{ik}$, is normalized to $x'_{ik}$:

$$x'_{ik} = \frac{x_{ik} - x_{k\min}}{x_{k\max} - x_{k\min}} \qquad (19)$$

where $x_{k\max}$ and $x_{k\min}$ are the maximum and minimum values of the k-th feature, respectively. The Z-normalization method normalizes the raw data to a dataset with a mean value of zero and a variance of one:

$$x'_{ik} = \frac{x_{ik} - \mu_k}{\sigma_k} \qquad (20)$$

where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the k-th feature, respectively.
First, the kernel matrix generated by feature weighting is compared with those generated by the other methods. Because weighting and normalization change the values of the features, the parameter γ is selected for each method so that the kernel elements computed from Feature 1 remain consistent, which makes the comparison fair. The kernel matrices generated by the above methods are shown in Figure 4.
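The two normalizations in Equations (19) and (20) can be sketched for a single feature column as follows. The function names are illustrative, and the population standard deviation (division by the number of samples) is an assumption, since the text does not state which estimator is used.

```python
import math


def min_max_normalize(column):
    """Linearly rescale one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("constant feature cannot be min-max normalized")
    return [(x - lo) / (hi - lo) for x in column]


def z_normalize(column):
    """Shift and scale one feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation (assumption: divide by n, not n - 1).
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0.0:
        raise ValueError("constant feature cannot be z-normalized")
    return [(x - mean) / std for x in column]
```

Both transforms give every feature roughly the same range regardless of its contribution to the output, which is the limitation that feature weighting is meant to address.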
As can be seen from Figure 4, feature weighting reduces the influence of Feature 2 on the kernel matrix, and Figure 4a is similar to Figure 2b. Compared to feature selection, however, feature weighting preserves the information of Feature 2. It can be deduced from Figure 4b,c that the normalization method can weaken the influence of Feature 2, but its influence is still great. As the ranges of Feature 1 and Feature 2 are essentially the same after normalization, the contributions of Feature 1 and Feature 2 to the kernel matrix are much the same. When the range of the unimportant features is much wider than that of the important features, normalization can largely reduce the influence of the unimportant features. Conversely, when the range of the unimportant features is narrower, normalization increases their influence.
Then, feature weighting is compared with the raw dataset, feature selection and normalization to observe the differences in generalization ability. The search ranges of parameters C and γ are $[2^{-8}, 2^{9}]$ and $[2^{-8}, 2^{10}]$, respectively, and ε is set to 0.0064. The optimal hyper-parameters are obtained by five-fold cross-validation. The prediction outputs for the test set are shown in Figure 5. As can be seen from Figure 5, the prediction curve with raw data is quite different from the real output, as are the two prediction curves with normalized data. Feature selection achieves better results; however, under-fitting occurs because of the deletion of Feature 2. The best prediction result is obtained by the feature weighting method.
We model the synthetic datasets in Table 2 to compare the above methods. For feature selection, all possible feature combinations are tested in order to obtain the optimal one, which is then compared with feature weighting. The program is repeated 10 times. The optimal combination of weights is shown in Table 3, and the results are shown in Table 4, in which bold values indicate the method with the best performance.
Finally, we randomly chose seven UCI benchmark datasets [35]. The grid search method is used to select the optimal combination of weights from the set $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. The optimal combination of weights is shown in Table 5, and the results are compared in Table 6, in which bold values indicate the method with the best performance.
According to Tables 4 and 6, FW-SVR achieves a competitive generalization performance on both the synthetic datasets and the real datasets. The FW-SVR that uses the Gaussian kernel performs reasonably well on all 11 datasets. Note that the results on F1 and F2 are both unacceptable for the linear kernel and the sigmoid kernel because of under-fitting, so these two datasets are not included in the following comparison. For the linear kernel, three optimal results and three suboptimal results are obtained by FW-SVR. In addition, there are three results that are close to the optimal ones. For the sigmoid kernel, FW-SVR achieves a competitive generalization performance on the synthetic datasets. For example, the mean RMSE of 0.0024 on F4 is better than the value of 0.0120 obtained with the Gaussian kernel. However, FW-SVR is not the optimal choice for the UCI datasets, as shown in Table 6. In general, the overall results obtained by the Wilcoxon tests presented in Table 7 show that FW-SVR achieves the best generalization performance in comparison with the other five data pre-processing methods when the most suitable kernel type is selected. Comparing the five methods, we deduce that the contribution of each feature to the output is taken into account by FW-SVR, which reduces the influence of unimportant features on the kernel space feature.
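The Wilcoxon signed-rank test compares two methods on paired results, such as the RMSE values of two pre-processing methods over the same datasets. The sketch below computes only the rank sums of the positive and negative differences, using average ranks for ties and dropping zero differences. The p-value is omitted; in practice it would be taken from a statistics library or a table. The function name is illustrative.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    # Zero differences carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0   # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

If `a` holds the errors of one method and `b` those of another, a small W+ relative to W- indicates that the first method tends to have the lower error.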

Conclusions
In this paper, we propose an FW-SVR that matches the effect of each feature on the kernel space feature with its contribution to the model output. By analyzing the similarity of sample points in the kernel space, we conclude that FW-SVR makes the distribution of the sample points in the kernel space more reasonable, which is important for increasing the generalization ability. Numerical experiments show the effectiveness of the proposed algorithm. Our future work will focus on automatically identifying the contribution of each feature in order to assign an optimal weight combination.

Figure 1. The input feature data of the training dataset and the test dataset: (a) training dataset; (b) test dataset.

Figure 2. Three 2D heat-maps of kernel matrices: (a) kernel matrix generated by two features; (b) kernel matrix generated by Feature 1; (c) kernel matrix generated by Feature 2.

Figure 3. Model generalization ability for different weight combinations.

Figure 5. The prediction outputs for the test set: (a) raw data and normalized data; (b) feature-weighted and feature-selected datasets.

Table 3. The optimal combination of weights for the synthetic datasets.

Table 4. Performance comparison of SVR modeling for the synthetic datasets.

Table 5. The optimal combination of weights for the UCI datasets.

Table 7. Wilcoxon signed rank test for the prediction results. Cell: p value.