Adaptive L0 Regularization for Sparse Support Vector Regression

Abstract: In this work, we proposed a sparse version of the Support Vector Regression (SVR) algorithm that uses regularization to achieve sparsity in function estimation. To achieve this, we used an adaptive L0 penalty that has a ridge structure and, therefore, does not introduce additional computational complexity to the algorithm. In addition to this, we used an alternative approach based on a similar proposal in the Support Vector Machine (SVM) literature. Through numerical studies, we demonstrated the effectiveness of our proposals. We believe that this is the first time a sparse version of Support Vector Regression has been discussed in terms of variable selection (and not in terms of support vector selection).


Introduction
The Support Vector Machine (SVM), which was introduced by Cortes and Vapnik (1995) [1], is probably the most popular classification algorithm in the literature. A very popular variant of SVM is Support Vector Regression (SVR), which is used for function estimation, for example, in a regression setting, and can be used as an alternative to Ordinary Least Squares (OLS) estimation. The main advantage of SVR is the use of a piecewise linear loss function, which diminishes the effect of outliers on function estimation (compared to their effect on OLS estimation), leading to a robust estimate of the regression function.
As SVR was introduced alongside SVM, their developments are similar; that is, many of the extensions that were introduced in the classification framework for SVM have also been discussed in the function estimation framework for SVR. Although sparsity in terms of variable regularization has been discussed extensively in the literature for SVM, there seems to be a lack of literature for sparse SVR in this context. There are some suggestions of different two-step procedures where variable selection occurs before the application of SVR and, therefore, only a reduced number of variables is used to fit the model (see, for example, Wang et al. (2006) [2]), but we have not come across literature that demonstrates the advantages of using regularization in SVR. Here, it is important to note that in the SVR literature we came across a different type of sparsity, where researchers refer to the sparsity of the support vectors, i.e., the number of points that affect the construction of the optimal separating hyperplane (see, for example, Ertin and Potter (2005) [3]), rather than the regularization of the variables.
Sparsity, in terms of variable selection, is very common in the statistics literature. It is usually introduced through penalization or regularization and, although there are a number of methods, we briefly discuss the most well-known ones. L2 (or ridge) regularization is computationally fast but does not really achieve sparsity. There is also L0 regularization, as well as LASSO (Tibshirani (1996) [4]) or L1 regularization, which are computationally more expensive but more effective than L2 in achieving sparsity. Furthermore, there are also combinations and extensions of these classic regularization methods, such as the elastic net, which combines L1 and L2 regularization (see Zou and Hastie (2005) [5]), and SCAD (see Fan and Li (2001) [6]). This is just a very small sample of the methodology in the literature. Other algorithms also exist but, in all cases, there is a trade-off between accuracy and computational cost.
In this work, we combined the classic SVR algorithm with a relatively recently introduced penalization procedure known as the adaptive L0 penalty (Frommlet and Nuel (2016) [7]). This penalty has the advantage of achieving similar regularization but without the computational complexity of the classic L0 penalty. In the next section, we present an overview of the fundamental concepts/elements used in our methodology, that is, we discuss the SVR and the adaptive L0 penalty. In Section 3, we discuss the mathematical development of the adaptive L0 penalized SVR algorithm and we also provide an alternative approach based on a similar penalty introduced in the SVM literature. In Section 4, we present the results of the numerical experiments and we close the paper with a discussion.

Fundamental Concepts
In this section, we discuss the general ideas behind SVR and the adaptive L0 penalty, which are the major elements we used in our algorithm.

SVR
As we mentioned above, SVR is a variant of SVM that is used for function estimation. Before we start the discussion on SVR, we emphasize that we have changed the classic SVR notation in this paper to align it with the fact that the optimal vector estimates the regression coefficients. Therefore, instead of denoting the normal vector with w, which is the usual notation in the literature, we denote it with β. The main idea is that a bilinear function is used, which provides an area where all the residuals in a regression framework will lie. Mathematically, this means that if y is the output and f(x) is the function we are trying to estimate, then one defines the residuals r(x, y) = y − f(x) and tries to find the function f(x) such that all the absolute residuals are less than some ε > 0, that is, |r(x, y)| ≤ ε. To help visualize this, it means that there is an area of radius ε around f(x) where all the outputs y lie. In the SVR literature, ε is also called the margin. As one can change the value of the margin, there are many functions f(x) that can satisfy the condition |r(x, y)| ≤ ε. At the same time, our purpose in SVR is to find the f(x) that has the maximum generalization ability. This is achieved by minimizing ‖β‖², where β is the coefficient vector satisfying f(x) = β^T x + b_0 and b_0 is the offset. One can see f(x) as the equation of the hyperplane that passes through the middle of the area that contains all of the outputs y.
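To make the role of ε concrete, the following is a minimal sketch (in Python, purely illustrative and not part of the original text) of the ε-insensitive loss that underlies SVR: residuals inside the margin cost nothing, and residuals outside it are penalized linearly.

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps=0.1):
    """epsilon-insensitive loss: zero if |y - f(x)| <= eps, linear beyond that."""
    r = y - f_x                                  # residuals r(x, y) = y - f(x)
    return np.maximum(0.0, np.abs(r) - eps)

# Residuals inside the margin contribute nothing to the loss.
y = np.array([1.0, 2.0, 3.0])
f_x = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y, f_x, eps=0.1))     # [0.  0.4 0.9]
```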
Let us assume we have n pairs of datapoints (x_i, y_i), where i = 1, . . ., n. In an ideal world, where all the points lie within distance ε of f(x), we have the hard-margin optimization, which is:

min_{β, b_0} (1/2)‖β‖²
subject to: |y_i − β^T x_i − b_0| ≤ ε, i = 1, . . ., n.

In reality, though, the optimal f(x) might be one that has some of its residuals bigger than ε or, in other words, one that allows for points to lie beyond the margin. In that case, one needs to solve the soft-margin optimization as follows:

min_{β, b_0, ξ, ξ*} (1/2)‖β‖² + λ Σ_{i=1}^{n} (ξ_i + ξ_i*)
subject to: y_i − β^T x_i − b_0 ≤ ε + ξ_i,  β^T x_i + b_0 − y_i ≤ ε + ξ_i*,  ξ_i, ξ_i* ≥ 0,   (1)

where λ is the margin parameter, i.e., a value that assigns a trade-off between having a large margin and a large additive distance of all the possible points outside the margin, and ξ_i and ξ_i* are slack variables, which measure how far outside the margin a point is or, mathematically, ξ_i = max{0, r(x, y) − ε} and ξ_i* = max{0, −r(x, y) − ε}. To solve the above soft-margin optimization, one uses the Lagrangian approach, then finds the KKT equations, and finally uses quadratic programming optimization. This procedure is standard in the SVM and SVR literature. First, one needs to find the Lagrangian, which we express below in matrix form to simplify the notation:

L = (1/2)‖β‖² + λ 1_n^T(ξ + ξ*) − α^T(ε_n + ξ − y + Xβ + b_n) − (α*)^T(ε_n + ξ* + y − Xβ − b_n) − η^T ξ − (η*)^T ξ*,   (2)

where α = (α_1, . . ., α_n)^T and α* = (α_1*, . . ., α_n*)^T are the Lagrangian multipliers of the two margin constraints, η and η* are the multipliers of the nonnegativity constraints on the slack variables, y = (y_1, . . ., y_n)^T denotes the response vector, X = (x_1^T, . . ., x_n^T)^T is the n × p predictor matrix where each row denotes the predictor vector of one observation, 1_n ∈ R^n is the n-dimensional vector with all entries equal to one, b_n ∈ R^n is an n-dimensional vector with all entries equal to b_0, and ε_n ∈ R^n is an n-dimensional vector with all entries equal to ε.
To find the solution to the Lagrangian in (2), one needs to take the partial derivatives with respect to β, b_0, ξ, and ξ* and set them equal to zero. Therefore, we have:

∂L/∂β = β − X^T(α − α*) = 0,
∂L/∂b_0 = 1_n^T(α − α*) = 0,
∂L/∂ξ = λ1_n − α − η = 0,
∂L/∂ξ* = λ1_n − α* − η* = 0.   (3)

Substituting the equations of the four derivatives in (3) into the Lagrangian in (2) gives the following dual optimization problem where one tries to maximize:

−(1/2)(α − α*)^T X X^T(α − α*) + y^T(α − α*) − ε 1_n^T(α + α*)   (4)

subject to the following conditions:

1_n^T(α − α*) = 0,  0 ≤ α_i ≤ λ,  0 ≤ α_i* ≤ λ,  i = 1, . . ., n.

Note that the optimization problem in (4) finds the values for (α − α*) and, therefore, this allows us to estimate β based on the solution of the equation of the first partial derivative of the Lagrangian with respect to β, that is, β = X^T(α − α*). Here, we emphasize that there are a number of variants of SVM that can be extended in the SVR framework and their solution is found using a similar procedure to the above.
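As an illustration of the quadratic programming step, the following is a minimal sketch (ours, not from the paper) that solves the dual in (4) for a linear kernel with the generic QP solver from cvxopt; the function name svr_dual_qp, the choice of solver, and the small ridge added to the quadratic term for numerical stability are all assumptions made for this example.

```python
import numpy as np
from cvxopt import matrix, solvers

def svr_dual_qp(X, y, lam=1.0, eps=0.1):
    """Solve the SVR dual (4) over z = (alpha, alpha*) with a generic QP solver."""
    n = X.shape[0]
    K = X @ X.T                                        # linear-kernel Gram matrix X X^T
    P = np.block([[K, -K], [-K, K]])                   # quadratic term of the dual
    P += 1e-8 * np.eye(2 * n)                          # tiny ridge for numerical stability
    q = np.concatenate([eps - y, eps + y])             # linear term (minimization form)
    G = np.vstack([-np.eye(2 * n), np.eye(2 * n)])     # encodes 0 <= alpha_i, alpha_i* <= lam
    h = np.concatenate([np.zeros(2 * n), lam * np.ones(2 * n)])
    A = np.hstack([np.ones(n), -np.ones(n)]).reshape(1, -1)  # sum(alpha - alpha*) = 0
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(0.0))
    z = np.array(sol['x']).ravel()
    alpha, alpha_star = z[:n], z[n:]
    beta = X.T @ (alpha - alpha_star)                  # beta = X^T (alpha - alpha*) from (3)
    return beta, alpha, alpha_star
```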

L0 Penalty
In this section, we introduce the adaptive L0 penalty, which was proposed by Frommlet and Nuel (2016) [7] for the linear regression model. The main idea of the adaptive L0 penalty is to find a differentiable approach to approximate the L0 penalty. This gives us a penalty that achieves regularization, that is, reduces the coefficients of the irrelevant variables to zero (a very well-known property of the L0 penalty), without the computational complexity the L0 penalty introduces. It is also called a ridge-type L0 penalty.
First, let us discuss the L0 penalty. The L0 penalty is the most explicit penalty one can use, in the sense that it penalizes the number of nonzero coefficients among the variables used in the function estimation. Therefore, in this case, it takes the form

λ‖β‖_0 = λ Σ_{j=1}^{p} 1(β_j ≠ 0),

where 1(·) denotes the indicator function. Frommlet and Nuel (2016) [7] suggest roughly (we only roughly state the suggestion by Frommlet and Nuel (2016) [7] and we will discuss the details in later sections) replacing the penalty with the function β^T Λ β, where Λ is a p × p dimensional diagonal matrix and the (i, i)th element is Λ_(i,i) = 1/β_i². One can see that for each j, when β_j = 0 then β_j Λ_(j,j) β_j = 0 and when β_j ≠ 0, then β_j Λ_(j,j) β_j = 1. Putting these together, one can show that

β^T Λ β = Σ_{j=1}^{p} 1(β_j ≠ 0) = ‖β‖_0.

The idea here is to use the above development to obtain a sparse β vector after iteratively applying the penalization. Therefore, for the adaptive L0 penalty, we try to minimize:

F_{λ,w}(β) = f(β) + λ Σ_{i=1}^{p} w_i β_i²,   (5)

where λ > 0 is a constant that forces more sparsity as it becomes larger, f(·) is the main objective function we are trying to minimize to obtain the coefficients β (for example, in Ordinary Least Squares regression, this is ‖y − Xβ‖²), and w_i is the ith weight. Frommlet and Nuel (2016) [7] suggest using the weight:

w_i = 1 / (|β_i|^q + δ^γ),

where γ > 0 is a constant used to define the quality of the approximation of the function F_{λ,w} in (5) and, to maintain the approximation, we set γ = 2; δ > 0 is a constant that calibrates the size that is considered significant (and makes sure that we do not divide by zero) and it is set to δ = 10⁻⁵; and q = 2. Here, we note that, for example, q = 1 would have given an approximated LASSO penalty. Finally, we note that we need an iterative procedure: we use the estimate β_i^{(k−1)} to compute the weight w_i^{(k−1)}, which we then use as a weight to estimate β_i^{(k)}, where (k) denotes the kth iteration, and we repeat the process until convergence. As a starting point, we set all the weights equal to one, which is equivalent to running the classic SVR algorithm without regularization.
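A small numerical illustration (our own, in Python, using the weight formula as stated above) of how the weighted ridge term w_i β_i² behaves like an indicator of a nonzero coefficient when q = 2 and like |β_i| when q = 1:

```python
import numpy as np

def adaptive_weights(beta, q=2, gamma=2, delta=1e-5):
    """Adaptive (ridge-type) weights w_i = 1 / (|beta_i|^q + delta^gamma)."""
    return 1.0 / (np.abs(beta) ** q + delta ** gamma)

beta = np.array([0.0, 0.01, 1.0, -3.0])
w = adaptive_weights(beta, q=2)
print(w * beta ** 2)   # ~[0, 1, 1, 1]: counts the nonzero coefficients (L0-like)
w = adaptive_weights(beta, q=1)
print(w * beta ** 2)   # ~[0, 0.01, 1, 3]: behaves like |beta_i| (LASSO-like)
```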

Adaptive L0 Support Vector Regression
In this section, we discuss the main method we introduced in this paper, where we try to obtain a sparse function estimate in SVR by applying the adaptive L0 penalty. This will change the Lagrangian we presented before in Equation (2). In this section, we could have added a subscript L0 to differentiate the optimization problem from the one presented for SVR but, to keep things simple (as the development is relatively clear), we omitted it throughout this section.
The main idea comes from combining the adaptive L0 penalty of Frommlet and Nuel (2016) [7] with the optimization of SVR. Therefore, the new optimization problem takes the following form, expressed in matrix form:

min_{β, b_0, ξ, ξ*} (1/2) β^T Λ β + λ 1_n^T(ξ + ξ*)
subject to: y − Xβ − b_n ≤ ε_n + ξ,  Xβ + b_n − y ≤ ε_n + ξ*,  ξ, ξ* ≥ 0,   (6)

where Λ is the diagonal weight matrix with entries given by the adaptive weights of Section 2.2. Similarly to the SVM and SVR literature, we have the following Lagrangian:

L_0 = (1/2) β^T Λ β + λ 1_n^T(ξ + ξ*) − α^T(ε_n + ξ − y + Xβ + b_n) − (α*)^T(ε_n + ξ* + y − Xβ − b_n) − η^T ξ − (η*)^T ξ*   (7)

and, as in SVR, we take the derivatives to obtain the KKT equations in this problem:

∂L_0/∂β = Λβ − X^T(α − α*) = 0,
∂L_0/∂b_0 = 1_n^T(α − α*) = 0,
∂L_0/∂ξ = λ1_n − α − η = 0,
∂L_0/∂ξ* = λ1_n − α* − η* = 0.   (8)

Substituting the equations of the four derivatives in (8) into the Lagrangian L_0 in (7), one obtains the following dual optimization problem:

max_{α, α*} −(1/2)(α − α*)^T X Λ^{-1} X^T(α − α*) + y^T(α − α*) − ε 1_n^T(α + α*)   (9)

subject to the following conditions:

1_n^T(α − α*) = 0,  0 ≤ α_i ≤ λ,  0 ≤ α_i* ≤ λ,  i = 1, . . ., n.

The solution of the optimization problem will give us the values of the vector (α − α*), which we will put into the equation for β in the first derivative in (8), that is, β = Λ^{-1} X^T(α − α*).

Estimation Procedure
In this section, we discuss the estimation procedure. There are a number of issues that need to be addressed. We will discuss them one by one and, at the end of the section, we will give a detailed outline of the algorithm.
The first thing we need to discuss is adjusting the algorithm to demonstrate how existing packages can be used. Although our objective function looks similar to the objective function for SVR, there is a very important difference since the first term includes the weight matrix Λ. Packages in R that run SVR, such as e1071 (Meyer and Wien (2015) [8]) and kernlab (Karatzoglou et al. (2023) [9]), are able to solve the classic SVR optimization problem given in Section 2.1 rather than the new optimization problem with the adaptive L0 penalty that includes Λ and was presented at the beginning of Section 3. By setting β̃ = Λ^{1/2} β and X̃ = X Λ^{-1/2}, one can rewrite the optimization in (6) as follows:

min_{β̃, b_0, ξ, ξ*} (1/2) β̃^T β̃ + λ 1_n^T(ξ + ξ*)
subject to: y − X̃β̃ − b_n ≤ ε_n + ξ,  X̃β̃ + b_n − y ≤ ε_n + ξ*,  ξ, ξ* ≥ 0,   (10)

which is exactly the same as the objective function for SVR presented in (1), where β and X have been replaced with β̃ and X̃, respectively. This modification allows us to use existing software to solve the optimization to obtain β̃ by using X̃ as the predictors and then using β = Λ^{-1/2} β̃ to find β.
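A quick numerical check (ours) of this reparameterization: with β̃ = Λ^{1/2}β and X̃ = XΛ^{-1/2}, both the penalty term and the fitted values are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
w = rng.uniform(0.5, 2.0, size=p)     # positive adaptive weights (diagonal of Lambda)

beta_t = np.sqrt(w) * beta            # beta_tilde = Lambda^{1/2} beta
X_t = X / np.sqrt(w)                  # X_tilde = X Lambda^{-1/2}

print(np.allclose(beta @ (w * beta), beta_t @ beta_t))  # beta^T Lambda beta == beta_tilde^T beta_tilde
print(np.allclose(X @ beta, X_t @ beta_t))              # X beta == X_tilde beta_tilde
```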
The second point we need to address is the fact that the procedure is an iterative process. By definition, Λ has entries that depend on the entries of the vector β, which is essentially the vector we are trying to estimate. Therefore, one approach to alleviate this technicality is to set all initial weights to one, which is equivalent to running the standard SVR algorithm without the adaptive L0 penalty. Then, based on this estimator, we construct an initial estimate Λ^{(0)}, which we use to estimate β^{(1)}. We need to set a stopping rule that stops when ‖β^{(t)} − β^{(t−1)}‖ ≤ κ, where κ in our simulations is usually set to something smaller than 10⁻³.
This also leads to another modification of the optimization problem. To state the optimization problem accurately, we need to indicate the iterative nature of this procedure by using superscripts β̃^{(t)} and X̃^{(t)}. Therefore, at iteration t, the optimization takes the following form:

min (1/2) (β̃^{(t)})^T β̃^{(t)} + λ 1_n^T(ξ + ξ*)
subject to: y − X̃^{(t−1)} β̃^{(t)} − b_n ≤ ε_n + ξ,  X̃^{(t−1)} β̃^{(t)} + b_n − y ≤ ε_n + ξ*,  ξ, ξ* ≥ 0,

where we use Λ^{(t−1)} to construct X̃^{(t−1)} as X̃^{(t−1)} = X(Λ^{(t−1)})^{−1/2}. Then, we put these in the optimization problem in (10) to obtain the solution β̃^{(t)}. Based on β̃^{(t)}, we obtain β^{(t)} = (Λ^{(t−1)})^{−1/2} β̃^{(t)}, which we use to obtain an updated estimate Λ^{(t)}, and we use it to construct X̃^{(t)}, which is plugged into the optimization problem in (10) to obtain the solution at the next iteration. This becomes clearer in the following estimation procedure (a short code sketch is given after the steps):

• Step 1: Obtain an initial estimate β^{(0)} for the coefficients of the function using any method you like (one example is the OLS approach or the classic SVR approach).
• Step 2: At iteration t, use β^{(t−1)} to compute the weights, form Λ^{(t−1)}, and construct X̃^{(t−1)} = X(Λ^{(t−1)})^{−1/2}.
• Step 3: Solve the SVR optimization in (10) with X̃^{(t−1)} as the predictor matrix to obtain β̃^{(t)}, and set β^{(t)} = (Λ^{(t−1)})^{−1/2} β̃^{(t)}.
• Step 4: Repeat Steps 2 and 3 until ‖β^{(t)} − β^{(t−1)}‖ ≤ κ.
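To make the procedure concrete, here is a minimal sketch of the iterative scheme (ours, not the authors' implementation; the paper uses the R packages e1071 and kernlab). It uses scikit-learn's LinearSVR as the off-the-shelf linear SVR solver; the function name adaptive_l0_svr, the default parameter values, and the identification of the margin parameter λ with LinearSVR's C are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVR   # linear epsilon-insensitive SVR solver

def adaptive_l0_svr(X, y, lam=1.0, eps=0.1, q=2, gamma=2, delta=1e-5,
                    kappa=1e-3, max_iter=50):
    """Sketch of the iterative adaptive-L0 SVR procedure described above."""
    n, p = X.shape
    w = np.ones(p)                        # Step 1: initial weights of one = plain SVR
    beta = np.zeros(p)
    for _ in range(max_iter):
        lam_inv_sqrt = 1.0 / np.sqrt(w)   # diagonal of Lambda^{-1/2}
        X_tilde = X * lam_inv_sqrt        # X_tilde = X Lambda^{-1/2}
        svr = LinearSVR(C=lam, epsilon=eps, fit_intercept=True, max_iter=10000)
        svr.fit(X_tilde, y)               # solve the SVR problem (10) in tilde variables
        beta_new = lam_inv_sqrt * svr.coef_   # beta = Lambda^{-1/2} beta_tilde
        if np.linalg.norm(beta_new - beta) <= kappa:
            beta = beta_new
            break                         # Step 4: stop when successive estimates agree
        beta = beta_new
        w = 1.0 / (np.abs(beta) ** q + delta ** gamma)  # Step 2: adaptive L0 weights
    return beta
```

Reusing an off-the-shelf SVR solver here is exactly the point of the reparameterization in (10): the adaptive penalty is absorbed into the rescaled predictors, so no new optimizer is required.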

Alternative Approach to Adaptive L0 SVR
In this section, we discuss an alternative approach to adaptive L0 SVR. A similar approach to the one proposed in this manuscript has been presented in the SVM literature, although it has not been discussed in the SVR literature. In Li et al. (2015) [10], the authors proposed what they call a sparse Least Squares SVM using L0 in the primal space. We introduce this approach into the SVR method in a similar way and we call it the Alternative Adaptive L0 (AAL0) penalty. The starting point for this is the optimization problem:

min_{β, b_0, ξ, ξ*} (1/2) β^T β + (c/2) β^T Λ β + λ 1_n^T(ξ + ξ*)
subject to: y − Xβ − b_n ≤ ε_n + ξ,  Xβ + b_n − y ≤ ε_n + ξ*,  ξ, ξ* ≥ 0,

where, if one compares it to the method introduced in the previous section, it is clear that there is an extra term (1/2)β^T β and it also introduces a constant c, which gives a different weight to the second term and is an additional parameter that needs to be tuned. If one combines the first two terms in the optimization above, then we have a similar problem as in (6), which is:

min_{β, b_0, ξ, ξ*} (1/2) β^T Λ̃ β + λ 1_n^T(ξ + ξ*)
subject to: y − Xβ − b_n ≤ ε_n + ξ,  Xβ + b_n − y ≤ ε_n + ξ*,  ξ, ξ* ≥ 0,

where Λ̃ = I + cΛ and I is the identity matrix. This last optimization is equivalent to (6), where Λ is replaced with Λ̃. The development of the dual optimization solution is, therefore, very similar and we omit it here.
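In the sketch given in the previous section, this alternative penalty amounts to a one-line change of the weights (again, our own illustration; the helper name is hypothetical):

```python
import numpy as np

def aal0_weights(w, c=1.0):
    """SVR-AAL0 weights: Lambda_tilde = I + c * Lambda, i.e., each weight becomes 1 + c * w_i."""
    return 1.0 + c * np.asarray(w)
```

Plugging these modified weights into the same iterative scheme gives the SVR-AAL0 variant; c is the extra parameter that has to be tuned (the experiments below use c equal to 0.1, 1, and 10).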

Numerical Experiments
In this section, we demonstrate the performance of our method in a variety of numerical experiments. We first discuss a number of simulated models and then a number of real datasets.

Simulated Data
First, we ran a few experiments with simulated data to see how well our algorithm performs. We chose settings similar to the ones Frommlet and Nuel (2016) [7] used in their work for the linear model. To demonstrate how well the methods work, we used two metrics: the percentage of actual predictors with nonzero coefficients that are recovered by the algorithm (one can see them as true positives) and the percentage of the zero coefficients to which the algorithm correctly assigns a zero coefficient (one can see them as true negatives). Therefore, the closer these percentages are to one, the better the model is at recovering the true solution.
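For clarity, these two recovery rates can be computed as follows (our own helper; the names are illustrative):

```python
import numpy as np

def recovery_rates(beta_true, beta_hat, tol=1e-8):
    """True-positive and true-negative recovery rates of a sparse coefficient estimate."""
    nonzero = np.abs(beta_true) > tol                    # truly nonzero coefficients
    zero = ~nonzero                                      # truly zero coefficients
    tp = np.mean(np.abs(beta_hat[nonzero]) > tol) if nonzero.any() else np.nan
    tn = np.mean(np.abs(beta_hat[zero]) <= tol) if zero.any() else np.nan
    return tp, tn
```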
As we can see in Table 1, all three methods produce very good estimates. The three methods seem to have similar performances when identifying the true positives, that is, the coefficients that are different from zero. There seem to be different results when we try to find the true negatives, that is, the coefficients that are actually zero. The SVR-AL0 seems to perform well for Model 1 and Model 2 when there is no correlation between the predictors (ρ = 0), but not as well as the OLS-AL0 when there is correlation. In Model 3, this is reversed and the SVR-AL0 seems to work better than the OLS-AL0 in recovering the true negatives.

Real Data
Now, we discuss a number of real datasets from the UC Irvine Machine Learning Repository (Dua and Graff, 2019) [11].
To perform the comparison, we first ran the classic SVR method to obtain the regression function estimate produced by the non-regularized procedure. Then, we ran the regularized SVR methods, that is, both SVR-AL0 and SVR-AAL0. In the SVR-AAL0 algorithm, the scalar c is set to take three different values, namely 0.1, 1, and 10. Where we present only one result, the three different values produced the same answer.
We ran the algorithms on three datasets and, here, we present the results. The datasets are the Concrete Compression Strength (I-Cheng, 1998) [12], the power plant (Tüfeksi, 2014) [13], and the wine quality (Cortez et al., 2009) [14] datasets. In the Concrete Compression Strength data (Table 2), we see that the classic SVR algorithm produces a vector of nonzero coefficients, which means all the variables are treated as important, while both sparse algorithms push five of the eight coefficients to zero. Furthermore, to demonstrate the effect of the value of the scalar c in SVR-AAL0 on the results, we show that if we change c and set it equal to 0.1, two of the five zero coefficients become nonzero (although very close to zero). Similarly,

Discussion
In this work, we proposed what is possibly the first application of regularization to SVR to obtain sparse function estimation in statistical procedures, such as regression. We combined the classic SVR algorithm with a newly proposed adaptive L0 penalty. We also proposed an alternative approach to the adaptive L0 regularization of SVR, which led to a slightly different optimization and solution. We demonstrated in our numerical experiments that, although both of the methods we proposed performed very well, the alternative approach depends on the value of the parameter c, which needs to be tuned.
The adaptive L0 procedure applied to SVR in this paper has been applied to linear regression by Frommlet and Nuel (2016) [7], and variations of this idea have been implemented in the SVM literature, see, for example, Li et al. (2015) [10] and Huang et al. (2008) [15]. It has also been used by Smallman et al. (2020) [16] for sparse feature extraction in principal component analysis (PCA) for Poisson-distributed data, which is useful in dimension reduction for text data. It is a very appealing approach, as it is computationally efficient (although it needs to be applied iteratively), obtains an accurate solution, and is also efficient in providing sparse solutions. It has the potential to be applied to different settings and we are actively working in this direction. In addition, this methodology depends on a number of parameters that need to be tuned, and an interesting direction we have not explored is whether different settings will require different optimal parameters.

Table 1 .
Comparing performance of true recovery of nonzero (true positives) and zero (true negatives) coefficients for the three methods.

Table 4 .
Comparing coefficients of the different methods in the power plant dataset.