Stochastic Subgradient for Large-Scale Support Vector Machine Using the Generalized Pinball Loss Function

Abstract: In this paper, we propose a stochastic subgradient algorithm, called the stochastic gradient descent method-based generalized pinball support vector machine (SG-GPSVM), to solve data classification problems. The approach is developed by replacing the hinge loss function in the conventional support vector machine (SVM) with a generalized pinball loss function. We show that SG-GPSVM is convergent and that it approximates the conventional generalized pinball support vector machine (GPSVM). Further, the symmetric kernel method is adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier. Numerical experiments on artificial datasets and datasets from the UCI repository with noises of different variances show that our algorithm surpasses existing methods in terms of noise insensitivity, resampling stability, and accuracy in large-scale data scenarios.


Introduction
Support vector machine (SVM) is a popular supervised binary classification algorithm based on statistical learning theory. Initially proposed by Vapnik [1-3], it has been attracting increasing attention, and many algorithmic and modeling variations of it have been developed. It is a powerful pattern classification tool that has found applications in various fields in recent years, including face detection [4,5], text categorization [6], electroencephalogram signal classification [7], financial regression [8-11], image retrieval [12], remote sensing [13], and feature extraction [14].
The concept of the conventional SVM is to find an optimal hyperplane that achieves the greatest separation between the different categories of observations. The conventional SVM uses the well-known hinge loss function; however, the hinge loss is sensitive to noise, especially noise around the decision boundary, and is unstable under resampling [15]. As a result, many researchers have proposed new SVM methods by changing the loss function. The SVM model with the pinball loss function (Pin-SVM) was proposed by Huang et al. [16] to treat noise sensitivity and instability in resampling. The resulting classifier is less sensitive to noise and is related to the quantile distance. On the other hand, Pin-SVM cannot achieve sparsity. To achieve sparsity, a modified ε-insensitive zone for Pin-SVM was proposed. This method does not consider the patterns that lie inside the insensitive zone while building the classifier, and its formulation requires the value of ε to be specified beforehand; therefore, a bad choice may affect its performance. Motivated by these developments, Rastogi et al. [17] recently proposed the modified (ε₁, ε₂)-insensitive zone support vector machine. This method is an extension of existing loss functions that accounts for noise sensitivity and resampling stability.
However, practical problems require processing large-scale datasets, while the existing solvers are not computationally efficient. Since the generalized pinball loss SVM (GPSVM) still needs to solve a large quadratic programming problem and its dual problem involves as many variables as there are training samples, training becomes prohibitively expensive for large-scale data. This motivates us to solve the GPSVM problem with a stochastic subgradient method, which we call SG-GPSVM; the resulting algorithm processes only a small minibatch of samples per iteration and therefore scales to large datasets.

Related Work and Background
The purpose of this section is to review related methods for binary classification problems and to fix the notation and definitions used throughout the paper. Consider a binary classification problem with m data points in the n-dimensional Euclidean space R^n. We denote the set of training samples by X ∈ R^{n×m}, where each column x ∈ R^n is a sample with a label y ∈ {1, −1}. Below, we give a brief outline of several related methods.

Support Vector Machine
The SVM model consists of maximizing the distance between the two bounding hyperplanes that bound the classes, so the SVM model is generally formulated as a convex quadratic programming problem. Let ‖·‖ denote the Euclidean norm, or two-norm, of a vector in R^n. Given a training set S = {(x_i, y_i) ∈ R^n × {1, −1} : i = 1, 2, 3, . . . , m}, the strategy of SVM is to find the maximum-margin separating hyperplane w^⊤x + b = 0 between the two classes by solving the following problem:

\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1,\dots,m, \tag{1}

where C > 0 is a penalty parameter and the ξ_i are slack variables. By introducing the Lagrangian multipliers α_i, we derive its dual QPP as follows:

\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^{m}\alpha_i y_i = 0,\ 0 \le \alpha_i \le C. \tag{2}

After optimizing this dual QPP, we obtain the following decision function:

f(x) = \operatorname{sign}\Big(\sum_{i=1}^{N_{SV}}\alpha_i^* y_i x_i^\top x + b^*\Big), \tag{3}

where α* is the solution of the dual problem (Equation (2)) and N_SV represents the number of support vectors, i.e., the samples satisfying 0 < α_i* < C. In fact, the SVM problem (Equation (1)) can be rewritten as the following unconstrained optimization problem [30]:

\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} L_{\text{hinge}}\big(1 - y_i(w^\top x_i + b)\big), \tag{4}

where L_hinge(v) = max{0, v} is the so-called hinge loss function. This loss is related to the shortest distance between the sets, and the corresponding classifier is therefore sensitive to noise and unstable in resampling [31]. To address the noise sensitivity, Huang et al. [16] proposed using the pinball loss function instead of the hinge loss function in the SVM classifier (Pin-SVM). This model also penalizes correctly classified samples, which is evident from the pinball loss function, defined as follows:

L_\tau(v) = \begin{cases} v, & v \ge 0,\\ -\tau v, & v < 0, \end{cases} \tag{5}

where v = 1 − y_i(w^⊤x_i + b) and τ ≥ 0. Although the pinball loss function achieves noise insensitivity, it cannot achieve sparsity, because the pinball loss function's subgradient is nonzero almost everywhere. Therefore, to achieve sparsity, in the same publication, Huang et al. [16] proposed the ε-insensitive pinball loss function, which is insensitive to noise and stable in resampling. The ε-insensitive pinball loss function is defined by:

L_\tau^{\varepsilon}(v) = \begin{cases} v - \varepsilon, & v > \varepsilon,\\ 0, & -\varepsilon/\tau \le v \le \varepsilon,\\ -\tau(v + \varepsilon/\tau), & v < -\varepsilon/\tau, \end{cases} \tag{6}

where τ ≥ 0 and ε ≥ 0 are user-defined parameters. On the other hand, the ε-insensitive pinball loss function necessitates the selection of an ideal parameter ε beforehand. Rastogi et al. [17] therefore introduced the (ε₁, ε₂)-insensitive zone pinball loss function, also known as the generalized pinball loss function, defined by:

L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}(v) = \max\left\{\tau_1\left(v - \frac{\varepsilon_1}{\tau_1}\right),\ 0,\ -\tau_2\left(v + \frac{\varepsilon_2}{\tau_2}\right)\right\}, \tag{7}

where τ₁, τ₂, ε₁, ε₂ ≥ 0 are user-defined parameters. After employing the generalized pinball loss function instead of the hinge loss function in problem (1), we obtain the following optimization problem:

\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}\big(1 - y_i(w^\top x_i + b)\big). \tag{8}

By introducing the Lagrangian multipliers α_i, β_i, we derive its dual QPP (Equation (9); see [17] for the explicit form). A new data point x ∈ R^n is then classified as positive or negative according to the final decision function (Equation (10)), where α* and β* are the solution to the dual problem (Equation (9)). In the next section, we go over the core of our proposed technique and present its pseudocode.
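To make the loss functions above concrete, the following is a minimal NumPy sketch of the generalized pinball loss in Equation (7), written as a maximum of three affine pieces using the identity τ₁(v − ε₁/τ₁) = τ₁v − ε₁ to avoid divisions; the function and parameter names are ours, not the paper's:

    import numpy as np

    def generalized_pinball_loss(v, tau1=1.0, tau2=0.5, eps1=0.5, eps2=0.5):
        """Generalized pinball loss of Equation (7), evaluated elementwise.

        Special cases: (tau1, tau2, eps1, eps2) = (1, 0, 0, 0) gives the hinge
        loss, (1, tau, 0, 0) gives the pinball loss, and (1, tau, eps, eps)
        gives the eps-insensitive pinball loss.
        """
        v = np.asarray(v, dtype=float)
        upper = tau1 * v - eps1    # tau1*(v - eps1/tau1): penalty for large positive v
        lower = -tau2 * v - eps2   # -tau2*(v + eps2/tau2): penalty for large negative v
        return np.maximum.reduce([upper, np.zeros_like(v), lower])

Writing the loss as a maximum of affine functions makes its convexity immediate and, as used in the next section, makes a subgradient available at every point by differentiating whichever piece is active.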

Proposed Stochastic Subgradient Generalized Pinball Support Vector Machine
In this section, we apply the stochastic subgradient approach to our SG-GPSVM formulation, which is based on the generalized pinball loss function. SG-GPSVM can be used in both the linear and the nonlinear case.

Linear Case
Following the method of formulating SVM problems (discussed in Equation (4)), we incorporate the generalized pinball loss function (Equation (7)) into the objective function to obtain the convex unconstrained minimization problem:

\min_{u}\ f(u) = \frac{1}{2}\|u\|^2 + C\sum_{i=1}^{m} L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}\big(1 - y_i u^\top z_i\big), \tag{11}

where u = (w^⊤, b)^⊤ and z = (x^⊤, 1)^⊤; w ∈ R^n and b ∈ R are the weight vector and bias, respectively; and C > 0 is the penalty parameter. To apply the stochastic subgradient approach, at each iteration t, we propose a more general method that uses k samples, where 1 ≤ k ≤ m. We choose a subset A_t ⊆ [m] with |A_t| = k, where the k samples are drawn uniformly at random from the training set and [m] = {1, 2, 3, . . . , m}. Let us consider a model of stochastic optimization problems. Let u_t denote the current hyperplane iterate; we obtain the approximate objective function:

f(u; A_t) = \frac{1}{2}\|u\|^2 + \frac{Cm}{k}\sum_{i \in A_t} L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}\big(1 - y_i u^\top z_i\big). \tag{12}

When k = m, the approximate objective at each iteration t coincides with the original objective of problem (11). Let ∇_t be a subgradient of f(u; A_t) associated with the minibatch index set A_t at the point u_t, that is:

\nabla_t \in \partial f(u_t; A_t) = u_t + \frac{Cm}{k}\sum_{i \in A_t} \partial L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}\big(1 - y_i u_t^\top z_i\big)(-y_i z_i), \tag{13}

where the subdifferential of the generalized pinball loss is:

\partial L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}(v) = \begin{cases} \{\tau_1\}, & v > \varepsilon_1/\tau_1,\\ [0, \tau_1], & v = \varepsilon_1/\tau_1,\\ \{0\}, & -\varepsilon_2/\tau_2 < v < \varepsilon_1/\tau_1,\\ [-\tau_2, 0], & v = -\varepsilon_2/\tau_2,\\ \{-\tau_2\}, & v < -\varepsilon_2/\tau_2. \end{cases} \tag{14}

With the above notation and choosing ρ_i ∈ ∂L_{τ₁,τ₂}^{ε₁,ε₂}(1 − y_i u_t^⊤z_i), ∇_t can be written as:

\nabla_t = u_t + \frac{Cm}{k}\sum_{i \in A_t} \rho_i(-y_i z_i). \tag{15}

Then, u_{t+1} = u_t − η_t ∇_t is updated using the step size η_t = 1/t. Additionally, a new sample z is predicted by:

f(z) = \operatorname{sign}(u^\top z). \tag{16}

The above steps can be outlined as Algorithm 1, sketched in code below.
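The following is a compact Python sketch of the linear SG-GPSVM iteration under our reading of Equations (11)-(16); in particular, the Cm/k minibatch scaling and the tie-breaking at the kinks of the loss are assumptions of this sketch rather than details fixed by the text:

    import numpy as np

    def sg_gpsvm_linear(X, y, C=1.0, tau1=1.0, tau2=0.5, eps1=0.5, eps2=0.5,
                        k=32, T=1000, seed=0):
        """Sketch of Algorithm 1: minibatch stochastic subgradient for Eq. (11)."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        Z = np.hstack([X, np.ones((m, 1))])   # z_i = (x_i, 1) so that u = (w, b)
        u = np.zeros(n + 1)
        for t in range(1, T + 1):
            A = rng.choice(m, size=min(k, m), replace=False)  # minibatch A_t
            v = 1.0 - y[A] * (Z[A] @ u)                       # v_i = 1 - y_i u^T z_i
            upper = tau1 * v - eps1
            lower = -tau2 * v - eps2
            # rho_i: a subgradient of the generalized pinball loss at v_i, Eq. (14)
            rho = np.where(upper >= np.maximum(lower, 0.0), tau1,
                           np.where(lower > 0.0, -tau2, 0.0))
            grad = u + (C * m / len(A)) * ((-rho * y[A]) @ Z[A])  # Eq. (15)
            u -= grad / t                                         # eta_t = 1/t
        return u[:n], u[n]                                        # w, b

A new point x is then classified as np.sign(w @ x + b), matching Equation (16).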

Nonlinear Case
The support vector machine has the advantage that it can be employed with symmetric kernels rather than requiring direct access to the feature vectors x; that is, instead of considering predictors that are linear functions of the training samples x themselves, one considers predictors that are linear functions of some implicit mapping φ(x) of the instances. In order to extend the linear SG-GPSVM to the nonlinear case by a symmetric kernel trick [32,33], symmetric-kernel-generated surfaces are considered instead of hyperplanes and are given by:

w^\top \varphi(x) + b = 0. \tag{17}

Then, the primal problem for the nonlinear SG-GPSVM is as follows:

\min_{u}\ f(u) = \frac{1}{2}\|u\|^2 + C\sum_{i=1}^{m} L_{\tau_1,\tau_2}^{\varepsilon_1,\varepsilon_2}\big(1 - y_i(w^\top\varphi(x_i) + b)\big), \tag{18}

where φ(x) is a nonlinear mapping function, which maps x into a higher-dimensional feature space. To apply the stochastic subgradient approach, at each iteration t, we again use a minibatch A_t of k samples, where 1 ≤ k ≤ m. Then, consider the subgradient of the corresponding approximate objective, and let ∇_t be the subgradient of f at u_t; with ρ_i ∈ ∂L_{τ₁,τ₂}^{ε₁,ε₂}(1 − y_i(w^⊤φ(x_i) + b)) as defined in Equation (14) and, similar to the linear case, ∇_t can be written as:

\nabla_t = u_t + \frac{Cm}{k}\sum_{i \in A_t} \rho_i\big(-y_i(\varphi(x_i)^\top, 1)^\top\big). \tag{19}

Then, u_{t+1} = u_t − η_t ∇_t is updated using the step size η_t = 1/t. Additionally, a new sample x can be predicted by:

f(x) = \operatorname{sign}\big(w^\top \varphi(x) + b\big). \tag{20}

The above steps can be outlined as Algorithm 2.
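In practice, the implicit map φ never needs to be formed. One simple way to realize the nonlinear case, in the spirit of the reduced-kernel idea used in the large-scale experiments [38], is to map each sample to its RBF kernel values against a small set of landmark points and then reuse the linear solver; the random landmark choice below is our illustrative assumption:

    import numpy as np

    def rbf_reduced_features(X, landmarks, gamma=0.1):
        """Map each row x of X to (K(x, l_1), ..., K(x, l_s)) over s landmarks,
        with the RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
        sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq_dists)

    # usage sketch with s = 100 landmarks, the reduced kernel size used later:
    # rng = np.random.default_rng(0)
    # landmarks = X[rng.choice(len(X), size=100, replace=False)]
    # w, b = sg_gpsvm_linear(rbf_reduced_features(X, landmarks), y)
    # y_new = np.sign(rbf_reduced_features(X_new, landmarks) @ w + b)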

Convergence Analysis
In this section, we analyze the convergence of the proposed SG-GPSVM model. For convenience, we only consider the optimization problem (Equation (11)); the conclusions for the nonlinear algorithm can be obtained similarly. Given an initial point u₁ and the step size η_t = 1/t, the iterates u_{t+1} for t ≥ 1 are updated by:

u_{t+1} = u_t - \eta_t \nabla_t, \tag{21}

i.e.,

u_{t+1} = \Big(1 - \frac{1}{t}\Big)u_t + \frac{1}{t}\,\frac{Cm}{k}\sum_{i \in A_t}\rho_i y_i z_i. \tag{22}

To prove the convergence of Algorithm 1, consider the boundedness of u_t first. In fact, we have the following lemma.

Lemma 1. Let {u_t} be the sequence generated by Algorithm 1, where ∇_t and u_{t+1} are given by Equations (15) and (22), respectively. Then {u_t} is bounded.
Proof of Lemma 1. Equation (22) can be rewritten as:

u_{t+1} = \Big(1 - \frac{1}{t}\Big)u_t + \frac{1}{t}G_t, \quad \text{where } G_t = \frac{Cm}{k}\sum_{i \in A_t}\rho_i y_i z_i. \tag{23}

Applying Equation (23) recursively from u₁, we have:

u_{t+1} = \sum_{i=1}^{t}\bar{G}_i, \quad \text{where } \bar{G}_i = \frac{1}{i}\,G_i\prod_{j=i+1}^{t}\Big(1 - \frac{1}{j}\Big), \tag{24}

where the term involving u₁ vanishes because its coefficient contains the factor (1 − 1/1) = 0, and the expansion holds for any i ≥ 2. Next, consider Ḡ_i. In Case I, for i < t, we have:

\prod_{j=i+1}^{t}\Big(1 - \frac{1}{j}\Big) = \prod_{j=i+1}^{t}\frac{j-1}{j} = \frac{i}{t}, \quad \text{so } \bar{G}_i = \frac{1}{t}G_i, \tag{25}

and in Case II, for i = t, the empty product equals one, so Ḡ_t = (1/t)G_t as well. From Case I and Case II, we have:

u_{t+1} = \frac{1}{t}\sum_{i=1}^{t}G_i. \tag{26}

Let M₀ be the largest norm of the samples in the dataset. Since |ρ_i| ≤ max{τ₁, τ₂} and ‖z_i‖ ≤ √(M₀² + 1), every ‖G_i‖ is bounded, and thus:

\|u_{t+1}\| \le \max_{1 \le i \le t}\|G_i\| \le Cm\max\{\tau_1,\tau_2\}\sqrt{M_0^2 + 1} =: M_1, \tag{27}

so the sequence {u_t} is bounded.

The following theorem demonstrates the convergence property of Algorithm 1 by using Lemma 1.

Theorem 1. When using Algorithm 1 to solve the optimization problem (Equation (11)), the iterative Equation (22) is convergent.
Proof of Theorem 1. From Equation (26) in the proof of Lemma 1, we obtain an expression for the difference of consecutive iterates u_{t+1} − u_t, which, together with the bound M₁, implies that the series ∑_t (u_{t+1} − u_t) has a convergent norm series. Note that an infinite series of vectors is convergent if its norm series is convergent [34]. Therefore, the limit lim_{t→∞} u_t exists, and we conclude that the sequence u_t generated by Equation (22) is convergent as t → ∞.
The following theorem gives the relation between GPSVM and our proposed SG-GPSVM.
Theorem 2. Let f and u_t be defined by Equations (11) and (21), respectively. If ū = (1/T)∑_{t=1}^{T} u_t is the solution output of SG-GPSVM and u* is the optimal solution of problem (11), then:

f(\bar{u}) - f(u^*) \le \mathcal{O}\Big(\frac{1 + \ln T}{T}\Big),

where the hidden constant depends on M₁ and M₂, the upper bounds of ‖u_t‖ and ‖∇_t‖, respectively.

Proof of Theorem 2.
By the definition of ū and the convexity of f, Jensen's inequality gives:

f(\bar{u}) - f(u^*) \le \frac{1}{T}\sum_{t=1}^{T}\big(f(u_t) - f(u^*)\big). \tag{37}

As f is convex and ∇_t is the subgradient of f at u_t, we have:

f(u_t) - f(u^*) \le \nabla_t^\top(u_t - u^*). \tag{38}

By summing over t = 1 to T and dividing by T, we obtain the following inequality:

\frac{1}{T}\sum_{t=1}^{T}\big(f(u_t) - f(u^*)\big) \le \frac{1}{T}\sum_{t=1}^{T}\nabla_t^\top(u_t - u^*). \tag{39}

Since u_{t+1} = u_t − η_t∇_t, we have:

\nabla_t^\top(u_t - u^*) = \frac{\|u_t - u^*\|^2 - \|u_{t+1} - u^*\|^2}{2\eta_t} + \frac{\eta_t}{2}\|\nabla_t\|^2. \tag{40}

Summing Equation (40) over t = 1 to T with η_t = 1/t, the first terms telescope and are bounded using ‖u_t‖ ≤ M₁, while the second terms satisfy:

\sum_{t=1}^{T}\frac{1}{2t}\|\nabla_t\|^2 \le \frac{M_2^2}{2}(1 + \ln T). \tag{41}

Multiplying the resulting bound by 1/T yields Equation (42), and combining Equations (37), (39), and (42) gives our result.

With Theorem 2, we showed that after T iterations, the resulting expected error is bounded by O((1 + ln T)/T). The theorem thus provides an approximation of f(u*) by f(ū); that is, the average instantaneous objective of SG-GPSVM approximates the objective of GPSVM.

Numerical Experiments
In this section, to demonstrate the validity of our proposed SG-GPSVM, we compare SG-GPSVM with the conventional SVM [35] and Pegasos [25] using artificial datasets and datasets from the UCI Machine Learning Repository [29] with noises of different variances. All of the experiments were performed in Python 3.6.3 on a Windows 8 machine with an Intel i5 processor at 2.50 GHz and 4 GB RAM.

Artificial Datasets
We conducted experiments on a two-dimensional example, for which the samples came from two Gaussian distributions with equal prior probability. The noise affects the labels around the decision boundary, and its level was controlled by the ratio of noisy data in the training set, denoted by r. The value of r was fixed at r = 0 (i.e., noise-free), 0.05, 0.1, and 0.2. All the results of the two algorithms, SG-GPSVM and Pegasos, were derived using the two-dimensional dataset shown in Figure 1, where green circles denote samples from Class 1 and red triangles represent samples from Class −1. From Figure 1, we can see that the noisy samples affect the labels around the decision boundary. As we increase the amount of noise from r = 0% to r = 20%, the hyperplanes of Pegasos start deviating from the ideal slope of 2.14, whereas the deviation in the slopes of the hyperplanes of our SG-GPSVM is significantly smaller. This implies that our proposed algorithm is insensitive to noise around the decision boundary.
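For concreteness, a sketch of this data-generation protocol is given below; the class means, the unit covariances, and the boundary-proximity rule for choosing which labels to flip are illustrative assumptions of ours, since the text above only fixes the noise ratio r:

    import numpy as np

    def make_two_gaussians(m=200, r=0.1, seed=0):
        """Two Gaussian classes with equal prior; a fraction r of the labels
        closest to the class boundary is flipped (label noise)."""
        rng = np.random.default_rng(seed)
        y = rng.choice([-1, 1], size=m)               # equal class probability
        means = np.where(y[:, None] > 0, 1.0, -1.0)   # illustrative class means
        X = rng.normal(loc=means, scale=1.0, size=(m, 2))
        dist = np.abs(X.sum(axis=1))                  # proxy distance to boundary
        flip = np.argsort(dist)[: int(r * m)]         # the r*m nearest samples
        y[flip] *= -1
        return X, y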

UCI Datasets
We also performed experiments on 11 benchmark datasets available in the UCI Machine Learning Repository; Table 1 lists the datasets and their descriptions. For the benchmarks, we compared SG-GPSVM with two other classical SVM-based classifiers: the conventional SVM and Pegasos. Among these SVM models, Pegasos and SG-GPSVM have extra hyperparameters in addition to the hyperparameter C. The performance of the different algorithms depends on the choices of the parameters [33,36,37]. All the hyperparameters were chosen through grid search and manual adjustment, and 10-fold cross-validation was employed. In each algorithm, the optimal parameter C was searched from the set {10^i | i = −2, −1, 0, 1, 2}, and the other parameters were searched from the set {0.1, 0.25, 0.5, 0.75, 1}. Further, the kernel method was adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier, where the RBF kernel was used; its parameter γ was searched from the set {10^i | i = −2, −1, 0, 1, 2}.
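As a sketch of this tuning protocol for the conventional SVM baseline (the grids below mirror the sets quoted above, and scikit-learn's GridSearchCV handles the 10-fold cross-validation; analogous loops apply to the other methods):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [10.0 ** i for i in range(-2, 3)],      # {10^i | i = -2, ..., 2}
        "gamma": [10.0 ** i for i in range(-2, 3)],  # RBF kernel parameter
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)  # 10-fold CV
    # search.fit(X_train, y_train); best = search.best_params_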
The results of the numerical experiments are shown in Table 2, and the optimal parameters used in Table 2 are summarized in Table 3. Table 2 shows the results for a linear and a nonlinear kernel on six different UCI datasets obtained by applying the conventional SVM, Pegasos, and our proposed SG-GPSVM. From the experimental outcomes, it can be seen that the classification performance of SG-GPSVM is better than that of the conventional SVM and Pegasos on most of the datasets in terms of accuracy: SG-GPSVM yielded the best prediction accuracy on four of the six datasets.

In addition, when we varied the number of noisy samples from r = 0 (noise-free) to r = 0.2, our proposed SG-GPSVM exhibited better classification accuracy and stability than the other algorithms regardless of the noise factor value, as shown in Figure 2. On the Appendicitis dataset, when the noise factor increased, the accuracy of the conventional SVM and Pegasos fluctuated greatly, whereas the classification accuracies of our proposed SG-GPSVM remained stable. Similarly, on other datasets, such as Monk-3, the accuracies of the conventional SVM and Pegasos not only varied greatly, but were also worse than those of our proposed algorithm. This demonstrates that SG-GPSVM is relatively stable regardless of the noise factor. In terms of computational time, our proposed algorithm cost nearly the same as Pegasos, and the computational time of both methods on the largest dataset exceeded that of the conventional SVM on the smaller datasets.

The experimental results for three different minibatch sizes for SG-GPSVM (1, 32, 512) on the Spambase and WDBC datasets are illustrated in Figure 3: Subfigure (a) shows how the accuracy changed with the batch size, and Subfigure (b) shows how the computational time changed with the batch size. We can see that the accuracy improved as the minibatch size increased from k = 1 to k = 32, while the computational time grew approximately linearly with the batch size. However, with the large minibatch size of k = 512, the accuracy decreased again while the computational time kept increasing, so k = 32 was more accurate than k = 512. This suggests that in scenarios in which we care more about accuracy, we should choose a moderate batch size (it does not need to be large).

Large-Scale Dataset
In order to validate the classification efficiency of SG-GPSVM, we compared our method with the two most closely related methods (the conventional SVM and Pegasos) on four large-scale UCI datasets and show that SG-GPSVM is capable of solving large-scale problems. Note that the experiments in this section were run on a machine equipped with an Intel CPU E5-2658 v3 at 2.20 GHz and 256 GB RAM running the Ubuntu Linux operating system. The scikit-learn package was used for the conventional SVM [35]. The large-scale datasets are presented in Table 4. For the four large-scale datasets, we compare the performance of SG-GPSVM against the conventional SVM and Pegasos in Table 5, with the best results highlighted in bold. The reduced kernel [38] was employed in the nonlinear case, with the kernel size set to 100. From the results in Table 5, SG-GPSVM outperformed the conventional SVM and Pegasos on three out of the four datasets in terms of accuracy, although on some datasets, such as Credit card, its accuracy was not the best. On the Skin dataset, the conventional SVM performed much worse than SG-GPSVM and Pegasos, and it was not possible to run the conventional SVM on the Kddcup and Susy datasets at all due to its high memory requirements, since SVM requires the complete training set to be stored in main memory. On the Credit card dataset, our proposed SG-GPSVM took almost the same amount of time as the conventional SVM and Pegasos, and it took almost the same amount of time as Pegasos on the Kddcup and Susy datasets.

Conclusions
In this paper, we used the generalized pinball loss function in SVM classification and proposed the SG-GPSVM classifier, obtained by adapting the stochastic subgradient descent method to the generalized pinball SVM. Compared to the hinge loss SVM and Pegasos, the major advantage of our proposed method is that SG-GPSVM is less sensitive to noise, especially feature noise around the decision boundary. In addition, we investigated the convergence of SG-GPSVM and the theoretical approximation of GPSVM by SG-GPSVM. The validity of our proposed SG-GPSVM was demonstrated by numerical experiments on artificial datasets and datasets from UCI with noises of different variances. The experimental results clearly showed that our suggested SG-GPSVM outperformed the existing classifiers in terms of accuracy and that SG-GPSVM has a significant advantage in handling large-scale classification problems. The results imply that the SG-GPSVM approach is a strong candidate for solving binary classification problems.
In future work, we would like to consider applications of SG-GPSVM to activity recognition datasets and image retrieval datasets, and we also plan to extend our approach to deal with multicategory classification scenarios.
Author Contributions: Conceptualization, W.P. and R.W.; writing, original draft, W.P.; writing, review and editing, W.P. and R.W. Both authors read and agreed to the published version of the manuscript.