Symmetry
  • Article
  • Open Access

8 September 2021

Stochastic Subgradient for Large-Scale Support Vector Machine Using the Generalized Pinball Loss Function

1 Department of Mathematics, Faculty of Science, Naresuan University, Phitsanulok 65000, Thailand
2 Research Center for Academic Excellence in Mathematics, Naresuan University, Phitsanulok 65000, Thailand
* Author to whom correspondence should be addressed.
This article belongs to the Section Mathematics

Abstract

In this paper, we propose a stochastic gradient descent algorithm, called the stochastic gradient descent method-based generalized pinball support vector machine (SG-GPSVM), to solve data classification problems. This approach was developed by replacing the hinge loss function in the conventional support vector machine (SVM) with a generalized pinball loss function. We show that SG-GPSVM is convergent and that it approximates the conventional generalized pinball support vector machine (GPSVM). Further, the symmetric kernel method was adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier. According to the experimental results, our suggested algorithm surpasses existing methods in terms of noise insensitivity, resampling stability, and accuracy in large-scale data scenarios.

1. Introduction

Support vector machine (SVM) is a popular supervised binary classification algorithm based on statistical learning theory. Initially proposed by Vapnik [1,2,3], it has attracted increasing attention and spawned many algorithmic and modeling variations. It is a powerful pattern classification tool that has found applications in various fields in recent years, including face detection [4,5], text categorization [6], electroencephalogram signal classification [7], financial regression [8,9,10,11], image retrieval [12], remote sensing [13], and feature extraction [14].
The concept of the conventional SVM is to find an optimal hyperplane that yields the greatest separation between the different categories of observations. The conventional SVM uses the well-known hinge loss function; however, the hinge loss is sensitive to noise, especially noise around the decision boundary, and is not stable under resampling [15]. As a result, many researchers have proposed new SVM methods by changing the loss function. The SVM model with the pinball loss function (Pin-SVM) was proposed by Huang et al. [16] to treat noise sensitivity and instability under resampling. The resulting classifier is less sensitive to noise because the pinball loss is related to the quantile distance. On the other hand, Pin-SVM cannot achieve sparsity. To achieve sparsity, a modified $\epsilon$-insensitive zone for Pin-SVM was proposed. This method does not consider the patterns that lie in the insensitive zone while building the classifier, and its formulation requires the value of $\epsilon$ to be specified beforehand; therefore, a bad choice may affect its performance. Motivated by these developments, Rastogi et al. [17] recently proposed the modified $(\epsilon_1, \epsilon_2)$-insensitive zone support vector machine, which extends existing loss functions to account for noise sensitivity and resampling stability.
However, practical problems require processing large-scale datasets, and the existing solvers are not computationally efficient. Since the generalized pinball loss SVM (GPSVM) still needs to solve a large quadratic programming problem (QPP), both in the primal and in the dual, these techniques have difficulty handling large-scale problems [18,19]. Different solution methods have been proposed for large-scale SVM problems. Optimization techniques such as sequential minimal optimization (SMO) [20], successive over-relaxation (SOR) [21], and the dual coordinate descent method (DCD) [22] have been proposed to solve the SVM problem and its dual. However, these methods necessitate computing the inverse of large matrices; consequently, the dual solutions of SVM cannot be effectively applied to large-scale problems. Moreover, a main challenge for the conventional SVM is its high computational complexity in the number of training samples, i.e., $O(m^3)$, where $m$ is the total number of training samples. As a result, the stochastic gradient descent algorithm (SGD) [23,24] has been proposed to solve the primal problem of SVM. As an application of the stochastic subgradient descent method, Pegasos [25] obtains an $\epsilon$-accurate solution for the primal problem in $\tilde{O}(1/\epsilon)$ iterations, with a cost per iteration of $O(n)$, where $n$ is the feature dimension. This technique partitions a large-scale problem into a series of subproblems by stochastic sampling with a suitable size. SGD for SVM has been shown to be the fastest method among SVM-type classifiers for large-scale problems [23,26,27,28].
In order to overcome the above-mentioned limitations for large-scale problems, and inspired by studies of SVM and the generalized pinball loss function, we propose a novel stochastic subgradient descent method for the generalized pinball support vector machine (SG-GPSVM). The proposed technique is efficient for real-world datasets, especially large-scale ones. Furthermore, we prove theorems that guarantee the approximation and convergence properties of our proposed algorithm. Finally, the experimental results show that the proposed SG-GPSVM approach outperforms existing approaches in terms of accuracy and converges faster than the conventional SVM. The results also show that our proposed SG-GPSVM is insensitive to noise, more stable under resampling, and able to handle large-scale problems.
The structure of the paper is as follows. Section 2 outlines the background. SG-GPSVM is proposed in Section 3, covering both the linear and nonlinear cases. In Section 4, we theoretically compare the proposed SG-GPSVM with two other algorithms, i.e., the conventional SVM and Pegasos [25]. Experiments on benchmark datasets from the UCI Machine Learning Repository [29], with different levels of noise, are conducted to verify the effectiveness of our SG-GPSVM; the results are given in Section 5. Finally, the conclusion is given in Section 6.

3. Proposed Stochastic Subgradient Generalized Pinball Support Vector Machine

In this section, we apply the stochastic subgradient approach to our SG-GPSVM formulation, which is based on the generalized pinball loss function. SG-GPSVM can be used in both the linear and nonlinear cases.

3.1. Linear Case

Following the method of formulating SVM problems (discussed in Equation (4)), we incorporated the generalized pinball loss function (Equation (7)) in the objective function to obtain the following convex unconstrained minimization problem:
$$\min_{u} f(u) = \frac{1}{2}\|u\|^{2} + \frac{C}{m}\sum_{i=1}^{m} L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}u^{\top}z_{i}\right), \tag{11}$$
where $u = (w, b)$, $z = (x, 1)$, $w \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$ are the weight vector and bias, respectively, and $C > 0$ is the penalty parameter. To apply the stochastic subgradient approach, at each iteration $t$, we use a more general scheme based on $k$ samples, where $1 \leq k \leq m$. We choose a subset $A_{t} \subseteq [m]$ with $|A_{t}| = k$, where the $k$ samples are drawn uniformly at random from the training set and $[m] = \{1, 2, 3, \ldots, m\}$. Let $u_{t}$ denote the current hyperplane iterate; we then obtain the approximate objective function:
$$f(u_{t}) = \frac{1}{2}\|u_{t}\|^{2} + \frac{C}{k}\sum_{i \in A_{t}} L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}u_{t}^{\top}z_{i}\right). \tag{12}$$
When $k = m$, the approximate objective function at each iteration $t$ coincides with the original objective function in problem (11). Let $\nabla_{t}$ be a subgradient of $f(u_{t})$ associated with the minibatch index set $A_{t}$ at the point $u_{t}$, that is:
$$\nabla_{t} \in \partial f(u_{t}) = u_{t} - \frac{C}{k}\sum_{i \in A_{t}} y_{i}z_{i}\,\partial L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}u_{t}^{\top}z_{i}\right), \tag{13}$$
where:
$$\partial L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}(v) = \begin{cases} \{\tau_{1}\}, & v > \frac{\epsilon_{1}}{\tau_{1}},\\ \left[0, \tau_{1}\right], & v = \frac{\epsilon_{1}}{\tau_{1}},\\ \{0\}, & -\frac{\epsilon_{2}}{\tau_{2}} < v < \frac{\epsilon_{1}}{\tau_{1}},\\ \left[-\tau_{2}, 0\right], & v = -\frac{\epsilon_{2}}{\tau_{2}},\\ \{-\tau_{2}\}, & v < -\frac{\epsilon_{2}}{\tau_{2}}. \end{cases} \tag{14}$$
With the above notation and the existence of $\rho_{i} \in \partial L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}u_{t}^{\top}z_{i}\right)$, $\nabla_{t}$ can be written as:
$$\nabla_{t} = u_{t} - \frac{C}{k}\sum_{i \in A_{t}}\rho_{i}y_{i}z_{i}. \tag{15}$$
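For readers who wish to implement this step, the following Python sketch shows one way to pick an element $\rho_{i}$ of the subdifferential in Equation (14); the function name and the choice of value at the kinks are illustrative assumptions rather than part of the formal method.

```python
def pinball_subgradient(v, tau1, tau2, eps1, eps2):
    """Return one element rho of the subdifferential in Equation (14),
    evaluated at v = 1 - y_i * u^T z_i."""
    if v > eps1 / tau1:
        return tau1
    if v < -eps2 / tau2:
        return -tau2
    # At the two kinks any value in [0, tau1] or [-tau2, 0] is admissible;
    # returning 0 is one convenient choice.
    return 0.0
```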
Then, $u_{t+1} = u_{t} - \eta_{t}\nabla_{t}$ is updated using the step size $\eta_{t} = \frac{1}{t}$. Additionally, a new sample $z$ is predicted by:
$$y = \operatorname{sgn}\left(u^{\top}z\right). \tag{16}$$
The above steps can be outlined as Algorithm 1.
Algorithm 1 SG-GPSVM.
Input: Training samples represented by $X \in \mathbb{R}^{n \times m}$; positive parameters $C$, $\tau_{1}$, $\tau_{2}$, $\epsilon_{1}$, $\epsilon_{2}$; minibatch size $k \in \{1, 2, \ldots, m\}$; and tolerance $tol$; typically, $tol = 10^{-4}$.
1: Set $u_{1} = 0$ and $t = 1$;
2: while $\|\nabla_{t}\| \geq tol$ do
3:      Choose $A_{t} \subseteq [m]$, where $|A_{t}| = k$, uniformly at random.
4:      Compute the stochastic subgradient $\nabla_{t}$ using Equation (15).
5:      Update $u_{t+1} = u_{t} - \eta_{t}\nabla_{t}$.
6:      $t = t + 1$
7: end
Output: Optimal hyperplane parameters $u = \frac{1}{T}\sum_{t=1}^{T}u_{t}$.
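A minimal, self-contained Python sketch of Algorithm 1 is given below. It assumes the pinball_subgradient helper from the previous sketch, runs a fixed number of iterations $T$ instead of the tolerance test, and all names and default parameter values are illustrative.

```python
import numpy as np

def sggpsvm_train(X, y, C=1.0, tau1=0.5, tau2=0.5, eps1=0.5, eps2=0.5,
                  k=32, T=1000, seed=0):
    """Sketch of Algorithm 1 (linear SG-GPSVM).
    X: (m, n) samples, y: (m,) labels in {-1, +1}.
    Returns the averaged iterate u = (w, b)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Z = np.hstack([X, np.ones((m, 1))])        # z_i = (x_i, 1)
    u = np.zeros(n + 1)                        # u = (w, b)
    u_avg = np.zeros(n + 1)
    for t in range(1, T + 1):
        A = rng.choice(m, size=k, replace=False)           # minibatch A_t
        v = 1.0 - y[A] * (Z[A] @ u)                        # 1 - y_i u^T z_i
        rho = np.array([pinball_subgradient(vi, tau1, tau2, eps1, eps2) for vi in v])
        grad = u - (C / k) * (rho * y[A]) @ Z[A]           # Equation (15)
        u = u - (1.0 / t) * grad                           # step size eta_t = 1/t
        u_avg += u
    return u_avg / T

def sggpsvm_predict(u, X):
    """Predict labels with y = sgn(u^T z), Equation (16)."""
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Z @ u)
```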

3.2. Nonlinear Case

The support vector machine has the advantage that it can be employed with symmetric kernels rather than requiring direct access to the feature vectors $x$; that is, instead of considering predictors that are linear functions of the training samples $x$ themselves, one considers predictors that are linear functions of some implicit mapping $\phi(x)$ of the instances. In order to extend the linear SG-GPSVM to the nonlinear case via the symmetric kernel trick [32,33], symmetric-kernel-generated surfaces are considered instead of hyperplanes; they are given by:
$$w^{\top}\phi(x) + b = 0.$$
Then, the primal problem for the nonlinear SG-GPSVM is as follows:
$$\min_{u} f(u) = \frac{1}{2}\|u\|^{2} + \frac{C}{m}\sum_{i=1}^{m} L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}\left(w^{\top}\phi(x_{i}) + b\right)\right),$$
where $\phi(x)$ is a nonlinear mapping function that maps $x$ into a higher-dimensional feature space. To apply the stochastic subgradient approach, at each iteration $t$, we again use $k$ samples, where $1 \leq k \leq m$: we choose a subset $A_{t} \subseteq [m]$ with $|A_{t}| = k$, where the $k$ samples are drawn uniformly at random from the training set and $[m] = \{1, 2, 3, \ldots, m\}$. Consider the approximate objective function:
$$f(u_{t}) = \frac{1}{2}\|u_{t}\|^{2} + \frac{C}{k}\sum_{i \in A_{t}} L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}\left(w_{t}^{\top}\phi(x_{i}) + b_{t}\right)\right).$$
Then, consider the subgradient of the above approximate objective, and let $\nabla_{t}$ be a subgradient of $f$ at $u_{t}$, that is:
$$\nabla_{t} \in \partial f(u_{t}) = u_{t} - \frac{C}{k}\sum_{i \in A_{t}} y_{i}\phi(x_{i})\,\partial L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}\left(w_{t}^{\top}\phi(x_{i}) + b_{t}\right)\right),$$
where $\partial L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}$ is defined in Equation (14) and, similar to the linear case, $\rho_{i} \in \partial L_{\tau_{1},\tau_{2}}^{\epsilon_{1},\epsilon_{2}}\left(1 - y_{i}\left(w_{t}^{\top}\phi(x_{i}) + b_{t}\right)\right)$ exists. Then, $\nabla_{t}$ can be written as:
$$\nabla_{t} = u_{t} - \frac{C}{k}\sum_{i \in A_{t}}\rho_{i}y_{i}\phi(x_{i}).$$
Then, update $u_{t+1} = u_{t} - \eta_{t}\nabla_{t}$ using the step size $\eta_{t} = \frac{1}{t}$. Additionally, a new sample $x$ can be predicted by:
$$y = \operatorname{sgn}\left(w^{\top}\phi(x) + b\right).$$
The above steps can be outlined as Algorithm 2.
Algorithm 2 Nonlinear SG-GPSVM.
Input: Training samples represented by $X \in \mathbb{R}^{n \times m}$; positive parameters $C$, $\tau_{1}$, $\tau_{2}$, $\epsilon_{1}$, $\epsilon_{2}$; minibatch size $k \in \{1, 2, \ldots, m\}$; and tolerance $tol$; typically, $tol = 10^{-4}$.
1: Set $u_{1} = 0$ and $t = 1$;
2: while $\|\nabla_{t}\| \geq tol$ do
3:      Choose $A_{t} \subseteq [m]$, where $|A_{t}| = k$, uniformly at random.
4:      Compute the stochastic subgradient $\nabla_{t}$ using Equation (20).
5:      Update $u_{t+1} = u_{t} - \eta_{t}\nabla_{t}$.
6:      $t = t + 1$
7: end
Output: Optimal hyperplane parameters $u = \frac{1}{T}\sum_{t=1}^{T}u_{t}$.
However, the mapping $\phi(x)$ is never specified explicitly; rather, it enters only through a symmetric kernel operator $K(x, x') = \phi(x)^{\top}\phi(x')$, which yields the inner products after the mapping $\phi$.
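As one possible realization of this kernel trick, the sketch below maps samples onto explicit RBF kernel features $K(x, x_{j})$ computed against a fixed basis set (in the spirit of the reduced-kernel setting used later in Section 5.3); the function names, the choice of $\gamma$, and the basis size are assumptions for illustration.

```python
import numpy as np

def rbf_kernel_features(X, basis, gamma=1.0):
    """Map samples X onto explicit kernel features K(x, x_j) for a fixed
    basis set of points (e.g., a random subset of the training data)."""
    # Squared Euclidean distances between every x in X and every basis point
    d2 = ((X[:, None, :] - basis[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Usage sketch: pick a basis, build kernel features, then train the *linear*
# SG-GPSVM of Algorithm 1 on those features (helpers assumed from Section 3.1).
# basis = X_train[np.random.default_rng(0).choice(len(X_train), 100, replace=False)]
# K_train = rbf_kernel_features(X_train, basis, gamma=0.1)
# u = sggpsvm_train(K_train, y_train)
```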

4. Convergence Analysis

In this section, we analyze the convergence of the proposed SG-GPSVM model. For convenience, we only consider the optimization problem in Equation (11); the conclusions for the nonlinear algorithm can be obtained similarly. Given $u_{1}$ and the step size $\eta_{t} = \frac{1}{t}$, the iterate $u_{t+1}$ for $t \geq 1$ is updated by:
$$u_{t+1} = u_{t} - \eta_{t}\nabla_{t}, \tag{21}$$
i.e.,
$$u_{t+1} = u_{t} - \eta_{t}\nabla_{t} = u_{t} - \frac{1}{t}\left(u_{t} - \frac{C}{k}\sum_{i \in A_{t}}\rho_{i}y_{i}z_{i}\right) = \left(1 - \frac{1}{t}\right)u_{t} + \frac{C}{kt}\sum_{i \in A_{t}}\rho_{i}y_{i}z_{i}. \tag{22}$$
To prove the convergence of Algorithm 1, we first consider the boundedness of $u_{t}$. In fact, we have the following lemma.
Lemma 1. 
The sequences $\{\nabla_{t} : t = 1, 2, \ldots\}$ and $\{u_{t} : t = 1, 2, \ldots\}$ are bounded in norm, where $\nabla_{t}$ and $u_{t}$ are defined by Equations (15) and (22), respectively.
Proof of Lemma 1. 
Equation (22) can be rewritten as:
$$u_{t+1} = G_{t}u_{t} + \frac{1}{t}v_{t}, \tag{23}$$
where $G_{t} = \left(1 - \frac{1}{t}\right)I$, $I$ is the identity matrix, and $v_{t} = \frac{C}{k}\sum_{i \in A_{t}}\rho_{i}y_{i}z_{i}$. For $t \geq 2$, $G_{t}$ is positive definite, and the largest eigenvalue $\lambda_{t}$ of $G_{t}$ is equal to $\frac{t-1}{t}$. From Equation (23), we have:
$$u_{t+1} = \left(\prod_{i=2}^{t} G_{i}\right)u_{2} + \sum_{i=2}^{t}\bar{G}_{i},$$
where:
$$\bar{G}_{i} = \begin{cases} \dfrac{1}{i}\left(\prod_{j=i+1}^{t} G_{j}\right)v_{i}, & \text{if } i < t,\\ \dfrac{1}{i}v_{i}, & \text{if } i = t. \end{cases}$$
For any $i \geq 2$,
$$\|G_{i}u_{2}\| \leq \lambda_{i}\|u_{2}\| = \frac{i-1}{i}\|u_{2}\|.$$
Therefore,
$$\left\|\left(\prod_{i=2}^{t} G_{i}\right)u_{2}\right\| = \|G_{2}G_{3}\cdots G_{t}u_{2}\| \leq \|G_{2}\|\,\|G_{3}\|\cdots\|G_{t}\|\,\|u_{2}\| \leq \frac{2-1}{2}\cdot\frac{3-1}{3}\cdots\frac{t-1}{t}\|u_{2}\| = \frac{1}{t}\|u_{2}\|.$$
Next, consider $\|\bar{G}_{i}\|$. In Case I, for $i < t$, we have:
$$\|\bar{G}_{i}\| = \left\|\frac{1}{i}\left(\prod_{j=i+1}^{t} G_{j}\right)v_{i}\right\| = \frac{1}{i}\|G_{i+1}G_{i+2}\cdots G_{t}v_{i}\| \leq \frac{1}{i}\|G_{i+1}\|\,\|G_{i+2}\|\cdots\|G_{t}\|\,\|v_{i}\| \leq \frac{1}{i}\cdot\frac{(i+1)-1}{i+1}\cdot\frac{(i+2)-1}{i+2}\cdots\frac{t-1}{t}\|v_{i}\| = \frac{1}{t}\|v_{i}\| \leq \frac{1}{t}\max_{i<t}\|v_{i}\|,$$
and in Case II, for $i = t$, we have:
$$\|\bar{G}_{i}\| = \frac{1}{i}\|v_{i}\| = \frac{1}{t}\|v_{t}\| \leq \frac{1}{t}\max_{i=t}\|v_{i}\|.$$
From Case I and Case II, we have:
$$\|\bar{G}_{i}\| \leq \frac{1}{t}\max_{i \leq t}\|v_{i}\|.$$
Thus,
$$\begin{aligned} \|u_{t+1}\| &= \left\|\left(\prod_{i=2}^{t} G_{i}\right)u_{2} + \sum_{i=2}^{t}\bar{G}_{i}\right\| \leq \frac{1}{t}\|u_{2}\| + \sum_{i=2}^{t}\frac{1}{t}\max_{i \leq t}\|v_{i}\|\\ &= \frac{1}{t}\|u_{2}\| + \frac{t-1}{t}\max_{i \leq t}\|v_{i}\| \leq \|u_{2}\| + C\max\{\tau_{1},\tau_{2}\}\max_{i \in [m]}\|z_{i}\|. \end{aligned}$$
Let $M_{0} = \max_{i \in [m]}\|z_{i}\|$ be the largest norm of the samples in the dataset, and let:
$$M_{1} = \max\left\{\|u_{1}\|,\ \|u_{2}\| + C\max\{\tau_{1},\tau_{2}\}M_{0}\right\},$$
that is, $\|u_{t}\| \leq M_{1}$ for $t = 1, 2, \ldots$. For $t \geq 1$,
$$\|\nabla_{t}\| = \left\|u_{t} - \frac{C}{k}\sum_{i \in A_{t}}\rho_{i}y_{i}z_{i}\right\| \leq \|u_{t}\| + \frac{C}{k}\sum_{i \in A_{t}}|\rho_{i}|\,\|y_{i}z_{i}\| \leq M_{1} + C\max\{\tau_{1},\tau_{2}\}M_{0} = M_{2}.$$
 □
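As a quick numerical illustration of Lemma 1 (not part of the formal analysis), the following self-contained Python sketch runs the update in Equation (22) on randomly generated data and checks that the iterate and subgradient norms never exceed $M_{1}$ and $M_{2}$; all parameter values and names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, T = 200, 5, 16, 500
C, tau1, tau2, eps1, eps2 = 1.0, 0.5, 0.5, 0.5, 0.5
X = rng.normal(size=(m, n))
y = np.sign(rng.normal(size=m))
y[y == 0] = 1
Z = np.hstack([X, np.ones((m, 1))])

u = np.zeros(n + 1)
iterates, grad_norms = [u.copy()], []
for t in range(1, T + 1):
    A = rng.choice(m, size=k, replace=False)
    v = 1.0 - y[A] * (Z[A] @ u)
    rho = np.where(v > eps1 / tau1, tau1, np.where(v < -eps2 / tau2, -tau2, 0.0))
    g = u - (C / k) * (rho * y[A]) @ Z[A]       # subgradient, Equation (15)
    grad_norms.append(np.linalg.norm(g))
    u = u - g / t                               # update, Equation (22)
    iterates.append(u.copy())

M0 = np.linalg.norm(Z, axis=1).max()
M1 = max(np.linalg.norm(iterates[0]),
         np.linalg.norm(iterates[1]) + C * max(tau1, tau2) * M0)
M2 = M1 + C * max(tau1, tau2) * M0
print(max(np.linalg.norm(w) for w in iterates) <= M1 + 1e-9)  # expected: True
print(max(grad_norms) <= M2 + 1e-9)                           # expected: True
```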
Using Lemma 1, the following theorem establishes the convergence of Algorithm 1.
Theorem 1. 
When Algorithm 1 is used to solve the optimization problem in Equation (11), the iterative scheme in Equation (22) is convergent.
Proof of Theorem 1. 
From Equation (26) in the proof of Lemma 1, we therefore obtain:
$$\lim_{t \to \infty}\left\|\left(\prod_{i=2}^{t} G_{i}\right)u_{2}\right\| = 0,$$
which implies that:
$$\lim_{t \to \infty}\left(\prod_{i=2}^{t} G_{i}\right)u_{2} = 0.$$
By using Equation (30), we have:
$$\|\bar{G}_{i}\| \leq \frac{1}{t}\max_{i \leq t}\|v_{i}\| \leq \frac{1}{t}\,C\max\{\tau_{1},\tau_{2}\}\max_{i \in [m]}\|z_{i}\|,$$
which indicates that:
$$\lim_{t \to \infty}\sum_{i=2}^{t}\|\bar{G}_{i}\| < \infty.$$
Note that an infinite series of vectors is convergent if its norm series is convergent [34]. Therefore, the following limit exists:
$$\lim_{t \to \infty}\sum_{i=2}^{t}\bar{G}_{i}.$$
Combining Equations (33) and (36), we conclude that the sequence $\{u_{t+1}\}$ is convergent as $t \to \infty$. □
The following theorem gives the relation between GPSVM and our proposed SG-GPSVM.
Theorem 2. 
Let $f$ and $u_{t}$ be defined by Equations (11) and (21), respectively, with $f$ convex. If $u = \frac{1}{T}\sum_{t=1}^{T}u_{t}$ is the solution output by SG-GPSVM, then:
$$f(u) \leq f(u^{*}) + M_{2}\left(M_{1} + \|u^{*}\|\right) + \frac{M_{2}^{2}}{2T}\left(1 + \ln T\right),$$
where $M_{1}$ and $M_{2}$ are the upper bounds of $\|u_{t}\|$ and $\|\nabla_{t}\|$ from Lemma 1, respectively, and $u^{*}$ is an optimal solution of problem (11).
Proof of Theorem 2. 
By the definition of u , we have:
$$f(u) - f(u^{*}) = f\left(\frac{1}{T}\sum_{t=1}^{T}u_{t}\right) - f(u^{*}) \leq \frac{1}{T}\sum_{t=1}^{T}f(u_{t}) - f(u^{*}) = \frac{1}{T}\sum_{t=1}^{T}\left(f(u_{t}) - f(u^{*})\right).$$
As $f$ is convex and $\nabla_{t}$ is a subgradient of $f$ at $u_{t}$, we have:
$$f(u_{t}) - f(u^{*}) \leq (u_{t} - u^{*})^{\top}\nabla_{t}.$$
By summing over t = 1 to T and dividing by T, we obtain the following inequality:
$$\frac{1}{T}\sum_{t=1}^{T}\left(f(u_{t}) - f(u^{*})\right) \leq \frac{1}{T}\sum_{t=1}^{T}(u_{t} - u^{*})^{\top}\nabla_{t}.$$
Since $u_{t+1} = u_{t} - \eta_{t}\nabla_{t}$, we have:
$$(u_{t} - u^{*})^{\top}\nabla_{t} = \frac{1}{2\eta_{t}}\left(\|u_{t} - u^{*}\|^{2} - \|u_{t+1} - u^{*}\|^{2}\right) + \frac{\eta_{t}}{2}\|\nabla_{t}\|^{2}.$$
Summing over t = 1 to T, we obtain:
$$\begin{aligned} \sum_{t=1}^{T}(u_{t} - u^{*})^{\top}\nabla_{t} &= \frac{1}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\left(\|u_{t} - u^{*}\|^{2} - \|u_{t+1} - u^{*}\|^{2}\right) + \frac{1}{2}\sum_{t=1}^{T}\eta_{t}\|\nabla_{t}\|^{2}\\ &= \frac{1}{2}\left(\sum_{t=1}^{T}t\|u_{t} - u^{*}\|^{2} - \sum_{t=1}^{T}t\|u_{t+1} - u^{*}\|^{2}\right) + \frac{1}{2}\sum_{t=1}^{T}\eta_{t}\|\nabla_{t}\|^{2}\\ &= \frac{1}{2}\left(\sum_{t=1}^{T}\|u_{t} - u^{*}\|^{2} - T\|u_{T+1} - u^{*}\|^{2}\right) + \frac{1}{2}\sum_{t=1}^{T}\eta_{t}\|\nabla_{t}\|^{2}\\ &\leq \left(M_{1} + \|u^{*}\|\right)\sum_{t=1}^{T}\|u_{T+1} - u_{t}\| + \frac{1}{2}M_{2}^{2}\left(1 + \ln T\right)\\ &= \left(M_{1} + \|u^{*}\|\right)\sum_{t=1}^{T}\left\|\sum_{i=t}^{T}\frac{1}{i}\nabla_{i}\right\| + \frac{1}{2}M_{2}^{2}\left(1 + \ln T\right)\\ &\leq T M_{2}\left(M_{1} + \|u^{*}\|\right) + \frac{1}{2}M_{2}^{2}\left(1 + \ln T\right). \end{aligned}$$
Multiplying Equation (41) by 1 / T , we have:
$$\frac{1}{T}\sum_{t=1}^{T}(u_{t} - u^{*})^{\top}\nabla_{t} \leq M_{2}\left(M_{1} + \|u^{*}\|\right) + \frac{1}{2T}M_{2}^{2}\left(1 + \ln T\right).$$
By Equations (37), (39), and (42), we have our result:
$$f(u) \leq f(u^{*}) + M_{2}\left(M_{1} + \|u^{*}\|\right) + \frac{1}{2T}M_{2}^{2}\left(1 + \ln T\right).$$
 □
Theorem 2 shows that, after $T$ iterations, the resulting expected error is bounded by $O\!\left(\frac{1 + \ln T}{T}\right)$, and the theorem provides an approximation of $f(u^{*})$ by $f(u)$; that is, the average instantaneous objective of SG-GPSVM correlates with the objective of GPSVM.
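To make this approximation concrete, the sketch below evaluates the objective of problem (11) at the averaged iterate. It assumes that the generalized pinball loss takes the form $\max\{\tau_{1}(v - \epsilon_{1}/\tau_{1}),\, 0,\, -\tau_{2}(v + \epsilon_{2}/\tau_{2})\}$, which is consistent with the subdifferential in Equation (14), and it reuses the hypothetical sggpsvm_train helper sketched in Section 3.1.

```python
import numpy as np

def gp_loss(v, tau1, tau2, eps1, eps2):
    """Assumed generalized pinball loss, consistent with Equation (14)."""
    return np.maximum.reduce([tau1 * (v - eps1 / tau1),
                              np.zeros_like(v),
                              -tau2 * (v + eps2 / tau2)])

def gpsvm_objective(u, Z, y, C, tau1, tau2, eps1, eps2):
    """Full objective f(u) of problem (11), with Z holding rows z_i = (x_i, 1)."""
    v = 1.0 - y * (Z @ u)
    return 0.5 * np.dot(u, u) + (C / len(y)) * gp_loss(v, tau1, tau2, eps1, eps2).sum()

# Usage sketch: train with increasing T and watch f(u) of the averaged
# iterate settle, as suggested by Theorem 2.
# Z = np.hstack([X_train, np.ones((len(X_train), 1))])
# for T in (100, 1000, 10000):
#     u = sggpsvm_train(X_train, y_train, T=T)
#     print(T, gpsvm_objective(u, Z, y_train, 1.0, 0.5, 0.5, 0.5, 0.5))
```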

5. Numerical Experiments

In this section, to demonstrate the validity of our proposed SG-GPSVM, we compare SG-GPSVM with the conventional SVM [35] and Pegasos [25] using artificial datasets and datasets from the UCI Machine Learning Repository [29] with noises of different variances. All of the experiments were performed with Python 3.6.3 on a Windows 8 machine with an Intel i5 processor at 2.50 GHz and 4 GB of RAM.

5.1. Artificial Datasets

We conducted experiments on a two-dimensional example in which the samples came from two Gaussian distributions with equal probability: $x_i,\, i \in \{i : y_i = 1\} \sim N(\mu_1, \Sigma_1)$ and $x_i,\, i \in \{i : y_i = -1\} \sim N(\mu_2, \Sigma_2)$, where $\mu_1 = [1, 3]$, $\mu_2 = [-1, -3]$, and $\Sigma_1 = \Sigma_2 = \begin{bmatrix} 0.2 & 0 \\ 0 & 3 \end{bmatrix}$. We added noise to the dataset. The labels of the noise points were selected from $\{-1, 1\}$ with equal probabilities, and each noisy sample was drawn from a Gaussian distribution $N(\mu_n, \Sigma_n)$ with $\mu_n = [0, 0]$ and $\Sigma_n = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$. This noise affects the labels around the decision boundary. The level of noise was controlled by the ratio of noisy data in the training set, denoted by $r$, which was fixed at $r = 0$ (i.e., noise-free), $0.05$, $0.1$, and $0.2$. The results of the two algorithms, SG-GPSVM and Pegasos, on this two-dimensional dataset are shown in Figure 1. Here, green circles denote samples from Class $1$ and red triangles represent samples from Class $-1$. From Figure 1, we can see that the noisy samples affect the labels around the decision boundary. As we increase the amount of noise from $r = 0\%$ to $r = 20\%$, the hyperplanes of Pegasos start deviating from the ideal slope of 2.14, whereas the deviation in the slopes of the hyperplanes of our SG-GPSVM is significantly smaller. This implies that our proposed algorithm is insensitive to noise around the decision boundary.
Figure 1. SG-GPSVM and Pegasos on a noisy artificial dataset. The four panels demonstrate the noise-insensitive properties of our proposed SG-GPSVM compared to Pegasos when we have: (a) r = 0 (noise free); (b) r = 0.05; (c) r = 0.1; (d) r = 0.2.
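The following sketch generates a dataset of this kind; the per-class sample count and random seed are assumptions (they are not stated in the text), and the class means follow the reconstruction above.

```python
import numpy as np

def make_noisy_gaussians(n_per_class=200, r=0.1, seed=0):
    """Two-Gaussian dataset of Section 5.1 with a fraction r of boundary noise."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = np.array([1.0, 3.0]), np.array([-1.0, -3.0])
    cov = np.array([[0.2, 0.0], [0.0, 3.0]])
    X = np.vstack([rng.multivariate_normal(mu1, cov, n_per_class),
                   rng.multivariate_normal(mu2, cov, n_per_class)])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    # Add noisy points around the origin with randomly chosen labels
    n_noise = int(r * len(y))
    if n_noise > 0:
        Xn = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], n_noise)
        yn = rng.choice([-1.0, 1.0], n_noise)
        X, y = np.vstack([X, Xn]), np.hstack([y, yn])
    return X, y
```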

5.2. UCI Datasets

We also performed experiments on 11 benchmark datasets available in the UCI Machine Learning Repository. Table 1 lists the datasets and their descriptions. For these benchmarks, we compared SG-GPSVM with two other classical SVM-based classifiers: the conventional SVM and Pegasos. Among these SVM models, Pegasos and SG-GPSVM have extra hyperparameters in addition to the hyperparameter $C$. The performance of the different algorithms depends on the choices of the parameters [33,36,37]. All the hyperparameters were chosen through grid search and manual adjustment, and 10-fold cross-validation was employed. In each algorithm, the optimal parameter $C$ was searched from the set $\{10^{i} \mid i = -2, -1, 0, 1, 2\}$, and the other parameters were searched from the set $\{0.1, 0.25, 0.5, 0.75, 1\}$. Further, the kernel method was adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier, where the RBF kernel was used; the RBF kernel parameter $\gamma$ was searched from the set $\{10^{i} \mid i = -2, -1, 0, 1, 2\}$. The results of the numerical experiments are shown in Table 2, and the optimal parameters used in Table 2 are summarized in Table 3.
Table 1. Description of the UCI datasets.
Table 2. Accuracy obtained on the UCI datasets using the linear and nonlinear ( * ) kernel.
Table 3. The optimal parameters of conventional SVM, Pegasos, SG-GPSVM, and SG-GPSVM * .
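A hedged sketch of this tuning protocol is given below; it assumes the hypothetical sggpsvm_train and sggpsvm_predict helpers from Section 3.1, expects NumPy arrays as input, and, for brevity, ties $\tau_{1} = \tau_{2}$ and $\epsilon_{1} = \epsilon_{2}$ to a single grid value each, which is a simplification of the full search described above.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

def grid_search_sggpsvm(X, y, k=32, T=2000):
    """Grid search over C and the loss parameters with 10-fold cross-validation."""
    C_grid = [10.0 ** i for i in (-2, -1, 0, 1, 2)]
    p_grid = [0.1, 0.25, 0.5, 0.75, 1.0]          # values for tau and eps
    best = (None, -np.inf)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for C, tau, eps in itertools.product(C_grid, p_grid, p_grid):
        accs = []
        for tr, te in cv.split(X, y):
            u = sggpsvm_train(X[tr], y[tr], C=C, tau1=tau, tau2=tau,
                              eps1=eps, eps2=eps, k=k, T=T)
            accs.append(np.mean(sggpsvm_predict(u, X[te]) == y[te]))
        if np.mean(accs) > best[1]:
            best = ((C, tau, eps), np.mean(accs))
    return best  # ((C, tau, eps), mean CV accuracy)
```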
Table 2 shows the results for the linear and nonlinear kernels on six different UCI datasets obtained by applying the conventional SVM, Pegasos, and our proposed SG-GPSVM. From the experimental outcomes, it can be seen that the classification performance of SG-GPSVM is better than that of the conventional SVM and Pegasos on most of the datasets in terms of accuracy. From Table 2, our proposed SG-GPSVM yielded the best prediction accuracy on four of the six datasets. In addition, when we varied the proportion of noisy samples from r = 0 (noise-free) to r = 0.2, our proposed SG-GPSVM exhibited better classification accuracy and stability than the other algorithms regardless of the noise factor, as shown in Figure 2. On the Appendicitis dataset, when the noise factor increased, the accuracy of the conventional SVM and Pegasos fluctuated greatly; on the contrary, the classification accuracy of our proposed SG-GPSVM was stable. Similarly, on other datasets, such as Monk-3, the accuracies of the conventional SVM and Pegasos not only varied greatly but were also worse than those of our proposed algorithm. This demonstrates that our proposed SG-GPSVM is relatively stable regardless of the noise factor. In terms of computational time, our proposed algorithm took nearly the same time as Pegasos, and the computational time of Pegasos and our proposed SG-GPSVM on the largest dataset was longer than that of the conventional SVM on the smaller datasets.
Figure 2. The accuracy of four algorithms with different noise factors on six datasets: (a) Appendicitis; (b) Ionosphere; (c) Monk-2; (d) Monk-3; (e) Phoneme; (f) Saheart.
Figure 3 illustrates the experimental results for three different minibatch sizes of SG-GPSVM (k = 1, 32, 512) in terms of accuracy and computational time. Subfigure (a) shows how the accuracy changed with the batch size, and the curves in Subfigure (b) show how the computational time changed with the batch size. We can see that increasing the minibatch size initially improved the accuracy, while the computational time increased approximately linearly with the batch size. The datasets Spambase and WDBC were taken into consideration. It is evident that a relatively small minibatch size of k = 32 is more accurate than k = 512: with the large minibatch size k = 512, the accuracy decreased and the computational time increased. This suggests that in scenarios where accuracy matters most, a moderate batch size should be chosen (it does not need to be large).
Figure 3. Classification results on the UCI datasets with minibatch SG-GPSVM. (a) Accuracy of minibatch SG-GPSVM. (b) Computational time of minibatch SG-GPSVM.
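The minibatch-size study can be reproduced along the following lines, again using the hypothetical helpers from Section 3.1; the number of iterations and the train/test split are assumptions.

```python
import time
import numpy as np

def batch_size_study(X_train, y_train, X_test, y_test, T=2000):
    """Train with k in {1, 32, 512} and record accuracy and wall-clock time."""
    results = {}
    for k in (1, 32, 512):
        start = time.perf_counter()
        u = sggpsvm_train(X_train, y_train, k=min(k, len(y_train)), T=T)
        elapsed = time.perf_counter() - start
        acc = np.mean(sggpsvm_predict(u, X_test) == y_test)
        results[k] = (acc, elapsed)
    return results  # {batch size: (accuracy, seconds)}
```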

5.3. Large-Scale Dataset

In order to validate the classification efficiency of SG-GPSVM, we compared our method with the two most closely related methods (the conventional SVM and Pegasos) on four large-scale UCI datasets and show that SG-GPSVM is capable of solving large-scale problems. Note that the results in this section were obtained on a machine equipped with an Intel CPU E5-2658 v3 at 2.20 GHz and 256 GB of RAM running the Ubuntu Linux operating system. The scikit-learn package was used for the conventional SVM [35]. The large-scale datasets are presented in Table 4.
Table 4. The details of the large-scale datasets.
For the four large-scale datasets, we compare the performance of SG-GPSVM against the conventional SVM and Pegasos in Table 5. The reduced kernel [38] was employed in the nonlinear situation, and the kernel size was set to 100. From the results in Table 5, SG-GPSVM outperformed the conventional SVM and Pegasos on three out of the four datasets in terms of accuracy. The best ones are highlighted in bold. However, SG-GPSVM’s accuracy on some datasets, such as Credit card, was not the best. On the Skin dataset, the conventional SVM performed much worse than SG-GPSVM and Pegasos, and it was not possible to use the conventional SVM on the Kddcup and Susy datasets due to the high memory requirements. This was because SVM requires the complete training set to be stored in the main memory. On the Credit card dataset, our proposed SG-GPSVM took almost the same amount of time as the conventional SVM and Pegasos. Our suggested approach, SG-GPSVM, took almost the same amount of time as Pegasos on the Kddcup and Susy datasets.
Table 5. Classification results on the large-scale UCI datasets.
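A possible end-to-end pipeline for the large-scale nonlinear case is sketched below: it draws a random basis of 100 training points (matching the reduced kernel size used here), maps the data onto reduced-kernel features in chunks so that the full kernel matrix is never stored, and then trains the linear SG-GPSVM of Algorithm 1 on those features. It assumes the rbf_kernel_features and sggpsvm_train helpers sketched earlier, and the chunk size, gamma, k, and T values are illustrative.

```python
import numpy as np

def map_in_chunks(X, basis, gamma, chunk=100_000):
    """Map a large dataset onto reduced-kernel features in chunks, producing
    an m-by-len(basis) feature matrix instead of an m-by-m kernel matrix."""
    parts = []
    for start in range(0, len(X), chunk):
        parts.append(rbf_kernel_features(X[start:start + chunk], basis, gamma))
    return np.vstack(parts)

# Hedged end-to-end sketch for one large dataset:
# rng = np.random.default_rng(0)
# basis = X_train[rng.choice(len(X_train), size=100, replace=False)]  # kernel size 100
# K_tr = map_in_chunks(X_train, basis, gamma=0.1)
# K_te = map_in_chunks(X_test, basis, gamma=0.1)
# u = sggpsvm_train(K_tr, y_train, k=512, T=20000)
# acc = np.mean(sggpsvm_predict(u, K_te) == y_test)
```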

6. Conclusions

In this paper, we used the generalized pinball loss function in SVM to perform classification and proposed the SG-GPSVM classifier, which adapts the stochastic subgradient descent method to the generalized pinball SVM. Compared to the hinge loss SVM and Pegasos, the major advantage of our proposed method is that SG-GPSVM is less sensitive to noise, especially feature noise around the decision boundary. In addition, we investigated the convergence of SG-GPSVM and the theoretical approximation between GPSVM and SG-GPSVM. The validity of our proposed SG-GPSVM was demonstrated by numerical experiments on artificial datasets and datasets from the UCI repository with noises of different variances. The experimental results clearly showed that our suggested SG-GPSVM outperformed the existing classifiers in terms of accuracy and that SG-GPSVM has a significant advantage in handling large-scale classification problems. These results suggest that the SG-GPSVM approach is a strong candidate for solving binary classification problems.
In further work, we would like to consider applications of SG-GPSVM to activity recognition datasets and image retrieval datasets, and we also plan to improve our approach to deal with multicategory classification scenarios.

Author Contributions

Conceptualization, W.P. and R.W.; writing—original draft, W.P.; writing—review and editing, W.P. and R.W. Both authors read and agreed to the published version of the manuscript.

Funding

This research was funded by the NSRF and NU, Thailand, with Grant Number R2564B024.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the referees for careful reading and constructive comments. This research is partially supported by the Development and Promotion of the Gifted in Science and Technology Project and Naresuan University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
  2. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
  3. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999.
  4. Colmenarez, A.J.; Huang, T.S. Face detection with information-based maximum discrimination. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 782–787.
  5. Osuna, E.; Freund, R.; Girosi, F. Training support vector machines: An application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997.
  6. Joachims, T.; Nedellec, C.; Rouveriol, C. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, 21–23 April 1998; pp. 137–142.
  7. Richhariya, B.; Tanveer, M. EEG signal classification using universum support vector machine. Expert Syst. Appl. 2018, 106, 169–182.
  8. Mukherjee, S.; Osuna, E.; Girosi, F. Nonlinear prediction of chaotic time series using a support vector machine. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, USA, 24–26 September 1997.
  9. Ince, H.; Trafalis, T.B. Support vector machine for regression and applications to financial forecasting. In Proceedings of the International Joint Conference on Neural Networks (IEEE-INNS-ENNS), Como, Italy, 27 July 2000.
  10. Huang, Z.; Chen, H.; Hsu, C.-J.; Chen, W.-H.; Wu, S. Credit rating analysis with support vector machines and neural networks: A market comparative study. Decis. Support Syst. 2004, 37, 543–558.
  11. Khemchandani, R.; Jayadeva; Chandra, S. Regularized least squares fuzzy support vector regression for financial time series forecasting. Expert Syst. Appl. 2009, 36, 132–138.
  12. Tao, D.; Tang, X.; Li, X.; Wu, X. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1088–1099.
  13. Pal, M.; Mather, P. Support vector machines for classification in remote sensing. Int. J. Remote Sens. 2005, 26, 1007–1011.
  14. Li, Y.Q.; Guan, C.T. Joint feature re-extraction and classification using an iterative semi-supervised support vector machine algorithm. Mach. Learn. 2008, 71, 33–53.
  15. Bi, J.; Zhang, T. Support vector classification with input data uncertainty. In Proceedings of the Neural Information Processing Systems Conference, Vancouver, BC, Canada, 13–18 December 2004.
  16. Huang, X.; Shi, L.; Suykens, J.A.K. Support vector machine classifier with pinball loss. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 984–997.
  17. Rastogi, R.; Pal, A.; Chandra, S. Generalized pinball loss SVMs. Neurocomputing 2018, 322, 151–165.
  18. Ñanculef, R.; Frandi, E.; Sartori, C.; Allende, H. A novel Frank–Wolfe algorithm: Analysis and applications to large-scale SVM training. Inf. Sci. 2014, 285, 66–99.
  19. Xu, J.; Xu, C.; Zou, B.; Tang, Y.Y.; Peng, J.; You, X. New incremental learning algorithm with support vector machines. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 2230–2241.
  20. Platt, J. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods—Support Vector Learning; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208.
  21. Mangasarian, O.; Musicant, D. Successive overrelaxation for support vector machines. IEEE Trans. Neural Netw. 1999, 10, 1032–1037.
  22. Fan, R.; Chang, K.; Hsieh, C.; Wang, X.; Lin, C. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
  23. Zhang, T. Solving large-scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004.
  24. Xu, W. Towards optimal one pass large-scale learning with averaged stochastic gradient descent. arXiv 2011, arXiv:1107.2490.
  25. Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Math. Program. 2011, 127, 3–30.
  26. Alencar, M.; Oliveira, D.J. Online learning early skip decision method for the HEVC inter process using the SVM-based Pegasos algorithm. Electron. Lett. 2016, 52, 1227–1229.
  27. Reyes-Ortiz, J.; Oneto, L.; Anguita, D. Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf. Procedia Comput. Sci. 2015, 53, 121–130.
  28. Sopyla, K.; Drozda, P. Stochastic gradient descent with Barzilai–Borwein update step for SVM. Inf. Sci. 2015, 316, 218–233.
  29. Dua, D.; Taniskidou, E.K. UCI Machine Learning Repository; School of Information and Computer Science, University of California: Irvine, CA, USA, 2019. Available online: http://archive.ics.uci.edu/ml (accessed on 24 September 2018).
  30. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
  31. Xu, Y.; Yang, Z.; Pan, X. A novel twin support-vector machine with pinball loss. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 359–370.
  32. Shao, Y.; Zhang, C.; Wang, X.; Deng, N. Improvements on twin support vector machines. IEEE Trans. Neural Netw. 2011, 22, 962–968.
  33. Khemchandani, R.; Jayadeva; Chandra, S. Optimal kernel selection in twin support vector machines. Optim. Lett. 2009, 3, 77–88.
  34. Rudin, W. Principles of Mathematical Analysis, 3rd ed.; McGraw-Hill: New York, NY, USA, 1964.
  35. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  36. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000.
  37. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004.
  38. Lee, Y.; Mangasarian, O. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, Chicago, IL, USA, 5–7 April 2001.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
