Article

Stochastic Subgradient for Large-Scale Support Vector Machine Using the Generalized Pinball Loss Function

1 Department of Mathematics, Faculty of Science, Naresuan University, Phitsanulok 65000, Thailand
2 Research Center for Academic Excellence in Mathematics, Naresuan University, Phitsanulok 65000, Thailand
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(9), 1652; https://doi.org/10.3390/sym13091652
Submission received: 18 July 2021 / Revised: 29 August 2021 / Accepted: 1 September 2021 / Published: 8 September 2021
(This article belongs to the Section Mathematics)

Abstract:
In this paper, we propose a stochastic gradient descent algorithm, called stochastic gradient descent method-based generalized pinball support vector machine (SG-GPSVM), to solve data classification problems. This approach was developed by replacing the hinge loss function in the conventional support vector machine (SVM) with a generalized pinball loss function. We show that SG-GPSVM is convergent and that it approximates the conventional generalized pinball support vector machine (GPSVM). Further, the symmetric kernel method was adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier. Our suggested algorithm surpasses existing methods in terms of noise insensitivity, resampling stability, and accuracy for large-scale data scenarios, according to the experimental results.

1. Introduction

Support vector machine (SVM) is a popular supervised binary classification algorithm based on statistical learning theory. Initially proposed by Vapnik [1,2,3], it has been gaining more and more attention. There have been many algorithmic and modeling variations of it, and it is a powerful pattern classification tool that has had many applications in various fields during recent years, including face detection [4,5], text categorization [6], electroencephalogram signal classification [7], financial regression [8,9,10,11], image retrieval [12], remote sensing [13], feature extraction [14], etc.
The idea of the conventional SVM is to find an optimal hyperplane that yields the greatest separation between the different categories of observations. The conventional SVM uses the well-known hinge loss function; however, it is sensitive to noise, especially noise around the decision boundary, and is not stable under resampling [15]. As a result, many researchers have proposed new SVM methods by changing the loss function. The SVM model with the pinball loss function (Pin-SVM) was proposed by Huang [16] to treat noise sensitivity and instability in resampling; the resulting classifier is less sensitive to noise and is related to the quantile distance. On the other hand, Pin-SVM cannot achieve sparsity. To achieve sparsity, a modified $\epsilon$-insensitive zone for Pin-SVM was proposed. This method does not consider the patterns that lie in the insensitive zone while building the classifier, and its formulation requires the value of $\epsilon$ to be specified beforehand, so a bad choice may affect its performance. Motivated by these developments, Rastogi [17] recently proposed the modified $(\epsilon_1, \epsilon_2)$-insensitive zone support vector machine. This method is an extension of existing loss functions that accounts for noise sensitivity and resampling stability.
However, practical problems require processing large-scale datasets, and the existing solvers are not computationally efficient. Since the generalized pinball loss SVM (GPSVM) still needs to solve a large quadratic programming problem (QPP), either in the primal or in the dual, these techniques have difficulty handling large-scale problems [18,19]. Different solution methods have been proposed for large-scale SVM problems. Optimization techniques such as sequential minimal optimization (SMO) [20], successive over-relaxation (SOR) [21], and the dual coordinate descent method (DCD) [22] have been proposed to solve the SVM problem and its dual. However, these methods necessitate computing the inverse of large matrices; consequently, the dual solutions of SVM cannot be effectively applied to large-scale problems. Moreover, a main challenge for the conventional SVM is the high computational complexity in the number of training samples, i.e., $O(m^3)$, where m is the total number of training samples. As a result, the stochastic gradient descent algorithm (SGD) [23,24] has been proposed to solve the primal problem of SVM. As an application of the stochastic subgradient descent method, Pegasos [25] obtains an $\epsilon$-accurate solution for the primal problem in $\tilde{O}(1/\epsilon)$ iterations with a cost per iteration of $O(n)$, where n is the feature dimension. This technique partitions a large-scale problem into a series of subproblems by stochastic sampling of a suitable size. SGD for SVM has been shown to be the fastest method among SVM-type classifiers for large-scale problems [23,26,27,28].
In order to overcome the above-mentioned limitations for large-scale problems, and inspired by the studies of SVM and the generalized pinball loss function, we propose a novel stochastic subgradient descent method with the generalized pinball support vector machine (SG-GPSVM). The proposed technique is an efficient method for real-world datasets, especially large-scale ones. Furthermore, we prove theorems that guarantee the approximation and convergence properties of our proposed algorithm. Finally, the experimental results show that the proposed SG-GPSVM approach outperforms the existing approaches in terms of accuracy and converges faster and more efficiently than the conventional SVM. The results also show that our proposed SG-GPSVM is noise insensitive, more stable under resampling, and able to handle large-scale problems.
The structure of the paper is as follows. Section 2 outlines the background. SG-GPSVM is proposed in Section 3, which includes both linear and nonlinear cases. In Section 4, we theoretically compare the proposed SG-GPSVM with two other algorithms, i.e., conventional SVM and Pegasos [25]. Experiments on benchmark datasets, available in the UCI Machine Learning Repository [29], with different noises are conducted to verify the effectiveness of our SG-GPSVM. The results are given in Section 5. Finally, the conclusion is given in Section 6.

2. Related Work and Background

The purpose of this section is to review related methods for binary classification problems and to fix the notation used throughout the paper. We consider a binary classification problem with m data points in the n-dimensional Euclidean space $\mathbb{R}^n$. We denote the set of training samples by $X \in \mathbb{R}^{n \times m}$, where each sample $x \in \mathbb{R}^n$ carries a label $y \in \{-1, 1\}$. Below, we give a brief outline of several related methods.

Support Vector Machine

The SVM model maximizes the distance between the two bounding hyperplanes of the classes, so it is generally formulated as a convex quadratic programming problem. Let $\|\cdot\|$ denote the Euclidean norm (two-norm) of a vector in $\mathbb{R}^n$. Given a training set $S = \{(x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\} : i = 1, 2, 3, \dots, m\}$, the strategy of SVM is to find the maximum-margin separating hyperplane $w^\top x + b = 0$ between the two classes by solving the following problem:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,m, \tag{1}$$
where $C$ is a penalty parameter and $\xi_i$ are the slack variables. By introducing the Lagrangian multipliers $\alpha_i$, we derive its dual QPP as follows:
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j (x_i^\top x_j)\,\alpha_i\alpha_j - \sum_{i=1}^{m}\alpha_i \quad \text{s.t.}\quad \sum_{i=1}^{m} y_i\alpha_i = 0,\ \ 0 \le \alpha_i \le C,\ \ i = 1,\dots,m. \tag{2}$$
After optimizing this dual QPP, we obtain the following decision function:
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N_{SV}} \alpha_i^{*} y_i (x_i^\top x) + b\right), \tag{3}$$
where $\alpha^{*}$ is the solution of the dual problem (Equation (2)) and $N_{SV}$ is the number of support vectors, i.e., those satisfying $0 < \alpha_i < C$. In fact, the SVM problem (Equation (1)) can be rewritten as the following unconstrained optimization problem [30]:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m} L_{hinge}\big(1 - y_i(w^\top x_i + b)\big), \tag{4}$$
where $L_{hinge}\big(1 - y_i(w^\top x_i + b)\big) = \max\big(0,\ 1 - y_i(w^\top x_i + b)\big)$ is the so-called hinge loss function. This loss is related to the shortest distance between the sets, and the corresponding classifier is therefore sensitive to noise and unstable under resampling [31]. To address the noise sensitivity, Huang [16] proposed using the pinball loss function instead of the hinge loss function in the SVM classifier (Pin-SVM). This model also penalizes correctly classified samples, as is evident from the pinball loss function, which is defined as follows:
$$L_{pin}(v) = \begin{cases} v, & v \ge 0, \\ -\tau v, & v < 0, \end{cases} \tag{5}$$
where $v = 1 - y_i(w^\top x_i + b)$ and $\tau \ge 0$. Although the pinball loss function achieves noise insensitivity, it cannot achieve sparsity, because its subgradient is nonzero almost everywhere. Therefore, to achieve sparsity, in the same publication, Huang [16] proposed the $\epsilon$-insensitive pinball loss function, which is insensitive to noise and stable under resampling. The $\epsilon$-insensitive pinball loss function is defined by:
$$L_{pin}^{\epsilon}(v) = \begin{cases} v - \epsilon, & v > \epsilon, \\[1mm] 0, & -\dfrac{\epsilon}{\tau} \le v \le \epsilon, \\[1mm] -\tau\left(v + \dfrac{\epsilon}{\tau}\right), & v < -\dfrac{\epsilon}{\tau}, \end{cases} \tag{6}$$
where $\tau \ge 0$ and $\epsilon \ge 0$ are user-defined parameters. On the other hand, the $\epsilon$-insensitive pinball loss function requires an appropriate choice of this parameter. Rastogi [17] introduced the $(\epsilon_1, \epsilon_2)$-insensitive zone pinball loss function, also known as the generalized pinball loss function, which is defined by:
$$L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}(v) = \begin{cases} \tau_1\left(v - \dfrac{\epsilon_1}{\tau_1}\right), & v > \dfrac{\epsilon_1}{\tau_1}, \\[1mm] 0, & -\dfrac{\epsilon_2}{\tau_2} \le v \le \dfrac{\epsilon_1}{\tau_1}, \\[1mm] -\tau_2\left(v + \dfrac{\epsilon_2}{\tau_2}\right), & v < -\dfrac{\epsilon_2}{\tau_2}, \end{cases} \tag{7}$$
where $\tau_1, \tau_2, \epsilon_1, \epsilon_2$ are user-defined parameters. After employing the generalized pinball loss function instead of the hinge loss function in problem (1), we obtain the following optimization problem:
$$\begin{aligned} \min_{w,b,\xi}\ & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\ \text{s.t.}\ & y_i(w^\top x_i + b) \ge 1 - \frac{1}{\tau_1}(\xi_i + \epsilon_1), \quad i = 1, 2, 3, \dots, m, \\ & y_i(w^\top x_i + b) \le 1 + \frac{1}{\tau_2}(\xi_i + \epsilon_2), \quad i = 1, 2, 3, \dots, m, \\ & \xi_i \ge 0, \quad i = 1, 2, 3, \dots, m. \end{aligned} \tag{8}$$
By introducing the Lagrangian multipliers $\alpha_i, \beta_i$, we derive its dual QPP as follows:
$$\begin{aligned} \min_{\alpha,\beta}\ & \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\alpha_i - \beta_i)(\alpha_j - \beta_j)\, y_i y_j (x_i^\top x_j) - \left(1 - \frac{\epsilon_1}{\tau_1}\right)\sum_{i=1}^{m}\alpha_i + \left(1 + \frac{\epsilon_2}{\tau_2}\right)\sum_{i=1}^{m}\beta_i \\ \text{s.t.}\ & \sum_{i=1}^{m}(\alpha_i - \beta_i) y_i = 0, \quad 0 \le \frac{\alpha_i}{\tau_1} + \frac{\beta_i}{\tau_2} \le C,\ \ i = 1,\dots,m. \end{aligned} \tag{9}$$
A new data point $x \in \mathbb{R}^n$ is classified as positive or negative according to the final decision function:
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{m}(\alpha_i^{*} - \beta_i^{*})\, y_i (x_i^\top x) + b\right), \tag{10}$$
where $\alpha^{*}$ and $\beta^{*}$ are the solutions of the dual problem (Equation (9)).
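To make the role of the four parameters concrete, the following NumPy sketch evaluates the generalized pinball loss of Equation (7); the function name and defaults are ours, not the authors'. Choosing $\tau_1 = 1$, $\epsilon_1 = \epsilon$, $\tau_2 = \tau$, $\epsilon_2 = \epsilon$ recovers the $\epsilon$-insensitive pinball loss of Equation (6), and letting $\tau_2 \to 0$ (with $\tau_1 = 1$, $\epsilon_1 = 0$) approaches the hinge loss.

```python
import numpy as np

def generalized_pinball_loss(v, tau1, tau2, eps1, eps2):
    """Generalized pinball loss of Eq. (7), evaluated elementwise on v = 1 - y*(w^T x + b)."""
    v = np.asarray(v, dtype=float)
    upper = eps1 / tau1            # right kink of the insensitive zone
    lower = -eps2 / tau2           # left kink of the insensitive zone
    loss = np.zeros_like(v)        # zero inside [lower, upper]
    loss[v > upper] = tau1 * (v[v > upper] - upper)
    loss[v < lower] = -tau2 * (v[v < lower] + eps2 / tau2)
    return loss
```

For instance, `generalized_pinball_loss(np.array([-2.0, 0.0, 2.0]), 1, 1, 0.5, 0.5)` returns `[1.5, 0.0, 1.5]`, showing that both large positive and large negative values of v are penalized while the insensitive zone contributes nothing.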
In the next section, we go over the core of our proposed technique and present its pseudocode.

3. Proposed Stochastic Subgradient Generalized Pinball Support Vector Machine

For our SG-GPSVM formulation, which is based on the generalized pinball loss function, we apply the stochastic subgradient approach in this section. Our SG-GPSVM can be used in both linear and nonlinear cases.

3.1. Linear Case

Following the approach used to formulate the SVM problem (Equation (4)), we incorporate the generalized pinball loss function (Equation (7)) into the objective function to obtain the following convex unconstrained minimization problem:
$$\min_{u}\ f(u) = \frac{1}{2}\|u\|^2 + \frac{C}{m}\sum_{i=1}^{m} L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i u^\top z_i\big), \tag{11}$$
where $u = (w^\top, b)^\top$ and $z_i = (x_i^\top, 1)^\top$, $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are the weight vector and bias, respectively, and $C > 0$ is the penalty parameter. To apply the stochastic subgradient approach, at each iteration t, we propose a more general scheme that uses k samples, where $1 \le k \le m$. We choose a subset $A_t \subseteq [m]$ with $|A_t| = k$, where the k samples are drawn uniformly at random from the training set and $[m] = \{1, 2, 3, \dots, m\}$. Let us consider a model of stochastic optimization problems. Let $u_t$ denote the current hyperplane iterate; we then obtain the approximate objective function:
$$f(u_t) = \frac{1}{2}\|u_t\|^2 + \frac{C}{k}\sum_{i \in A_t} L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i u_t^\top z_i\big). \tag{12}$$
When $k = m$, the approximate objective function at each iteration t is the original objective function in problem (11). Let $\nabla_t$ be a subgradient of $f(u_t)$ associated with the minibatch index set $A_t$ at the point $u_t$, that is:
$$\nabla_t \in \partial f(u_t) = u_t - \frac{C}{k}\sum_{i \in A_t} y_i z_i\, \partial L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i u_t^\top z_i\big), \tag{13}$$
where:
$$\partial L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}(v) = \begin{cases} \{\tau_1\}, & v > \dfrac{\epsilon_1}{\tau_1}, \\[1mm] [0, \tau_1], & v = \dfrac{\epsilon_1}{\tau_1}, \\[1mm] \{0\}, & -\dfrac{\epsilon_2}{\tau_2} < v < \dfrac{\epsilon_1}{\tau_1}, \\[1mm] [-\tau_2, 0], & v = -\dfrac{\epsilon_2}{\tau_2}, \\[1mm] \{-\tau_2\}, & v < -\dfrac{\epsilon_2}{\tau_2}. \end{cases} \tag{14}$$
With the above notation and the existence of $\rho_i \in \partial L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i u^\top z_i\big)$, $\nabla_t$ can be written as:
$$\nabla_t = u_t - \frac{C}{k}\sum_{i \in A_t} \rho_i y_i z_i. \tag{15}$$
Then, $u_{t+1} = u_t - \eta_t \nabla_t$ is updated using the step size $\eta_t = 1/t$. Additionally, a new sample z is predicted by:
$$y = \operatorname{sgn}(u^\top z). \tag{16}$$
The above steps can be outlined as Algorithm 1.
Algorithm 1 SG-GPSVM.
Input: Training samples represented by $X \in \mathbb{R}^{n \times m}$, positive parameters $c_1, c_2, \tau_1, \tau_2$, $k \in \{1, 2, \dots, m\}$, and tolerance $tol$; typically, $tol = 10^{-4}$.
1: Set $u_1$ to zero;
2: while $t \ge tol$ do
3:      Choose $A_t \subseteq [m]$, where $|A_t| = k$, uniformly at random.
4:      Compute the stochastic subgradient $\nabla_t$ using Equation (15).
5:      Update $u_{t+1}$ by $u_{t+1} = u_t - \eta_t \nabla_t$.
6:      $t = t + 1$
7: end
Output: Optimal hyperplane parameters $u = \frac{1}{T}\sum_{t=1}^{T} u_t$.
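As an illustration, the following NumPy sketch implements the linear SG-GPSVM training loop of Algorithm 1 under our own simplifying assumptions: it runs for a fixed budget of T iterations instead of the tolerance-based stopping rule, and it picks $\rho_i \in \{\tau_1, 0, -\tau_2\}$ according to Equation (14) (taking 0 at the two kinks). It is not the authors' code, and the default parameter values are placeholders.

```python
import numpy as np

def sg_gpsvm_train(X, y, C=1.0, tau1=0.5, tau2=0.5, eps1=0.1, eps2=0.1,
                   k=32, T=10_000, rng=None):
    """Linear SG-GPSVM (sketch of Algorithm 1).

    X : (m, n) training samples, y : (m,) labels in {-1, +1}.
    Returns the averaged iterate u = (w, b) as a length-(n + 1) vector.
    """
    rng = np.random.default_rng(rng)
    m, n = X.shape
    Z = np.hstack([X, np.ones((m, 1))])            # z_i = (x_i, 1), so u = (w, b)
    u = np.zeros(n + 1)
    u_sum = np.zeros(n + 1)

    for t in range(1, T + 1):
        A = rng.choice(m, size=k, replace=False)   # minibatch A_t with |A_t| = k
        v = 1.0 - y[A] * (Z[A] @ u)                # v_i = 1 - y_i u^T z_i
        rho = np.where(v > eps1 / tau1, tau1,      # one element of the set in Eq. (14)
               np.where(v < -eps2 / tau2, -tau2, 0.0))
        grad = u - (C / k) * (rho * y[A]) @ Z[A]   # Eq. (15)
        u = u - (1.0 / t) * grad                   # step size eta_t = 1/t
        u_sum += u

    return u_sum / T                               # averaged output

def sg_gpsvm_predict(u, X):
    """Predict labels with y = sgn(u^T z), Eq. (16)."""
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Z @ u)
```

The averaged iterate returned here plays the role of $u = \frac{1}{T}\sum_{t=1}^{T} u_t$, whose approximation quality is analyzed in Section 4.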

3.2. Nonlinear Case

The support vector machine has the advantage that it can be employed with symmetric kernels rather than requiring direct access to the feature vectors x; that is, instead of considering predictors that are linear functions of the training samples x themselves, one considers predictors that are linear functions of some implicit mapping $\phi(x)$ of the instances. To extend the linear SG-GPSVM to the nonlinear case by the symmetric kernel trick [32,33], symmetric-kernel-generated surfaces are considered instead of hyperplanes; they are given by:
$$w^\top \phi(x) + b = 0.$$
Then, the primal problem for the nonlinear SG-GPSVM is as follows:
$$\min_{u}\ f(u) = \frac{1}{2}\|u\|^2 + \frac{C}{m}\sum_{i=1}^{m} L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i(w^\top \phi(x_i) + b)\big),$$
where $\phi(x)$ is a nonlinear mapping function that maps x into a higher-dimensional feature space. To apply the stochastic subgradient approach, at each iteration t, we again use k samples, where $1 \le k \le m$: we choose a subset $A_t \subseteq [m]$ with $|A_t| = k$, where the k samples are drawn uniformly at random from the training set and $[m] = \{1, 2, 3, \dots, m\}$. Consider the approximate objective function:
$$f(u_t) = \frac{1}{2}\|u_t\|^2 + \frac{C}{k}\sum_{i \in A_t} L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i(w_t^\top \phi(x_i) + b)\big).$$
Then, consider the subgradient of this approximate objective, and let $\nabla_t$ be a subgradient of f at $u_t$, that is:
$$\nabla_t \in \partial f(u_t) = u_t - \frac{C}{k}\sum_{i \in A_t} y_i \phi(x_i)\, \partial L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i(w_t^\top \phi(x_i) + b)\big),$$
where $\partial L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}$ is defined in Equation (14) and, as in the linear case, $\rho_i \in \partial L_{\tau_1,\tau_2}^{\epsilon_1,\epsilon_2}\big(1 - y_i(w^\top \phi(x_i) + b)\big)$ exists. Then, $\nabla_t$ can be written as:
$$\nabla_t = u_t - \frac{C}{k}\sum_{i \in A_t} \rho_i y_i \phi(x_i).$$
Then, update $u_{t+1} = u_t - \eta_t \nabla_t$ using the step size $\eta_t = 1/t$. Additionally, a new sample x can be predicted by:
$$y = \operatorname{sgn}\big(w^\top \phi(x) + b\big).$$
The above steps can be outlined as Algorithm 2.
Algorithm 2 Nonlinear SG-GPSVM.
Input: Training samples represented by $X \in \mathbb{R}^{n \times m}$, positive parameters $c_1, c_2, \tau_1, \tau_2$, $k \in \{1, 2, \dots, m\}$, and tolerance $tol$; typically, $tol = 10^{-4}$.
1: Set $u_1$ to zero;
2: while $t \ge tol$ do
3:      Choose $A_t \subseteq [m]$, where $|A_t| = k$, uniformly at random.
4:      Compute the stochastic subgradient $\nabla_t$ using Equation (20).
5:      Update $u_{t+1}$ by $u_{t+1} = u_t - \eta_t \nabla_t$.
6:      $t = t + 1$
7: end
Output: Optimal hyperplane parameters $u = \frac{1}{T}\sum_{t=1}^{T} u_t$.
Note that the mapping $\phi(x)$ is never specified explicitly; it enters only through a symmetric kernel operator $K(x, x') = \phi(x)^\top \phi(x')$, which yields the inner products after the mapping $\phi$.
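One concrete way to realize this in practice (a sketch under our own assumptions, not the paper's exact implementation) is the reduced-kernel construction [38] used later in Section 5.3: fix a small reference subset of the training data, map every sample $x$ to the explicit feature vector $\big(K(x, \tilde{x}_1), \dots, K(x, \tilde{x}_r)\big)$ with a symmetric RBF kernel $K(x, x') = \exp(-\gamma\|x - x'\|^2)$, and then reuse the linear SG-GPSVM of Algorithm 1 on these kernel features.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Symmetric RBF kernel K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * sq)

def reduced_kernel_features(X, X_ref, gamma):
    """Explicit kernel features of X against a small reference set (reduced kernel)."""
    return rbf_kernel(X, X_ref, gamma)

# Hypothetical usage: pick r reference points (r = 100 in Section 5.3),
# transform the data, and reuse the linear solver sketched after Algorithm 1.
# X_ref = X[np.random.default_rng(0).choice(len(X), size=100, replace=False)]
# u = sg_gpsvm_train(reduced_kernel_features(X, X_ref, gamma=0.1), y)
```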

4. Convergence Analysis

In this section, we analyze the convergence of the proposed SG-GPSVM model. For convenience, we only consider the linear optimization problem (Equation (11)); the conclusions for the nonlinear algorithm can be obtained similarly. Starting from $u_1$ and using the step size $\eta_t = 1/t$, the iterate $u_{t+1}$ for $t \ge 1$ is updated by:
$$u_{t+1} = u_t - \eta_t \nabla_t,$$
i.e.,
$$u_{t+1} = u_t - \eta_t \nabla_t = u_t - \frac{1}{t}\left(u_t - \frac{C}{k}\sum_{i \in A_t}\rho_i y_i z_i\right) = \left(1 - \frac{1}{t}\right)u_t + \frac{C}{kt}\sum_{i \in A_t}\rho_i y_i z_i.$$
To prove the convergence of Algorithm 1, we first establish the boundedness of $u_t$. In fact, we have the following lemma.
Lemma 1. 
The sequences $\{\nabla_t : t = 1, 2, \dots\}$ and $\{u_t : t = 1, 2, \dots\}$ are bounded in norm, where $\nabla_t$ and $u_t$ are defined by Equations (15) and (22), respectively.
Proof of Lemma 1. 
Equation (22) can be rewritten as:
$$u_{t+1} = G_t u_t + \frac{1}{t} v_t,$$
where $G_t = \left(1 - \frac{1}{t}\right) I$, $I$ is the identity matrix, and $v_t = \frac{C}{k}\sum_{i \in A_t}\rho_i y_i z_i$. For $t \ge 2$, $G_t$ is positive definite, and the largest eigenvalue $\lambda_t$ of $G_t$ is equal to $\frac{t-1}{t}$. From Equation (23), we have:
$$u_{t+1} = \prod_{i=2}^{t} G_i u_2 + \sum_{i=2}^{t} \bar{G}_i,$$
where the vector terms $\bar{G}_i$ (written with a bar to distinguish them from the matrices $G_i$) are given by:
$$\bar{G}_i = \begin{cases} \dfrac{1}{i}\displaystyle\prod_{j=i+1}^{t} G_j v_i, & \text{if } i < t, \\[2mm] \dfrac{1}{i} v_i, & \text{if } i = t. \end{cases}$$
For any $i \ge 2$,
$$\|G_i u_2\| \le \lambda_i \|u_2\| = \frac{i-1}{i}\|u_2\|.$$
Therefore,
$$\left\|\prod_{i=2}^{t} G_i u_2\right\| = \|G_2 G_3 \cdots G_t u_2\| \le \|G_2\|\,\|G_3\|\cdots\|G_t\|\,\|u_2\| \le \frac{2-1}{2}\cdot\frac{3-1}{3}\cdots\frac{t-1}{t}\,\|u_2\| = \frac{1}{t}\|u_2\|.$$
Next, consider $\|\bar{G}_i\|$. In Case I, for $i < t$, we have:
$$\|\bar{G}_i\| = \left\|\frac{1}{i}\prod_{j=i+1}^{t} G_j v_i\right\| = \frac{1}{i}\|G_{i+1} G_{i+2}\cdots G_t v_i\| \le \frac{1}{i}\,\|G_{i+1}\|\,\|G_{i+2}\|\cdots\|G_t\|\,\|v_i\| \le \frac{1}{i}\cdot\frac{(i+1)-1}{i+1}\cdot\frac{(i+2)-1}{i+2}\cdots\frac{t-1}{t}\,\|v_i\| = \frac{1}{t}\|v_i\| \le \frac{1}{t}\max_{i<t}\|v_i\|,$$
and in Case II, for $i = t$, we have:
$$\|\bar{G}_i\| = \frac{1}{i}\|v_i\| = \frac{1}{t}\|v_t\| \le \frac{1}{t}\max_{i=t}\|v_i\|.$$
From Case I and Case II, we have:
$$\|\bar{G}_i\| \le \frac{1}{t}\max_{i \le t}\|v_i\|.$$
Thus,
$$\begin{aligned} \|u_{t+1}\| &= \left\|\prod_{i=2}^{t} G_i u_2 + \sum_{i=2}^{t}\bar{G}_i\right\| \le \frac{1}{t}\|u_2\| + \sum_{i=2}^{t}\frac{1}{t}\max_{i \le t}\|v_i\| \\ &= \frac{1}{t}\|u_2\| + \frac{t-1}{t}\max_{i \le t}\|v_i\| \le \|u_2\| + C\max\{\tau_1,\tau_2\}\max_{i\in[m]}\|z_i\|. \end{aligned}$$
Let $M_0 = \max_{i\in[m]}\|z_i\|$ be the largest norm of the samples in the dataset and:
$$M_1 = \max\big\{\|u_1\|,\ \|u_2\| + C\max\{\tau_1,\tau_2\} M_0\big\},$$
that is, $\|u_t\| \le M_1$ for $t = 1, 2, \dots$. For $t \ge 1$,
$$\|\nabla_t\| = \left\|u_t - \frac{C}{k}\sum_{i\in A_t}\rho_i y_i z_i\right\| \le \|u_t\| + \left\|\frac{C}{k}\sum_{i\in A_t}\rho_i y_i z_i\right\| \le M_1 + C\max\{\tau_1,\tau_2\} M_0 = M_2.$$
 □
Using Lemma 1, the following theorem establishes the convergence of Algorithm 1.
Theorem 1. 
When using Algorithm 1 to solve the optimization problem Equation (11), the iterative Equation (22) is convergent.
Proof of Theorem 1. 
From Equation (26) in the proof of Lemma 1, we therefore obtain:
$$\lim_{t\to\infty}\left\|\prod_{i=2}^{t} G_i u_2\right\| = 0,$$
which implies that:
$$\lim_{t\to\infty}\prod_{i=2}^{t} G_i u_2 = 0.$$
By using Equation (30), we have:
$$\sum_{i=2}^{t}\|\bar{G}_i\| \le \sum_{i=2}^{t}\frac{1}{t}\max_{i \le t}\|v_i\| \le C\max\{\tau_1,\tau_2\}\max_{i\in[m]}\|z_i\|,$$
which indicates that:
$$\lim_{t\to\infty}\sum_{i=2}^{t}\|\bar{G}_i\| < \infty.$$
Note that an infinite series of vectors is convergent if its norm series is convergent [34]. Therefore, the following limit exists:
$$\lim_{t\to\infty}\sum_{i=2}^{t}\bar{G}_i \ \text{exists}.$$
Combining Equations (33) and (36), we conclude that the sequence $\{u_{t+1}\}$ converges as $t \to \infty$. □
The following theorem gives the relation between GPSVM and our proposed SG-GPSVM.
Theorem 2. 
Let f be the convex function and $u_t$ the iterate defined by Equations (11) and (21), respectively. If $u = \frac{1}{T}\sum_{t=1}^{T} u_t$ is the solution output by SG-GPSVM, then:
$$f(u) \le f(u^{*}) + M_2\big(M_1 + \|u^{*}\|\big) + \frac{M_2^2}{2T}(1 + \ln T),$$
where $M_1$ and $M_2$ are the upper bounds of $\|u_t\|$ and $\|\nabla_t\|$, respectively.
Proof of Theorem 2. 
By the definition of u , we have:
$$f(u) - f(u^{*}) = f\left(\frac{1}{T}\sum_{t=1}^{T}u_t\right) - f(u^{*}) \le \frac{1}{T}\sum_{t=1}^{T}f(u_t) - f(u^{*}) = \frac{1}{T}\sum_{t=1}^{T}\big(f(u_t) - f(u^{*})\big).$$
As f is convex and $\nabla_t$ is a subgradient of f at $u_t$, we have:
$$f(u_t) - f(u^{*}) \le (u_t - u^{*})^\top \nabla_t.$$
By summing over t = 1 to T and dividing by T, we obtain the following inequality:
$$\frac{1}{T}\sum_{t=1}^{T}\big(f(u_t) - f(u^{*})\big) \le \frac{1}{T}\sum_{t=1}^{T}(u_t - u^{*})^\top \nabla_t.$$
Since $u_{t+1} = u_t - \eta_t \nabla_t$, we have:
$$(u_t - u^{*})^\top \nabla_t = \frac{1}{2\eta_t}\big(\|u_t - u^{*}\|^2 - \|u_{t+1} - u^{*}\|^2\big) + \frac{\eta_t}{2}\|\nabla_t\|^2.$$
Summing over t = 1 to T, we obtain:
$$\begin{aligned}
\sum_{t=1}^{T}(u_t - u^{*})^\top \nabla_t &= \frac{1}{2}\sum_{t=1}^{T}\frac{1}{\eta_t}\big(\|u_t - u^{*}\|^2 - \|u_{t+1} - u^{*}\|^2\big) + \frac{1}{2}\sum_{t=1}^{T}\eta_t\|\nabla_t\|^2 \\
&= \frac{1}{2}\left(\sum_{t=1}^{T} t\,\|u_t - u^{*}\|^2 - \sum_{t=1}^{T} t\,\|u_{t+1} - u^{*}\|^2\right) + \frac{1}{2}\sum_{t=1}^{T}\eta_t\|\nabla_t\|^2 \\
&= \frac{1}{2}\left(\sum_{t=1}^{T}\|u_t - u^{*}\|^2 - T\,\|u_{T+1} - u^{*}\|^2\right) + \frac{1}{2}\sum_{t=1}^{T}\eta_t\|\nabla_t\|^2 \\
&\le \big(M_1 + \|u^{*}\|\big)\sum_{t=1}^{T}\|u_{T+1} - u_t\| + \frac{1}{2}M_2^2(1 + \ln T) \\
&= \big(M_1 + \|u^{*}\|\big)\sum_{t=1}^{T}\left\|\sum_{i=t}^{T}\frac{1}{i}\nabla_i\right\| + \frac{1}{2}M_2^2(1 + \ln T) \\
&\le T M_2 \big(M_1 + \|u^{*}\|\big) + \frac{1}{2}M_2^2(1 + \ln T).
\end{aligned}$$
Multiplying Equation (41) by 1 / T , we have:
$$\frac{1}{T}\sum_{t=1}^{T}(u_t - u^{*})^\top \nabla_t \le M_2\big(M_1 + \|u^{*}\|\big) + \frac{1}{2T}M_2^2(1 + \ln T).$$
By Equations (37), (39), and (42), we have our result:
$$f(u) \le f(u^{*}) + M_2\big(M_1 + \|u^{*}\|\big) + \frac{1}{2T}M_2^2(1 + \ln T).$$
 □
Theorem 2 shows that after T iterations, the resulting expected error is bounded by $O\big(\frac{1+\ln T}{T}\big)$; hence, $f(u)$ approximates $f(u^{*})$, that is, the average instantaneous objective of SG-GPSVM approaches the objective value of GPSVM.
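As a quick numerical illustration of this rate (our own arithmetic, not a result from the paper), the factor $(1 + \ln T)/T$ appearing in the bound decays as follows:

```python
import math

# Decay of the (1 + ln T) / T factor that multiplies M_2^2 / 2 in Theorem 2.
for T in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(T, (1 + math.log(T)) / T)
# 100      -> about 5.6e-02
# 1000     -> about 7.9e-03
# 10000    -> about 1.0e-03
# 100000   -> about 1.3e-04
# 1000000  -> about 1.5e-05
```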

5. Numerical Experiments

In this section, to demonstrate the validity of our proposed SG-GPSVM, we compare SG-GPSVM with the conventional SVM [35] and Pegasos [25] using artificial datasets and datasets from the UCI Machine Learning Repository [29] contaminated with noise of different variances. All of the experiments were performed with Python 3.6.3 on a Windows 8 machine with an Intel i5 processor at 2.50 GHz and 4 GB of RAM.

5.1. Artificial Datasets

We conducted experiments on a two-dimensional example in which the samples came from two Gaussian distributions with equal probability: $x_i,\ i \in \{i : y_i = 1\} \sim N(\mu_1, \Sigma_1)$ and $x_i,\ i \in \{i : y_i = -1\} \sim N(\mu_2, \Sigma_2)$, where $\mu_1 = [1, 3]$, $\mu_2 = [-1, -3]$, and $\Sigma_1 = \Sigma_2 = \begin{pmatrix} 0.2 & 0 \\ 0 & 3 \end{pmatrix}$. We added noise to the dataset. The labels of the noise points were selected from $\{-1, 1\}$ with equal probabilities. Each noisy sample was drawn from a Gaussian distribution $N(\mu_n, \Sigma_n)$, where $\mu_n = [0, 0]$ and $\Sigma_n = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$. This noise affects the labels around the boundary. The level of noise was controlled by the ratio r of noisy data in the training set. The value of r was fixed at $r = 0$ (i.e., noise-free), 0.05, 0.1, and 0.2. All the results of the two algorithms, SG-GPSVM and Pegasos, were derived using the two-dimensional dataset shown in Figure 1. Here, green circles denote samples from Class 1 and red triangles represent samples from Class −1. From Figure 1, we can see that the noisy samples affect the labels around the decision boundary. As we increase the amount of noise from $r = 0\%$ to $r = 20\%$, the hyperplanes of Pegasos start deviating from the ideal slope of 2.14, whereas the deviation in the slopes of the hyperplanes is significantly smaller for our SG-GPSVM. This implies that our proposed algorithm is insensitive to noise around the decision boundary.
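A sketch of this data-generating process is given below; it is our reconstruction, and in particular the sign convention for $\mu_2$ and the per-class sample counts are assumptions, since only the covariances, the noise model, and the ratio r are fully specified above.

```python
import numpy as np

def make_toy_data(n_per_class=200, r=0.1, seed=0):
    """Two Gaussian classes plus a fraction r of boundary noise with random labels."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = np.array([1.0, 3.0]), np.array([-1.0, -3.0])   # mu2 sign assumed
    cov = np.array([[0.2, 0.0], [0.0, 3.0]])
    X_pos = rng.multivariate_normal(mu1, cov, size=n_per_class)
    X_neg = rng.multivariate_normal(mu2, cov, size=n_per_class)
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])

    n_noise = int(r * len(X))                                  # ratio r of noisy points
    cov_n = np.array([[1.0, 0.8], [0.8, 1.0]])
    X_noise = rng.multivariate_normal(np.zeros(2), cov_n, size=n_noise)
    y_noise = rng.choice([-1.0, 1.0], size=n_noise)            # labels with equal probability
    return np.vstack([X, X_noise]), np.hstack([y, y_noise])
```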

5.2. UCI Datasets

We also performed experiments on 11 benchmark datasets available in the UCI Machine Learning Repository. Table 1 lists the datasets and their descriptions. For the benchmarks, we compared SG-GPSVM with two other classical SVM-based classifiers: the conventional SVM and Pegasos. Among these SVM models, Pegasos and SG-GPSVM have extra hyperparameters in addition to the hyperparameter C. The performance of the different algorithms depends on the choice of these parameters [33,36,37]. All the hyperparameters were chosen through a grid search with manual adjustment, and 10-fold cross-validation was employed. For each algorithm, the optimal parameter C was searched in the set $\{10^i \mid i = -2, -1, 0, 1, 2\}$, and the other parameters were searched in the set $\{0.1, 0.25, 0.5, 0.75, 1\}$. Further, the kernel method was adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier, where the RBF kernel was used; its parameter $\gamma$ was searched in the set $\{10^i \mid i = -2, -1, 0, 1, 2\}$. The results of the numerical experiments are shown in Table 2, and the optimal parameters used in Table 2 are summarized in Table 3.
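The tuning procedure can be outlined as follows; this is a simplified sketch under our own assumptions, reusing the hypothetical `sg_gpsvm_train` and `sg_gpsvm_predict` helpers from the Algorithm 1 sketch and showing only two of the hyperparameters in the grid for brevity.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

C_grid = [10.0**i for i in range(-2, 3)]          # {10^i : i = -2, ..., 2}
tau_grid = [0.1, 0.25, 0.5, 0.75, 1.0]            # grid used for the other parameters

def cv_accuracy(X, y, C, tau, n_splits=10):
    """Mean 10-fold cross-validation accuracy for one hyperparameter setting."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for tr, te in skf.split(X, y):
        u = sg_gpsvm_train(X[tr], y[tr], C=C, tau1=tau, tau2=tau)   # sketch from Section 3
        scores.append(np.mean(sg_gpsvm_predict(u, X[te]) == y[te]))
    return float(np.mean(scores))

def grid_search(X, y):
    """Return the (C, tau) pair with the best cross-validated accuracy."""
    return max(itertools.product(C_grid, tau_grid),
               key=lambda p: cv_accuracy(X, y, *p))
```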
Table 2 shows the results of applying the conventional SVM, Pegasos, and our proposed SG-GPSVM with the linear and nonlinear kernels on six different UCI datasets. From the experimental outcomes, it can be seen that the classification performance of SG-GPSVM is better than that of the conventional SVM and Pegasos on most of the datasets in terms of accuracy. From Table 2, our proposed SG-GPSVM yielded the best prediction accuracy on four of the six datasets. In addition, when we varied the ratio of noisy samples from r = 0 (noise-free) to r = 0.2, our proposed SG-GPSVM exhibited better classification accuracy and stability than the other algorithms regardless of the noise factor, as shown in Figure 2. On the Appendicitis dataset, the accuracy of the conventional SVM and Pegasos fluctuated greatly as the noise factor increased; on the contrary, the classification accuracies of our proposed SG-GPSVM were stable. Similarly, on other datasets, such as Monk-3 and Appendicitis, the accuracies of the conventional SVM and Pegasos not only varied greatly but were also worse than those of our proposed algorithm. This demonstrates that our proposed SG-GPSVM is relatively stable regardless of the noise factor. In terms of computational time, our proposed algorithm cost nearly the same as Pegasos. The computational time of Pegasos and our proposed SG-GPSVM on the largest dataset was longer than that of the conventional SVM on the smaller datasets.
The experimental results for three different minibatch sizes for SG-GPSVM (k = 1, 32, 512), in terms of accuracy and computational time, are illustrated in Figure 3. Subfigure (a) shows how the accuracy changed with the batch size, and the curves in Subfigure (b) show how the computational time changed with the batch size. We can see that increasing the minibatch size initially improved the accuracy, while the computational time grew approximately linearly with the batch size. The Spambase and WDBC datasets were considered. It is evident that a moderate minibatch size is sufficient: k = 32 is more accurate than k = 512, and using the large minibatch size k = 512 decreased the accuracy while increasing the computational time. This suggests that in scenarios where accuracy matters most, a relatively moderate batch size should be chosen (it does not need to be large); see the sketch below.
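The batch-size study can be reproduced with a loop of the following shape; it is our sketch, again built on the hypothetical helpers introduced earlier rather than the authors' code.

```python
import time
import numpy as np

def batch_size_study(X_train, y_train, X_test, y_test, sizes=(1, 32, 512)):
    """Accuracy and wall-clock training time of SG-GPSVM for several minibatch sizes k."""
    results = {}
    for k in sizes:
        start = time.perf_counter()
        u = sg_gpsvm_train(X_train, y_train, k=k)        # sketch from Section 3
        elapsed = time.perf_counter() - start
        acc = np.mean(sg_gpsvm_predict(u, X_test) == y_test)
        results[k] = (float(acc), elapsed)
    return results
```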

5.3. Large-Scale Dataset

In order to validate the classification efficiency of SG-GPSVM, we compared our method with the two most closely related methods (the conventional SVM and Pegasos) on four large-scale UCI datasets and show that SG-GPSVM is capable of solving large-scale problems. Note that the experiments in this section were run on a machine equipped with an Intel CPU E5-2658 v3 at 2.20 GHz and 256 GB of RAM running the Ubuntu Linux operating system. The scikit-learn package was used for the conventional SVM [35]. The large-scale datasets are presented in Table 4.
For the four large-scale datasets, we compare the performance of SG-GPSVM against the conventional SVM and Pegasos in Table 5. The reduced kernel [38] was employed in the nonlinear situation, and the kernel size was set to 100. From the results in Table 5, SG-GPSVM outperformed the conventional SVM and Pegasos on three out of the four datasets in terms of accuracy. The best ones are highlighted in bold. However, SG-GPSVM’s accuracy on some datasets, such as Credit card, was not the best. On the Skin dataset, the conventional SVM performed much worse than SG-GPSVM and Pegasos, and it was not possible to use the conventional SVM on the Kddcup and Susy datasets due to the high memory requirements. This was because SVM requires the complete training set to be stored in the main memory. On the Credit card dataset, our proposed SG-GPSVM took almost the same amount of time as the conventional SVM and Pegasos. Our suggested approach, SG-GPSVM, took almost the same amount of time as Pegasos on the Kddcup and Susy datasets.

6. Conclusions

In this paper, we used the generalized pinball loss function in SVM to perform classification and proposed the SG-GPSVM classifier, which adapts the stochastic subgradient descent method to the generalized pinball SVM. Compared to the hinge loss SVM and Pegasos, the major advantage of our proposed method is that SG-GPSVM is less sensitive to noise, especially feature noise around the decision boundary. In addition, we investigated the convergence of SG-GPSVM and the theoretical approximation between GPSVM and SG-GPSVM. The validity of our proposed SG-GPSVM was demonstrated by numerical experiments on artificial datasets and datasets from UCI with noise of different variances. The experimental results clearly showed that our suggested SG-GPSVM outperformed the existing classification approaches in terms of accuracy and that SG-GPSVM has a significant advantage in handling large-scale classification problems. The results imply that the SG-GPSVM approach is a strong candidate for solving binary classification problems.
In further work, we would like to consider applications of SG-GPSVM to activity recognition datasets and image retrieval datasets, and we also plan to improve our approach to deal with multicategory classification scenarios.

Author Contributions

Conceptualization, W.P. and R.W.; writing—original draft, W.P.; writing—review and editing, W.P. and R.W. Both authors read and agreed to the published version of the manuscript.

Funding

This research was funded by the NSRF and NU, Thailand, with Grant Number R2564B024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the referees for careful reading and constructive comments. This research is partially supported by the Development and Promotion of the Gifted in Science and Technology Project and Naresuan University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
  2. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998. [Google Scholar]
  3. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Colmenarez, A.J.; Huang, T.S. Face Detection With Information-Based Maximum Discrimination. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 782–787. [Google Scholar]
  5. Osuna, E.; Freund, R.; Girosi, F. Training support vector machines: An application to face detection. In Proceedings of the IEEE Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997. [Google Scholar]
  6. Joachims, T.; Ndellec, C.; Rouveriol, C. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, 21–23 April 1998; pp. 137–142. [Google Scholar]
  7. Richhariya, B.; Tanveer, M. EEG signal classification using universum support vector machine. Expert Syst. Appl. 2018, 106, 169–182. [Google Scholar] [CrossRef]
  8. Mukherjee, S.; Osuna, E.; Girosi, F. Nonlinear prediction of chaotic time series using a support vector machine. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, USA, 24–26 September 1997. [Google Scholar]
  9. Ince, H.; Trafalis, T.B. Support vector machine for regression and applications to financial forecasting. In Proceedings of the International Joint Conference on Neural Networks (IEEE-INNSENNS), Como, Italy, 27 July 2000. [Google Scholar]
  10. Huang, Z.; Chen, H.; Hsua, C.-J.; Chen, W.-H.; Wu, S. Credit rating analysis with support vector machines and neural networks: A market comparative study. Decis. Support Syst. 2004, 37, 543–558. [Google Scholar] [CrossRef]
  11. Khemchandani, R.; Jayadeva; Chandra, S. Regularized least squares fuzzy support vector regression for financial time series forecasting. Expert Syst. Appl. 2009, 36, 132–138. [Google Scholar] [CrossRef]
  12. Tao, D.; Tang, X.; Li, X.; Wu, X. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 28, 1088–1099. [Google Scholar]
  13. Pal, M.; Mather, P. Support vector machines for classification in remote sensing. Int. J. Remote Sens. 2005, 26, 1007–1011. [Google Scholar] [CrossRef]
  14. Li, Y.Q.; Guan, C.T. Joint feature re-extraction and classification using an iterative semi-supervised support vector machine algorithm. Mach. Learn. 2008, 71, 33–53. [Google Scholar] [CrossRef]
  15. Bi, J.; Zhang, T. Support vector classification with input data uncertainty. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004. [Google Scholar]
  16. Huang, X.; Shi, L.; Suykens, J.A.K. Support vector machine classifier with pinball loss. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 984–997. [Google Scholar] [CrossRef]
  17. Rastogi, R.; Pal, A.; Chandra, S. Generalized pinball loss SVMs. Neurocomputing 2018, 322, 151–165. [Google Scholar] [CrossRef]
  18. Ñanculef, R.; Frandi, E.; Sartori, C.; Allende, H. A Novel Frank-Wolfe Algorithm. Analysis and Applications to Large-Scale SVM Training. Inf. Sci. 2014, 285, 66–99. [Google Scholar] [CrossRef] [Green Version]
  19. Xu, J.; Xu, C.; Zou, B.; Tang, Y.Y.; Peng, J.; You, X. New incremental learning algorithm with support vector machines. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 2230–2241. [Google Scholar] [CrossRef]
  20. Platt, J. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods—Support Vector Learning; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208. [Google Scholar]
  21. Mangasarian, O.; Musicant, D. Successive overrelaxation for support vector machines. IEEE Trans. Neural Netw. 1999, 10, 1032–1037. [Google Scholar] [CrossRef] [Green Version]
  22. Fan, R.; Chang, K.; Hsieh, C.; Wang, X.; Lin, C. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874. [Google Scholar]
  23. Zhang, T. Solving large-scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004. [Google Scholar]
  24. Xu, W. Towards optimal one pass large-scale learning with averaged stochastic gradient descent. arXiv 2011, arXiv:1107.2490. [Google Scholar]
  25. Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; Cotter, A. Pegasos: Primal estimated subgradient solver for SVM. Math. Program. 2011, 127, 3–30. [Google Scholar]
  26. Alencar, M.; Oliveira, D.J. Online learning early skip decision method for the HEVC inter process using the SVM-based Pegasos algorithm. Electron. Lett. 2016, 52, 1227–1229. [Google Scholar]
  27. Reyes-Ortiz, J.; Oneto, L.; Anguita, D. Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf. Procedia Comput. Sci. 2015, 53, 121–130. [Google Scholar] [CrossRef] [Green Version]
  28. Sopyla, K.; Drozda, P. Stochastic gradient descent with Barzilai-Borwein update step for SVM. Inf. Sci. 2015, 316, 218–233. [Google Scholar] [CrossRef]
  29. Dua, D.; Taniskidou, E.K. UCI Machine Learning Repository; School of Information and Computer Science, University of California: Irvine, CA, USA, 2019; Available online: http://archive.ics.uci.edu/ml (accessed on 24 September 2018).
  30. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014; 207p. [Google Scholar]
  31. Xu, Y.; Yang, Z.; Pan, X. A novel twin support-vector machine with pinball loss. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 359–370. [Google Scholar] [CrossRef]
  32. Shao, Y.; Zhang, C.; Wang, X.; Deng, N. Improvements on twin support vector machines. IEEE Trans. Neural Netw. 2011, 22, 962–968. [Google Scholar] [CrossRef] [PubMed]
  33. Khemchandani, R.; Jayadeva; Chandra, S. Optimal kernel selection in twin support vector machines. Optim. Lett. 2009, 3, 77–88. [Google Scholar] [CrossRef]
  34. Rudin, W. Principles of Mathematical Analysis, 3rd ed.; McGraw-Hill: New York, NY, USA, 1964. [Google Scholar]
  35. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  36. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  37. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  38. Lee, Y.; Mangasarian, O. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, Chicago, IL, USA, 5–7 April 2001. [Google Scholar]
Figure 1. SG-GPSVM and Pegasos on a noisy artificial dataset. The four panels demonstrate the noise-insensitive properties of our proposed SG-GPSVM compared to Pegasos when: (a) r = 0 (noise-free); (b) r = 0.05; (c) r = 0.1; (d) r = 0.2.
Figure 2. The accuracy of the four algorithms with different noise factors on six datasets: (a) Appendicitis; (b) Ionosphere; (c) Monk-2; (d) Monk-3; (e) Phoneme; (f) Saheart.
Figure 3. Classification results on the UCI datasets with minibatch SG-GPSVM. (a) Accuracy of minibatch SG-GPSVM. (b) Computational time of minibatch SG-GPSVM.
Table 1. Description of the UCI datasets.

Datasets | # of Samples | # of Features | # in Negative Class | # in Positive Class
Appendicitis | 106 | 7 | 85 | 21
Ionosphere | 351 | 33 | 126 | 225
Monk-2 | 432 | 6 | 204 | 228
Monk-3 | 432 | 6 | 204 | 228
Saheart | 462 | 9 | 302 | 160
WDBC | 569 | 30 | 357 | 212
Australian | 690 | 14 | 383 | 307
Spambase | 4597 | 57 | 2785 | 1812
Phoneme | 5404 | 5 | 3818 | 1586
Twonorm | 7400 | 21 | 3703 | 3697
Coil2000 | 9822 | 86 | 9236 | 586
Table 2. Accuracy obtained on the UCI datasets using the linear and nonlinear (*) kernel.

Datasets | r | SVM | Pegasos | SG-GPSVM | SG-GPSVM *
Appendicitis | 0 | 85.82 ± 14.33 | 83.82 ± 13.72 | 84.82 ± 13.02 | 87.91 ± 10.45
 | Time (s) | 0.0008 | 0.3198 | 0.4800 | 0.0899
 | 0.05 | 83.00 ± 14.17 | 81.82 ± 15.16 | 84.73 ± 11.34 | 88.91 ± 10.74
 | Time (s) | 0.0020 | 0.3099 | 0.4321 | 0.0883
 | 0.10 | 84.00 ± 8.61 | 85.55 ± 14.96 | 85.64 ± 13.33 | 86.91 ± 9.69
 | Time (s) | 0.0020 | 0.3169 | 0.4393 | 0.0887
 | 0.20 | 80.18 ± 13.51 | 79.18 ± 12.74 | 84.82 ± 13.55 | 86.82 ± 12.38
 | Time (s) | 0.0015 | 0.2962 | 0.4972 | 0.0880
Phoneme | 0 | 77.48 ± 1.41 | 75.98 ± 1.96 | 77.31 ± 1.40 | 77.15 ± 2.64
 | Time (s) | 0.5502 | 0.3102 | 0.4015 | 0.0893
 | 0.05 | 77.42 ± 1.44 | 77.15 ± 2.02 | 77.44 ± 1.56 | 76.87 ± 2.07
 | Time (s) | 0.5153 | 0.3490 | 0.4197 | 0.0891
 | 0.10 | 77.18 ± 1.58 | 76.59 ± 1.71 | 76.48 ± 1.81 | 76.61 ± 2.15
 | Time (s) | 0.5020 | 0.3087 | 0.4083 | 0.0891
 | 0.20 | 76.76 ± 1.69 | 75.61 ± 1.63 | 76.33 ± 1.57 | 76.44 ± 2.33
 | Time (s) | 0.5210 | 0.3016 | 0.3917 | 0.0893
Monk-2 | 0 | 80.32 ± 8.96 | 79.85 ± 8.12 | 80.80 ± 7.66 | 81.48 ± 7.54
 | Time (s) | 0.0134 | 0.2993 | 0.4268 | 0.0891
 | 0.05 | 80.55 ± 8.96 | 79.38 ± 7.93 | 79.16 ± 6.33 | 79.86 ± 7.98
 | Time (s) | 0.0027 | 0.2835 | 0.4038 | 0.0894
 | 0.10 | 79.39 ± 8.55 | 76.60 ± 6.33 | 80.54 ± 7.82 | 80.53 ± 6.56
 | Time (s) | 0.0023 | 0.2886 | 0.3876 | 0.0894
 | 0.20 | 79.40 ± 7.07 | 77.09 ± 7.56 | 80.32 ± 8.14 | 80.11 ± 5.54
 | Time (s) | 0.0025 | 0.3099 | 0.3815 | 0.0900
Monk-3 | 0 | 80.53 ± 4.61 | 77.75 ± 5.68 | 80.30 ± 4.50 | 84.47 ± 4.82
 | Time (s) | 0.0047 | 0.2835 | 0.4899 | 0.0947
 | 0.05 | 80.35 ± 4.75 | 78.01 ± 4.03 | 80.30 ± 4.38 | 84.23 ± 8.64
 | Time (s) | 0.0067 | 0.2932 | 0.4856 | 0.0970
 | 0.10 | 78.69 ± 3.46 | 79.40 ± 4.59 | 80.07 ± 4.50 | 83.31 ± 8.07
 | Time (s) | 0.0031 | 0.2974 | 0.4675 | 0.0956
 | 0.20 | 77.99 ± 4.86 | 77.32 ± 6.28 | 80.32 ± 3.36 | 81.93 ± 6.99
 | Time (s) | 0.0016 | 0.2875 | 0.4918 | 0.0994
Saheart | 0 | 71.83 ± 7.22 | 70.52 ± 6.62 | 72.49 ± 7.45 | 72.32 ± 4.42
 | Time (s) | 0.0047 | 0.3071 | 0.4038 | 1.4542
 | 0.05 | 71.39 ± 7.66 | 71.84 ± 5.01 | 72.48 ± 6.70 | 71.87 ± 3.91
 | Time (s) | 0.0047 | 0.2989 | 0.3816 | 1.4453
 | 0.10 | 72.27 ± 7.11 | 67.98 ± 4.19 | 71.38 ± 8.65 | 71.00 ± 5.38
 | Time (s) | 0.0052 | 0.3249 | 0.4107 | 1.4201
 | 0.20 | 72.50 ± 7.65 | 72.47 ± 8.19 | 71.82 ± 10.10 | 70.80 ± 6.23
 | Time (s) | 0.0051 | 0.3160 | 0.3830 | 1.4503
Ionosphere | 0 | 88.32 ± 6.94 | 84.02 ± 8.32 | 86.60 ± 6.15 | 91.46 ± 5.25
 | Time (s) | 0.0080 | 0.3541 | 0.3913 | 0.0944
 | 0.05 | 87.46 ± 5.75 | 86.03 ± 5.35 | 86.88 ± 6.30 | 91.45 ± 2.86
 | Time (s) | 0.0046 | 0.3176 | 0.4110 | 0.0937
 | 0.10 | 83.74 ± 7.15 | 84.60 ± 6.31 | 86.03 ± 7.83 | 91.17 ± 3.25
 | Time (s) | 0.0076 | 0.3160 | 0.8683 | 0.0941
 | 0.20 | 82.31 ± 7.78 | 85.46 ± 4.88 | 87.73 ± 5.74 | 90.03 ± 5.12
 | Time (s) | 0.0079 | 0.3233 | 0.4122 | 0.0936
Note: * Nonlinear case.
Table 3. The optimal parameters of the conventional SVM, Pegasos, SG-GPSVM, and SG-GPSVM *.

Datasets | SVM: C | Pegasos: C | SG-GPSVM: C, τ1, τ2, ε1, ε2 | SG-GPSVM *: C, τ1, τ2, ε1, ε2, γ
Appendicitis | 1 | 10 | 100, 0.5, 0.75, 0.1, 1 | 10, 1, 1, 0.5, 0.5, 0.1
Phoneme | 1 | 1 | 10, 0.5, 0.5, 0.1, 1 | 100, 1, 1, 0.25, 0.25, 1
Monk-2 | 1 | 0.01 | 100, 1, 0.25, 0.1, 0.25 | 100, 0.25, 0.25, 0.1, 0.25, 0.1
Monk-3 | 1 | 0.1 | 1, 1, 0.25, 0.5, 0.5 | 10, 1, 0.75, 0.25, 0.25, 0.1
Saheart | 1 | 1 | 1, 0.5, 0.25, 0.25, 0.1 | 100, 1, 0.75, 0.25, 0.25, 0.1
Ionosphere | 1 | 1 | 1, 0.75, 0.1, 0.25, 0.1 | 100, 1, 1, 1, 0.5, 0.1
Note: * Nonlinear case.
Table 4. The details of the large-scale datasets.

Datasets | # of Samples | # of Features | # in Negative Class | # in Positive Class
Credit card | 30,000 | 24 | 23,364 | 6636
Skin | 245,057 | 3 | 194,198 | 50,859
Kddcup | 4,898,431 | 41 | 972,781 | 3,925,650
SUSY | 5,000,000 | 18 | 2,712,173 | 2,287,827
Table 5. Classification results on the large-scale UCI datasets (mean accuracy in %).

Datasets | Credit Card | Skin | Kddcup | Susy
SVM | 80.82 ± 0.90 | 92.90 ± 0.16 | * | *
Time (s) | 0.4398 | 289.4454 | * | *
Pegasos | 79.74 ± 2.38 | 93.22 ± 0.78 | 95.31 ± 0.31 | 99.25 ± 1.37
Time (s) | 0.2479 | 0.2676 | 1.9823 | 4.5164
SG-GPSVM | 80.72 ± 0.95 | 93.98 ± 0.38 | 98.82 ± 0.60 | 99.97 ± 0.06
Time (s) | 0.4417 | 0.3751 | 2.0474 | 5.3744
SG-GPSVM * | 77.88 ± 0.87 | 94.06 ± 1.13 | 96.92 ± 0.35 | 99.99 ± 0.02
Time (s) | 1.0881 | 0.9738 | 2.8125 | 2.3218
Note: * Nonlinear case.