A Multiclass Nonparallel Parametric-Margin Support Vector Machine

Abstract: The twin parametric-margin support vector machine (TPMSVM) is an excellent kernel-based nonparallel classifier. However, TPMSVM was originally designed for binary classification, which makes it unsuitable for real-world multiclass applications. Therefore, this paper extends TPMSVM to multiclass classification and proposes a novel K multiclass nonparallel parametric-margin support vector machine (MNP-KSVC). Specifically, our MNP-KSVC enjoys the following characteristics. (1) Under the "one-versus-one-versus-rest" multiclass framework, MNP-KSVC encodes the complicated multiclass learning task into a series of subproblems with the ternary output {−1, 0, +1}. In contrast to the "one-versus-one" or "one-versus-rest" strategy, each subproblem not only focuses on separating the two selected class instances but also considers the side information of the remaining class instances. (2) MNP-KSVC aims to find a pair of nonparallel parametric-margin hyperplanes for each subproblem. As a result, these hyperplanes are closer to their corresponding class and at least one distance away from the other class. At the same time, they attempt to bound the remaining class instances within an insensitive region. (3) MNP-KSVC utilizes a hybrid classification and regression loss joined with regularization to formulate its optimization model. Then, the optimal solutions are derived from the corresponding dual problems. Finally, we conduct numerical experiments to compare the proposed method with four state-of-the-art multiclass models: Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC. Experimental results demonstrate the feasibility and effectiveness of MNP-KSVC in terms of multiclass accuracy and learning time.


Introduction
Data mining has become an essential tool for integrating information technology and industrialization due to the growing size of available databases [1]. One of the main applications in data mining is supervised classification. This aims to assign a label to an unseen instance that is as correct as possible from a given set of classes based on the learning model. Recently, support vector machine (SVM) [2,3] has been a preeminent maximum-margin learning paradigm for data classification. The basic idea of SVM is to find an optimal decision boundary via maximizing the margin between two parallel support hyperplanes. Compared with the neural network, SVM has the following attractive features [3]: (1) the structural risk minimization principle is implemented in SVM to control the upper bound of the generalization error, leading to an excellent generalization ability; (2) the global optimum can be achieved by optimizing a quadratic programming problem (QPP). Furthermore, kernel techniques enable SVM to handle complicated nonlinear learning tasks effectively. During recent decades, SVM has been successfully applied in a wide variety of fields ranging from scene classification [4], fault diagnosis [5,6], EEG classification [7,8], pathological diagnosis [9,10], and bioinformatics [8] to power applications [11].
One limitation in the classical SVM is the strictly parallel requirements for support vector hyperplanes. Namely, parallel hyperplanes are challenging to use to capture a data structure with a cross-plane distribution [12,13], such as "XOR" problems. To alleviate the above issue, the nonparallel SVM models have been proposed in the literature [13][14][15][16][17] during the past years. This approach relaxes the parallel requirement in SVM and seeks nonparallel hyperplanes for different classes. The pioneering work is the generalized eigenvalue proximal SVM (GEPSVM) proposed by Mangasarian and Wild [12], which attempts to find a pair of nonparallel hyperplanes via solving eigenvalue problems (EPs). Subsequently, Jayadeva et al. [13] proposed a novel QPP-type nonparallel model for classification, named TWSVM. The idea of TWSVM is to generate two nonparallel hyperplanes such that each hyperplane is closer to one of the two classes and is at least one apart from the other class. Compared with the classical SVM, the nonparallel SVM models (GEPSVM and TWSVM) have lower computational complexity and better generalization ability. Therefore, in the last few years, they have been studied extensively and developed rapidly, including a least squares version of TWSVM (LSTSVM) [16], structural risk minimization version of TWSVM (TBSVM) [17], ν-PTSVM [18], nonparallel SVM (NPSVM) [19,20], nonparallel projection SVM (NPrSVM) [21], and so on [21][22][23][24][25][26][27][28].
The above nonparallel models were mainly proposed for binary classification problems. However, most real-world applications [29][30][31][32] are related to multiclass classification, such as disease diagnosis, fault detection, image recognition, and text categorization. Therefore, many researchers are interested in extending SVM models from binary to multiclass classification. Generally, the decomposition procedure has been considered to be an effective way to achieve multiclass extensions. Yang et al. [33] proposed a multiple birth SVM (MBSVM) for multiclass classification based on the "one-versus-rest" strategy, which is the first multiclass extension of the nonparallel SVM model. Angulo et al. [34] proposed a "one-versus-one-versus-rest" multiclass framework. In contrast to the "one-versus-one" strategy, it constructs K(K − 1)/2 classifiers with all data points, which can avoid the risk of information loss and class distortion problems. Following this framework, Xu et al. [35] proposed a multiclass extension of TWSVM, termed Twin-KSVC. Results show that Twin-KSVC has a better generalization ability than MBSVM in most cases. Nasiri et al. [36] formulated Twin-KSVC in the least-squares sense to boost learning efficiency and further presented the LST-KSVC model. Lima et al. [32] proposed an improvement on LST-KSVC (ILST-KSVC) with regularization to implement the structural risk minimization principle.
As a successful extension of SVM, the twin parametric-margin support vector machine (TPMSVM) [15] was proposed to pursue a pair of nonparallel parametric-margin hyperplanes. Unlike GEPSVM and TWSVM, each hyperplane in TPMSVM aims to be closer to its class and far away from the other class. The parametric-margin mechanism enables TPMSVM to be suitable for many cases and results in better generalization performance. However, TPMSVM can only deal with binary classification learning tasks. The above motivates us to propose a novel K multiclass nonparallel parametric-margin support vector machine, termed MNP-KSVC. The proposed MNP-KSVC is endowed with the following attractive advantages: • The proposed MNP-KSVC encodes the K multiclass learning task into a series of "one-versus-one-versus-rest" subproblems with all the training instances. Then, it encodes the outputs of the subproblems with the ternary output {−1, 0, +1}, which helps to deal with imbalanced cases. • For each subproblem, MNP-KSVC aims to find a pair of nonparallel parametric-margin hyperplanes to separate the two selected classes together with the remaining classes. Unlike TPMSVM, each parametric-margin hyperplane is closer to its class and at least one distance away from the other class, meanwhile mapping the remaining instances into an insensitive region.
• To implement the empirical risks, MNP-KSVC considers a hybrid classification and regression loss. The hinge loss is utilized to penalize the errors of the two selected class instances, and the ε-insensitive loss penalizes those of the remaining class instances. • Extensive numerical experiments are performed on several multiclass UCI benchmark datasets, and the results are compared with four models (Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC). The comparative results indicate the effectiveness and feasibility of the proposed MNP-KSVC for multiclass classification.
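To make the encoding strategy concrete, the relabeling step behind the "one-versus-one-versus-rest" framework can be sketched in a few lines. This is an illustrative Python sketch (the paper's experiments are in MATLAB; all function names here are hypothetical):

```python
import numpy as np

def ovo_rest_relabel(y, k1, k2):
    """Relabel a multiclass label vector for one (k1, k2)-pair subproblem:
    +1 for class k1, -1 for class k2, and 0 for every remaining class."""
    y = np.asarray(y)
    z = np.zeros(len(y), dtype=int)
    z[y == k1] = 1
    z[y == k2] = -1
    return z

def all_pairs(classes):
    """A K-class task yields K*(K-1)/2 subproblems, one per unordered pair."""
    return [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]

y = [0, 1, 2, 1, 0, 2]
print(ovo_rest_relabel(y, k1=0, k2=1))  # [ 1 -1  0 -1  1  0]
print(all_pairs([0, 1, 2]))             # [(0, 1), (0, 2), (1, 2)]
```

Note that, unlike "one-versus-one", every training instance appears in every subproblem, either as ±1 or as a 0-labeled "rest" instance.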
The remainder of this paper is organized as follows. Section 2 briefly introduces notations and related works. Section 3 proposes our MNP-KSVC; the model optimization is also discussed in Section 3. The nonlinear version of MNP-KSVC is presented in Section 4. Experimental results are described in Section 5, and Section 6 presents a discussion and future work.

Preliminaries
In this section, we first describe the notations used throughout the paper. Then, we briefly revisit the nonparallel classifier TPMSVM and its variants.

Notations
In this paper, scalars are denoted by lowercase italic letters, vectors by lowercase boldface letters, and matrices by uppercase letters. All vectors are column vectors unless transformed to row vectors by a prime superscript (·)′. A vector of zeros of arbitrary dimension is represented by 0. In addition, we denote e as a vector of ones and I as an identity matrix of arbitrary dimension. Moreover, let ‖·‖ stand for the L2-norm.

TPMSVM
The TPMSVM [15] was originally proposed for binary classification with a heteroscedastic noise structure. Let $A \in \mathbb{R}^{l_1\times n}$ and $B \in \mathbb{R}^{l_2\times n}$ collect the positive and negative training instances, respectively, and let $\mathbf{e}_1, \mathbf{e}_2$ be vectors of ones of appropriate dimensions. TPMSVM attempts to seek two nonparallel parametric-margin hyperplanes $f_1(\mathbf{x}) = \mathbf{w}_1^\top\mathbf{x} + b_1 = 0$ and $f_2(\mathbf{x}) = \mathbf{w}_2^\top\mathbf{x} + b_2 = 0$ via the following QPPs:

$$\min_{\mathbf{w}_1,b_1,\boldsymbol{\xi}_1}\ \frac{1}{2}\|\mathbf{w}_1\|^2 + \frac{\nu_1}{l_2}\mathbf{e}_2^\top (B\mathbf{w}_1 + b_1\mathbf{e}_2) + \frac{c_1}{l_1}\mathbf{e}_1^\top \boldsymbol{\xi}_1 \quad \text{s.t. } A\mathbf{w}_1 + b_1\mathbf{e}_1 \ge -\boldsymbol{\xi}_1,\ \boldsymbol{\xi}_1 \ge \mathbf{0}, \tag{1}$$

and

$$\min_{\mathbf{w}_2,b_2,\boldsymbol{\xi}_2}\ \frac{1}{2}\|\mathbf{w}_2\|^2 - \frac{\nu_2}{l_1}\mathbf{e}_1^\top (A\mathbf{w}_2 + b_2\mathbf{e}_1) + \frac{c_2}{l_2}\mathbf{e}_2^\top \boldsymbol{\xi}_2 \quad \text{s.t. } -(B\mathbf{w}_2 + b_2\mathbf{e}_2) \ge -\boldsymbol{\xi}_2,\ \boldsymbol{\xi}_2 \ge \mathbf{0}, \tag{2}$$

where ν1, ν2, c1, c2 are positive parameters, and the decision hyperplane is half of the sum of the two hyperplanes. To obtain the solutions of problems (1) and (2), one needs to resort to the dual problems

$$\min_{\boldsymbol{\alpha}_1}\ \frac{1}{2}\boldsymbol{\alpha}_1^\top AA^\top \boldsymbol{\alpha}_1 - \frac{\nu_1}{l_2}\mathbf{e}_2^\top BA^\top \boldsymbol{\alpha}_1 \quad \text{s.t. } \mathbf{e}_1^\top \boldsymbol{\alpha}_1 = \nu_1,\ \mathbf{0} \le \boldsymbol{\alpha}_1 \le \frac{c_1}{l_1}\mathbf{e}_1, \tag{3}$$

and

$$\min_{\boldsymbol{\alpha}_2}\ \frac{1}{2}\boldsymbol{\alpha}_2^\top BB^\top \boldsymbol{\alpha}_2 - \frac{\nu_2}{l_1}\mathbf{e}_1^\top AB^\top \boldsymbol{\alpha}_2 \quad \text{s.t. } \mathbf{e}_2^\top \boldsymbol{\alpha}_2 = \nu_2,\ \mathbf{0} \le \boldsymbol{\alpha}_2 \le \frac{c_2}{l_2}\mathbf{e}_2. \tag{4}$$

Then, the solutions (w1, b1) and (w2, b2) can be calculated from the dual solutions of problems (3) and (4) according to the Karush-Kuhn-Tucker (KKT) conditions [3], where I SV1 and I SV2 denote the index sets of support vectors. Note that TPMSVM can capture more complex heteroscedastic error structures via parametric-margin hyperplanes compared with TWSVM [13]. However, the objective of TPMSVM is not strictly convex in the variable b, leading to a lack of a unique solution. Moreover, the optimization problems (1) and (2) are only designed for binary classification tasks and thus are unsuitable for many real-world multiclass learning applications.
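As a small illustration of the TPMSVM decision rule described above (the separating hyperplane is half of the sum of the two parametric-margin hyperplanes), consider the following Python sketch; the hyperplane parameters here are hypothetical stand-ins for values that would come from the dual problems via the KKT conditions:

```python
import numpy as np

# Hypothetical solved hyperplane parameters (w1, b1) and (w2, b2);
# in TPMSVM these are recovered from the dual solutions.
w1, b1 = np.array([1.0, 0.5]), -0.2   # "+" parametric-margin hyperplane
w2, b2 = np.array([0.8, -0.3]), 0.4   # "-" parametric-margin hyperplane

def tpmsvm_predict(x):
    # The decision hyperplane is the mean of the two margin hyperplanes.
    f = 0.5 * ((w1 + w2) @ x + (b1 + b2))
    return 1 if f >= 0 else -1

print(tpmsvm_predict(np.array([1.0, 1.0])))    # 1
print(tpmsvm_predict(np.array([-1.0, -1.0])))  # -1
```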

Model Formulation
To address the above issues in TPMSVM, this subsection proposes a novel K multiclass nonparallel parametric-margin support vector machine, termed MNP-KSVC. Inspired by the "hybrid classification and regression" learning paradigm [35], MNP-KSVC decomposes the complicated K multiclass learning task into a series of "one-versus-one-versus-rest" subproblems. Each subproblem focuses on the separation of the two selected classes together with the remaining classes. Here, we utilize {−1, 1} to represent the labels of the two selected classes and 0 to label the rest. Namely, the subproblem is encoded with the ternary output {−1, 0, 1}. The main idea of MNP-KSVC is to find a pair of nonparallel parametric-margin hyperplanes $f_1(\mathbf{x}) = \mathbf{w}_1^\top\mathbf{x} + b_1 = 0$ and $f_2(\mathbf{x}) = \mathbf{w}_2^\top\mathbf{x} + b_2 = 0$ for each subproblem, such that each one is proximal to its corresponding class while as far as possible from the other class on one side. Moreover, the remaining classes are restricted to a region between these hyperplanes. Formally, we use $I_k$ to express the set of indices for instances belonging to label k in the subproblem, where k ∈ {−1, 0, 1}. Inspired by TPMSVM [15], our MNP-KSVC considers the following two loss functions for the above ternary-output learning problem:

$$R_{emp}^{1} = \nu_1 \sum_{j\in I_-} f_1(\mathbf{x}_j) + c_1 \sum_{i\in I_+} \max\bigl(0,\ 1 - f_1(\mathbf{x}_i)\bigr) + c_3 \sum_{l\in I_0} \max\bigl(0,\ f_1(\mathbf{x}_l) - (1-\varepsilon)\bigr)$$

and

$$R_{emp}^{2} = -\nu_2 \sum_{i\in I_+} f_2(\mathbf{x}_i) + c_2 \sum_{j\in I_-} \max\bigl(0,\ 1 + f_2(\mathbf{x}_j)\bigr) + c_4 \sum_{l\in I_0} \max\bigl(0,\ -(1-\varepsilon) - f_2(\mathbf{x}_l)\bigr),$$

where ν1, ν2, c1, c2, c3, c4 > 0 are penalty parameters and ε ∈ (0, 1] is a margin parameter. Introducing the regularization term $\frac{1}{2}(\|\mathbf{w}\|^2 + b^2)$ yields the primal problems of MNP-KSVC:

$$\begin{aligned} \min_{\mathbf{w}_1,b_1,\boldsymbol{\xi}_1,\boldsymbol{\eta}_1}\ & \frac{1}{2}\bigl(\|\mathbf{w}_1\|^2 + b_1^2\bigr) + \nu_1 \sum_{j\in I_-}\bigl(\mathbf{w}_1^\top\mathbf{x}_j + b_1\bigr) + c_1 \sum_{i\in I_+}\xi_{1i} + c_3 \sum_{l\in I_0}\eta_{1l} \\ \text{s.t. } & \mathbf{w}_1^\top\mathbf{x}_i + b_1 \ge 1 - \xi_{1i},\ \xi_{1i}\ge 0,\ i\in I_+, \\ & \mathbf{w}_1^\top\mathbf{x}_l + b_1 \le (1-\varepsilon) + \eta_{1l},\ \eta_{1l}\ge 0,\ l\in I_0, \end{aligned} \tag{10}$$

and

$$\begin{aligned} \min_{\mathbf{w}_2,b_2,\boldsymbol{\xi}_2,\boldsymbol{\eta}_2}\ & \frac{1}{2}\bigl(\|\mathbf{w}_2\|^2 + b_2^2\bigr) - \nu_2 \sum_{i\in I_+}\bigl(\mathbf{w}_2^\top\mathbf{x}_i + b_2\bigr) + c_2 \sum_{j\in I_-}\xi_{2j} + c_4 \sum_{l\in I_0}\eta_{2l} \\ \text{s.t. } & -\bigl(\mathbf{w}_2^\top\mathbf{x}_j + b_2\bigr) \ge 1 - \xi_{2j},\ \xi_{2j}\ge 0,\ j\in I_-, \\ & -\bigl(\mathbf{w}_2^\top\mathbf{x}_l + b_2\bigr) \le (1-\varepsilon) + \eta_{2l},\ \eta_{2l}\ge 0,\ l\in I_0, \end{aligned} \tag{11}$$

where ξ1, η1, ξ2, η2 are non-negative slack vectors. To deliver the mechanism of MNP-KSVC, we now give the following analysis and geometrical explanation for problem (10): • The first term is the L2-norm of w1 and b1. Minimizing it regulates the model complexity of MNP-KSVC and avoids over-fitting. Furthermore, this regularization term makes the QPPs strictly convex, leading to a unique solution. • The second term is the sum of the projection values of the −1 labeled instances $\mathbf{x}_{j\in I_-}$ on f1.
Optimizing this term leads instances $\mathbf{x}_{j\in I_-}$ to be as far as possible from the +1 labeled parametric-margin hyperplane. • The third term with the first constraint requires the projection values of the +1 labeled instances $\mathbf{x}_{i\in I_+}$ on the hyperplane f1 to be not less than 1. Otherwise, a slack variable ξ1i is introduced to measure the error when the constraint is violated. • The last term with the second constraint requires the projection values of the remaining 0 labeled instances $\mathbf{x}_{l\in I_0}$ on the hyperplane f1 to be not more than 1 − ε. Otherwise, a slack variable η1l is utilized to measure the corresponding error. Optimizing this term keeps instances $\mathbf{x}_{l\in I_0}$ at least ε away from the +1 labeled instances. Moreover, ε controls the margin between the "+" and "0" labeled instances.
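The two penalty terms just described, hinge for the +1 instances and one-sided ε-insensitive for the 0-labeled instances, can be sketched numerically. This is an illustrative Python snippet with hypothetical projection values, not the paper's solver:

```python
import numpy as np

def hinge(f_pos):
    """Hinge penalty for +1 instances: violated when f1(x) < 1."""
    return np.maximum(0.0, 1.0 - f_pos)

def eps_insensitive_upper(f_rest, eps):
    """One-sided epsilon-insensitive penalty for 0-labeled instances:
    violated when f1(x) exceeds 1 - eps."""
    return np.maximum(0.0, f_rest - (1.0 - eps))

f_pos = np.array([1.5, 0.4])    # projections of +1 instances on f1
f_rest = np.array([0.2, 0.95])  # projections of 0 instances on f1
print(hinge(f_pos))                        # [0.  0.6]
print(eps_insensitive_upper(f_rest, 0.2))  # [0.   0.15]
```

The second +1 instance violates the margin (0.4 < 1) and is charged 0.6, while the second 0-labeled instance crosses the 1 − ε = 0.8 boundary and is charged 0.15.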
The geometrical explanation for problem (11) is similar. For the sake of simplicity, denote $A = \{\mathbf{x}\}_{i\in I_+}$, $B = \{\mathbf{x}\}_{j\in I_-}$, and $C = \{\mathbf{x}\}_{l\in I_0}$ as the instances belonging to the +1, −1, and 0 labels, respectively, and let $\mathbf{u} = [\mathbf{w}; b]$ with the augmented matrices $\bar{A} = [A, \mathbf{e}_+]$, $\bar{B} = [B, \mathbf{e}_-]$, and $\bar{C} = [C, \mathbf{e}_0]$. Then, the matrix formulations of problems (10) and (11) can be expressed as

$$\min_{\mathbf{u}_1,\boldsymbol{\xi}_1,\boldsymbol{\eta}_1}\ \frac{1}{2}\|\mathbf{u}_1\|^2 + \nu_1\mathbf{e}_-^\top \bar{B}\mathbf{u}_1 + c_1\mathbf{e}_+^\top\boldsymbol{\xi}_1 + c_3\mathbf{e}_0^\top\boldsymbol{\eta}_1 \quad \text{s.t. } \bar{A}\mathbf{u}_1 \ge \mathbf{e}_+ - \boldsymbol{\xi}_1,\ \bar{C}\mathbf{u}_1 \le (1-\varepsilon)\mathbf{e}_0 + \boldsymbol{\eta}_1,\ \boldsymbol{\xi}_1,\boldsymbol{\eta}_1 \ge \mathbf{0}, \tag{12}$$

and

$$\min_{\mathbf{u}_2,\boldsymbol{\xi}_2,\boldsymbol{\eta}_2}\ \frac{1}{2}\|\mathbf{u}_2\|^2 - \nu_2\mathbf{e}_+^\top \bar{A}\mathbf{u}_2 + c_2\mathbf{e}_-^\top\boldsymbol{\xi}_2 + c_4\mathbf{e}_0^\top\boldsymbol{\eta}_2 \quad \text{s.t. } -\bar{B}\mathbf{u}_2 \ge \mathbf{e}_- - \boldsymbol{\xi}_2,\ -\bar{C}\mathbf{u}_2 \le (1-\varepsilon)\mathbf{e}_0 + \boldsymbol{\eta}_2,\ \boldsymbol{\xi}_2,\boldsymbol{\eta}_2 \ge \mathbf{0}. \tag{13}$$

In what follows, we discuss the solutions of problems (12) and (13).

Model Optimization
To obtain solutions to problems (12) and (13), we first derive their dual problems by Theorem 1.

Decision Rule
As mentioned in Section 3.1, our MNP-KSVC decomposes the multiclass learning task into a series of subproblems with the "one-versus-one-versus-rest" strategy. Specifically, we construct K(K − 1)/2 MNP-KSVC classifiers for K-class classification. For each (k1, k2)-pair classifier, we relabel the dataset with ternary outputs {−1, 0, +1} according to the two selected classes and the remaining class instances. Namely, we assign the labels "+1", "−1", and "0" to instances belonging to the k1 class, the k2 class, and all the remaining classes, respectively. Then, we train the classifier on the relabeled dataset by solving problems (12) and (13).
As for the decision, we predict the label of an unseen instance x by the voting strategy over the ensemble results of the K(K − 1)/2 MNP-KSVC classifiers. Namely, we determine the vote of each classifier according to the region in which x is located. Taking the (k1, k2)-pair classifier as an example, we first evaluate the parametric-margin hyperplanes f1 and f2 w.r.t. u = [w; b]. If x is located above the "+" hyperplane, i.e., f1(x) > 1 − ε (equivalently, $\mathbf{u}_1^\top\bar{\mathbf{x}} > 1-\varepsilon$), we vote for the k1 class. On the other hand, if x is located below the "−" hyperplane, i.e., f2(x) < −1 + ε, we vote for the k2 class. Otherwise, x belongs to the remaining classes and the classifier casts no vote. In summary, the decision function for the (k1, k2)-pair classifier can be expressed as

$$g_{k_1,k_2}(\mathbf{x}) = \begin{cases} k_1, & f_1(\mathbf{x}) > 1-\varepsilon, \\ k_2, & f_2(\mathbf{x}) < -1+\varepsilon, \\ \text{no vote}, & \text{otherwise}. \end{cases}$$

Finally, the given unseen instance x is assigned to the class label that receives the most votes. Overall, the whole procedure of MNP-KSVC is summarized in Algorithm 1 and illustrated in Figure 1.
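The per-pair vote and the final majority decision described above can be sketched as follows. This is an illustrative Python sketch with hypothetical projection values; the function names are not from the paper:

```python
def pair_vote(f1_x, f2_x, k1, k2, eps):
    """Ternary decision of one (k1, k2)-pair classifier: vote k1 when x lies
    above the '+' hyperplane, k2 when below the '-' hyperplane, else no vote."""
    if f1_x > 1.0 - eps:
        return k1
    if f2_x < -1.0 + eps:
        return k2
    return None  # x falls into the insensitive region of the rest classes

def predict(votes_per_pair, classes):
    """Majority vote over all K(K-1)/2 pair classifiers."""
    counts = {c: 0 for c in classes}
    for v in votes_per_pair:
        if v is not None:
            counts[v] += 1
    return max(counts, key=counts.get)

# Hypothetical projections (f1(x), f2(x)) from the three pair classifiers of a 3-class task.
votes = [pair_vote(1.2, -0.1, 0, 1, eps=0.3),
         pair_vote(0.9, -0.2, 0, 2, eps=0.3),
         pair_vote(-0.4, -1.1, 1, 2, eps=0.3)]
print(votes)                      # [0, 0, 2]
print(predict(votes, [0, 1, 2]))  # 0
```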

Model Extension to the Nonlinear Case
In practice, the linear classifier is sometimes not suitable for many real-world nonlinear learning tasks [18,21,30]. One of the effective solutions is to map linearly non-separable instances into the feature space. Thus, in this section, we focus on the nonlinear extension of MNP-KSVC.
In what follows, we define the kernel operation for MNP-KSVC.

Definition 1.
Suppose that K(·, ·) is an appropriate kernel function; then, the kernel operation in matrix form is defined as K(A, B), whose ij-th element can be computed by $K(A,B)_{ij} = K(\mathbf{a}_i, \mathbf{b}_j) = \varphi(\mathbf{a}_i)^\top\varphi(\mathbf{b}_j)$, where $\mathbf{a}_i$ and $\mathbf{b}_j$ are the i-th and j-th rows of A and B, respectively. Based on this definition, we can derive problems (44) and (45), which are the dual problems of the nonlinear primal problems (40) and (41), respectively.
The procedure of the nonlinear MNP-KSVC is similar to that of the linear one, but with the following minor modifications in Algorithm 1: • In contrast to some existing nonparallel SVMs, we do not need to consider an extra kernel-generated technique, since only inner products appear in the dual problems (14) and (15); replacing these inner products with kernel evaluations directly yields the dual formulation of the nonlinear MNP-KSVC in (44) and (45). • Once we obtain the solutions q1 = [α1; β1] and q2 = [α2; β2] to problems (44) and (45), respectively, the corresponding primal solutions u1 and u2 in the feature space can be formulated accordingly. • For an unseen instance x, we construct the decision function of the (k1, k2)-pair nonlinear MNP-KSVC classifier as in the linear case, where the auxiliary functions $f_{\varphi 1}(\mathbf{x})$ and $f_{\varphi 2}(\mathbf{x})$ in the feature space can be expressed as

$$f_{\varphi 1}(\mathbf{x}) = K(\mathbf{u}_1, \mathbf{x}) = \boldsymbol{\alpha}_1^\top K(A,\mathbf{x}) - \boldsymbol{\beta}_1^\top K(C,\mathbf{x}) - \nu_1\mathbf{e}_-^\top K(B,\mathbf{x}) \tag{60}$$

and

$$f_{\varphi 2}(\mathbf{x}) = K(\mathbf{u}_2, \mathbf{x}) = -\boldsymbol{\alpha}_2^\top K(B,\mathbf{x}) + \boldsymbol{\beta}_2^\top K(C,\mathbf{x}) + \nu_2\mathbf{e}_+^\top K(A,\mathbf{x}). \tag{61}$$
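The kernel operation of Definition 1 is straightforward to compute in practice. Below is an illustrative Python sketch of the kernel matrix for the RBF kernel used later in the experiments (the paper's code is in MATLAB; this helper is an assumption for illustration):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Kernel matrix whose ij-th element is exp(-gamma * ||x_i - y_j||^2),
    matching the kernel operation K(A, B) of Definition 1."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(A, A, gamma=0.5)
print(K)  # diagonal entries are 1; off-diagonal entries equal exp(-0.5)
```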

Experimental Setting
To demonstrate the validity of MNP-KSVC, we perform extensive experiments on several benchmark datasets that are commonly used for testing machine learning algorithms. In the experiments, we focus on comparing MNP-KSVC with four state-of-the-art multiclass models-Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC-detailed as follows: • Multi-SVM [38]: The idea is similar to the "one-versus-all" SVM [3]. However, it generates K binary SVM classifiers by solving one large dual QPP. That is, the k-th classifier is trained with the k-th class instances encoded with positive labels and the remaining class instances with negative labels. Then, the label of an unseen instance is assigned by the "voting" scheme. The penalty parameter for each classifier in Multi-SVM is c. • MBSVM [33]: It is the multiclass extension of the binary TWSVM, which is based on the "one-versus-all" strategy. MBSVM aims to find K nonparallel hyperplanes by solving K QPPs simultaneously. Specifically, the k-th class instances are kept as far as possible from the k-th hyperplane, while the remaining instances are proximal to it. An unseen instance is assigned the label of the hyperplane from which it lies farthest. The penalty parameter for each classifier in MBSVM is c. • MTPMSVM: Inspired by MBSVM [33], we use the "one-versus-all" strategy to implement the multiclass version of TPMSVM [15] as a baseline. In contrast to MBSVM, it aims to find K parametric-margin hyperplanes, such that each hyperplane is closer to its corresponding class instances and as far as possible from the remaining class instances. The penalty parameters for each classifier in MTPMSVM are (ν, c). • Twin-KSVC [35]: It is another novel multiclass extension of TWSVM. Twin-KSVC evaluates all the training instances in a "one-versus-one-versus-rest" structure with the ternary output {−1, 0, +1}. It aims to find a pair of nonparallel hyperplanes for each pair of classes selected from the K classes.
The remaining class instances are mapped into a region within these two nonparallel hyperplanes. The penalty parameters for each classifier in Twin-KSVC are (c 1 , c 2 , c 3 , c 4 , ε).
All methods are implemented in MATLAB on a PC with an Intel Core i7 processor and 32 GB RAM. The quadratic programming problems (QPPs) of all the classifiers are solved by the "quadprog" function in MATLAB. Now, we describe the setting of our experiments: • Similar to [35,38], we use the multiclass accuracy to measure each classifier, defined as

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I\bigl(\hat{g}(\mathbf{x}_i) = y_i\bigr),$$

where N is the size of the testing set, ĝ(x) is the prediction of the classifier, and I(·) is an indicator function, which returns 1 if the class matches and 0 otherwise. Moreover, we adopt the training time to represent the learning efficiency. • To reduce the complexity of parameter selection for the multiclass classifiers, we use the same parameter setting for each learning subproblem. Specifically, we set c in Multi-SVM and MBSVM, (ν, c) in MTPMSVM, (c1 = c2, c3 = c4, ε) in Twin-KSVC, and (ν1 = ν2, c1 = c2, c3 = c4, ε) in MNP-KSVC to be the same for all subproblems. For the nonlinear case, the RBF kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2)$ is employed. • It is usually unknown beforehand which parameters are optimal for the classifiers at hand. Thus, we employ the 10-fold cross-validation technique [3] for parameter selection. In detail, each dataset is randomly partitioned into 10 subsets with similar sizes and distributions. Then, the union of 9 subsets is used as the training set, while the remaining one is used as the validation set. Furthermore, we apply the grid-based approach [3] to obtain the optimal parameters of each classifier. Namely, the penalty parameters c, c1, c2, ν, ν1 and the kernel parameter γ are selected from {2^i | i = −6, ..., 6}, while the margin parameter ε is chosen from {0.1, 0.2, ..., 0.9}. Once selected, we return them to learn the final decision function.
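The evaluation metric and search grids above can be sketched compactly. This is an illustrative Python snippet (the experiments themselves run in MATLAB; the toy labels below are hypothetical):

```python
import numpy as np

def multiclass_accuracy(y_true, y_pred):
    """Fraction of test instances whose predicted class matches the truth,
    i.e. (1/N) * sum of the indicator I(g_hat(x_i) == y_i)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# Grids used for model selection: penalty/kernel parameters from 2^-6..2^6,
# margin parameter eps from 0.1..0.9.
penalty_grid = [2.0**i for i in range(-6, 7)]
eps_grid = [round(0.1 * i, 1) for i in range(1, 10)]

print(multiclass_accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
print(len(penalty_grid), len(eps_grid))                  # 13 9
```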

Result Comparison and Discussion
For comparison, we consider 10 real-world multiclass datasets from the UCI machine learning repository (the UCI datasets are available at http://archive.ics.uci.edu/ml (accessed on 10 September 2021)), whose statistics are summarized in Table 1. These datasets represent a wide range of domains (including phytology, bioinformatics, pathology, and so on), sizes (from 178 to 2175), features (from 4 to 34), and classes (from 3 to 10). All datasets are normalized before training such that features are scaled into [−1, 1]. Moreover, we carry out the experiments as follows. Firstly, each dataset is divided into 2 subsets: 70% for training and 30% for testing. Then, we train the classifiers with 10-fold cross-validation executions. Finally, we predict the testing set with the fine-tuned classifiers. Each experiment is repeated 10 times.

Table 1. Statistics of the benchmark UCI datasets.

Dataset        #Instances  #Train  #Test  #Features  #Classes
Balance        625         438     187    4          3
Ecoli          327         229     98     7          5
Iris           150         105     45     4          3
Glass          214         150     64     13         6
Wine           178         125     53     13         3
Thyroid        215         150     65     5          3
Dermatology    358         251     107    34         6
Shuttle        2175        1522    653    9          5
Contraceptive  1473        1031    442    9          3
Pen Based      1100        770     330    16         10

Tables 2 and 3 summarize the learning results of the proposed MNP-KSVC model and the compared methods using linear and nonlinear kernels, respectively. The results on the 10 benchmark datasets include the mean and standard deviation of the testing multiclass accuracy (%), where the best performance is highlighted in bold. The comparison results reveal the following: • MTPMSVM is another multiclass extension, which is based on the "one-versus-rest" strategy. With the help of the "hybrid classification and regression" learning paradigm, our MNP-KSVC can learn more multiclass discriminative information. • Furthermore, we count the number of Wins/Losses (W/L) against each compared classifier over all datasets for both the linear and nonlinear cases, listed at the bottom of Tables 2 and 3.
The results indicate that MNP-KSVC achieves the best results against others in terms of both W/L and average accuracy.
To further analyze the performance, we employ the Friedman test, whose statistic is computed as

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{i=1}^{k} r_i^2 - \frac{k(k+1)^2}{4}\right],$$

where N is the number of datasets, k is the number of classifiers, and r_i is the average rank on the N datasets of the i-th model. For the linear case, we compute the term $\sum_{i=1}^{k} r_i^2$ from the average ranks in Table 2 and then obtain the statistic

$$F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2},$$

which is distributed according to the F-distribution with (k − 1, (k − 1)(N − 1)) = (4, 36) degrees of freedom. Moreover, we compute the p-value, which rejects the null hypothesis at the level of significance α = 0.05. Similarly, we calculate the statistic for the nonlinear case, as summarized in Table 4. The results reject the null hypothesis for both the linear and nonlinear cases and reveal the existence of significant differences in the performances of the classifiers. Furthermore, we record the average learning time of each classifier in the above UCI dataset experiments, as shown in Figures 2 and 3. The results show that our MNP-KSVC is faster than Multi-SVM and Twin-KSVC, while slightly slower than MBSVM and MTPMSVM, for both the linear and nonlinear cases. Multi-SVM performs the slowest of all classifiers because it needs to solve larger problems than the nonparallel-based classifiers. Moreover, the structure of the Hessian matrix of the dual QPPs in MNP-KSVC avoids the time-costly matrix inversion, leading to greater efficiency than Twin-KSVC. Overall, the above results confirm the feasibility of MNP-KSVC.
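The Friedman statistics used above are easy to reproduce. The following Python sketch computes them for k = 5 classifiers on N = 10 datasets; the average ranks are illustrative placeholders, not the paper's values:

```python
import numpy as np

def friedman_statistics(avg_ranks, N):
    """Friedman chi-square and its F-statistic for k classifiers compared on
    N datasets, given each classifier's average rank r_i."""
    r = np.asarray(avg_ranks, dtype=float)
    k = len(r)
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(r**2) - k * (k + 1) ** 2 / 4.0)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    dof = (k - 1, (k - 1) * (N - 1))  # (4, 36) for k = 5, N = 10
    return chi2, F, dof

# Hypothetical average ranks (they must sum to k(k+1)/2 = 15 for k = 5).
chi2, F, dof = friedman_statistics([1.6, 3.2, 3.4, 3.0, 3.8], N=10)
print(dof)  # (4, 36)
```

The resulting F value would then be compared against the critical value of the F(4, 36) distribution at the chosen significance level.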

Discussion and Future Work
This paper proposes a novel K multiclass nonparallel parametric-margin support vector machine, termed MNP-KSVC. Specifically, our MNP-KSVC has the following attractive merits: • For the K-class learning task, our MNP-KSVC first transforms the complicated multiclass problem into K(K − 1)/2 subproblems via a "one-versus-one-versus-rest" strategy. Each subproblem focuses on separating the two selected classes and the rest of the classes. That is, we utilize {−1, 1} to represent the labels of the two selected classes and 0 to label the rest. Unlike the "one-versus-all" strategy used in Multi-SVM, MBSVM, and MTPMSVM, this encoding strategy can alleviate the imbalanced issues that sometimes occur in multiclass learning [32,35]. • For each subproblem, our MNP-KSVC aims to learn a pair of nonparallel parametric-margin hyperplanes (36) with the ternary encoding {−1, 0, +1}. These parametric-margin hyperplanes are closer to their corresponding class and at least one distance away from the other class. Meanwhile, they restrict the rest of the instances to an insensitive region. A hybrid classification and regression loss joined with regularization is further utilized to formulate the optimization problems (10) and (11) of MNP-KSVC. • Moreover, the nonlinear extension is also presented to deal with nonlinear multiclass learning tasks. In contrast to MBSVM [33] and Twin-KSVC [35], the linear and nonlinear models of MNP-KSVC are consistent: applying the linear kernel to the nonlinear problems (44) and (45) results in the same formulations as the original linear problems (14) and (15). There are several interesting directions for future research, such as extensions to semi-supervised learning [26,42], multi-label learning [22], and privilege-information learning [43].

Conflicts of Interest:
The authors declare no conflict of interest.