Minimal Complexity Support Vector Machines for Pattern Classification

Minimal complexity machines (MCMs) minimize the VC (Vapnik-Chervonenkis) dimension to obtain high generalization abilities. However, because the regularization term is not included in the objective function, the solution is not unique. In this paper, to solve this problem, we discuss fusing the MCM and the standard support vector machine (L1 SVM). This is realized by minimizing the maximum margin in the L1 SVM. We call the machine the minimal complexity L1 SVM (ML1 SVM). The associated dual problem has twice the number of dual variables, and the ML1 SVM is trained by alternately optimizing the dual variables associated with the regularization term and with the VC dimension. We compare the ML1 SVM with other types of SVMs including the L1 SVM using several benchmark datasets and show that the ML1 SVM performs better than or comparably to the L1 SVM.


Introduction
In the support vector machine (SVM) [1,2], training data are mapped into a high-dimensional feature space, and in that space, the separating hyperplane is determined so that the nearest training data of both classes are maximally separated. Here, the distance between a data sample and the separating hyperplane is called the margin. Thus, the SVM is trained so that the minimum margin is maximized.
Motivated by the success of SVMs in real world applications, many SVM-like classifiers have been developed to improve the generalization ability. The ideas of extensions lie in incorporating the data distribution (or margin distribution) to the classifiers because the SVM only considers the data that are around the separating hyperplane. If the distribution of one class is different from that of the other class, the separating hyperplane with the same minimum margin for both classes may not be optimal.
Yet another approach controls the overall margins instead of the minimum margin [16][17][18][19][20][21][22]. In [16], only the average margin is maximized. While the architecture is very simple, the generalization ability is inferior to that of the SVM [17,19]. To improve the generalization ability, the equality constraints were added in [19], but this results in the least squares SVM (LS SVM).
In [17], a large margin distribution machine (LDM) was proposed, in which the average margin is maximized and the margin variance is minimized. The LDM is formulated based on the SVM with inequality constraints. While the generalization ability is better than that of the SVM, the problem is that the LDM has one more hyperparameter than the SVM does. To cope with this problem, in [20,21], the unconstrained LDM (ULDM) was proposed, which has the same number of hyperparameters as the SVM and a generalization ability comparable to that of the LDM and the SVM.
The generalization ability of the SVM can be analyzed by the VC (Vapnik-Chervonenkis) dimension [1] and the generalization ability will be improved if the radius-margin ratio is minimized, where the radius is the radius of the minimum hypersphere that encloses all the training data in the feature space. The minimum hypersphere changes if the mapping function is changed during model selection, where all the parameter values including the value for the mapping function are determined, or feature selection is performed during training a classifier. So, minimization of the radius-margin ratio can be utilized in selecting the optimal mapping function [23,24]. In [25,26], feature selection is carried out during training by minimizing the radius-margin ratio.
If the center of the hypersphere is assumed to be at the origin, the hypersphere can be minimized for a given feature space as discussed in [27,28]. The minimal complexity machine (MCM) was derived based on this assumption. In the MCM, the VC dimension is minimized by minimizing the upper bound of the soft-margin constraints for the decision function. Because the regularization term is not included, the MCM is trained by linear programming. The quadratic version of the MCM (QMCM) tries to minimize the VC dimension directly [28]. The generalization performance of the MCM is shown to be superior to that of the SVM, but according to our analysis [29], the solution is non-unique and the generalization ability is not better than that of the SVM. The problem of non-uniqueness of the solution is solved by adding the regularization term in the objective function of the MCM, which is a fusion of the MCM and the linear programming SVM (LP SVM) called MLP SVM.
In [30], to improve the generalization ability of the standard SVM (L1 SVM), we proposed fusing the MCM and the L1 SVM, which is the minimal complexity L1 SVM (ML1 SVM). We also proposed the ML1 v SVM, whose component is more similar to the L1 SVM. In computer experiments using RBF (radial basis function) kernels, the proposed classifiers generalized better than or comparably to the L1 SVM.
In this paper we discuss the ML1 SVM and ML1 v SVM in more detail, propose their training methods, prove their convergence, and demonstrate the effectiveness of the proposed classifiers using polynomial kernels in addition to RBF kernels.
The ML1 SVM is obtained by adding the upper bound on the decision function and the upper bound minimization term in the objective function of the L1 SVM. We show that this corresponds to minimizing the maximum margin. We derive the dual form of the ML1 SVM with one set of variables associated with the soft-margin constraints and the other set, upper-bound constraints. We then decompose the dual ML1 SVM into two subproblems: one for the soft-margin constraints, which is similar to the dual L1 SVM, and the other for the upper-bound constraints. These subproblems include neither the bias term nor the upper bound. Thus, for a convergence check, we derive the exact KKT (Karush-Kuhn-Tucker) conditions that do not include the bias term and the upper bound. The second subproblem is different from the first subproblem in that it includes the inequality constraint on the sum of dual variables. To remove this, we change the inequality constraint into two equality constraints and obtain ML1 v SVM.
We consider training the ML1 SVM and ML1 v SVM by optimizing the first and the second subprograms alternately. Because the ML1 SVM and ML1 v SVM are very similar to the L1 SVM, we discuss the training method based on sequential minimal optimization (SMO) fused with Newton's method [31]. We also show the convergence proof of the proposed training methods. By computer experiments using polynomial kernels, in addition to RBF kernels, we compare the ML1 SVM and ML1 v SVM with other classifiers including the L1 SVM.
In Section 2, we summarize the architectures of the L1 SVM and the MCM. In Section 3, we discuss the architecture of the ML1 SVM and derive the dual form and the optimality conditions of the solution. Then, we discuss the architecture of the ML1 v SVM. In Section 4, we discuss the training method that is an extension of SMO fused with Newton's method [31] and the working set selection method. We also show the convergence proof of the proposed method. In Section 5, we evaluate the generalization ability of the ML1 SVM and other SVM-like classifiers using two-class and multiclass problems.

L1 Support Vector Machines and Minimal Complexity Machines
In this section, we briefly explain the architectures of the L1 SVM and the MCM [27]. Then we discuss the problem of non-unique solutions of the MCM and one approach to solving the problem [29].

L1 Support Vector Machines
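Let the $M$ training data and their labels be $\{x_i, y_i\}$ ($i = 1, \ldots, M$), where $x_i$ is an $n$-dimensional input vector and $y_i = 1$ for Class 1 and $-1$ for Class 2. The input space is mapped into the $l$-dimensional feature space by the mapping function $\phi(x)$, and the separating hyperplane is constructed in the feature space. The decision function $f(x)$ is given by

$$f(x) = w^\top \phi(x) + b, \qquad (1)$$

where $w$ is the $l$-dimensional coefficient vector and $b$ is the bias term. The primal form of the L1 SVM is given by

$$\min \; Q(w, b, \xi) = \frac{1}{2} w^\top w + C \sum_{i=1}^{M} \xi_i \qquad (2)$$

$$\text{subject to} \quad y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for } i = 1, \ldots, M, \qquad (3)$$

where $\xi = (\xi_1, \ldots, \xi_M)^\top$, $\xi_i$ is the slack variable for $x_i$, and $C$ is the margin parameter that determines the trade-off between the maximization of the margin and the minimization of the classification error. Inequalities (3) are called soft-margin constraints. The dual problem of (2) and (3) is given by

$$\max \; Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (4)$$

$$\text{subject to} \quad \sum_{i=1}^{M} y_i \alpha_i = 0, \quad C \ge \alpha_i \ge 0 \quad \text{for } i = 1, \ldots, M, \qquad (5)$$

where $K(x, x') = \phi^\top(x)\, \phi(x')$ is the kernel function, $\alpha = (\alpha_1, \ldots, \alpha_M)^\top$, $\alpha_i$ is the Lagrange multiplier for the $i$th inequality constraint, and $x_i$ associated with positive $\alpha_i$ ($> 0$) is called a support vector. The decision function given by (1) is rewritten as follows:

$$f(x) = \sum_{i \in S} \alpha_i y_i K(x_i, x) + b, \qquad (6)$$

where $S$ is the set of support vector indices.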
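As a minimal sketch of how a trained L1 SVM is evaluated, the following Python fragment computes the kernel form of the decision function and checks the dual constraints. The function names, the linear kernel, and the toy values are illustrative assumptions and are not taken from the experiments in this paper.

```python
def linear_kernel(x, z):
    """Illustrative kernel K(x, z) = x^T z; any valid kernel could be passed instead."""
    return sum(a * b for a, b in zip(x, z))


def decision_function(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return total + b


def check_dual_feasibility(labels, alphas, C, tol=1e-9):
    """Dual constraints: sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(y * a for y, a in zip(labels, alphas))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alphas)
    return balanced and boxed


# Toy one-dimensional example (hypothetical values).
svs = [(1.0,), (-1.0,)]
ys = [1, -1]
alphas = [0.5, 0.5]
print(decision_function((2.0,), svs, ys, alphas, b=0.0))  # positive value: Class 1
print(check_dual_feasibility(ys, alphas, C=1.0))
```

Only the support vectors enter the sum, which is why the dual form gives a sparse classifier.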

Minimal Complexity Machines
The VC (Vapnik-Chervonenkis) dimension is a measure for estimating the generalization ability of a classifier, and lowering the VC dimension leads to a higher generalization ability. For an SVM-like classifier with the minimum margin $\delta_{\min}$, the VC dimension $D$ is bounded by [1]

$$D \le 1 + \min\left( \frac{R^2}{\delta_{\min}^2},\; l \right), \qquad (7)$$

where $R$ is the radius of the smallest hypersphere that encloses all the training data.
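To make the role of the radius-margin ratio concrete, the short sketch below evaluates the bound under the assumption that it has the standard form $D \le 1 + \min(R^2/\delta_{\min}^2,\, l)$. The function name and the numerical values are hypothetical.

```python
def vc_dimension_bound(radius, min_margin, feature_dim):
    """Upper bound 1 + min(R^2 / delta_min^2, l), assuming the standard form of the bound."""
    return 1 + min((radius / min_margin) ** 2, feature_dim)


# Halving the radius-margin ratio tightens the bound until the
# feature-space dimension l becomes the smaller term.
print(vc_dimension_bound(radius=2.0, min_margin=1.0, feature_dim=10))
print(vc_dimension_bound(radius=2.0, min_margin=0.5, feature_dim=10))
```

When the ratio is large the bound is capped by the feature-space dimension, so reducing $R/\delta_{\min}$ only helps once the first term is the smaller one.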
In training the L1 SVM, both R and l are not changed. In the LS SVM, where ξ i are replaced with ξ 2 i in (2) and the inequality constraints, with equality constraints in (3), although both R and l are not changed by training, the second term in the objective function works to minimize the square sums of y i f (x i ) − 1. Therefore, like the LDM and ULDM, this term works to condense the margin distribution in the direction orthogonal to the separating hyperplane.
In [27], first the linear MCM, in which the classification problem is linearly separable, is derived by minimizing the VC dimension, i.e., R/δ min in (7). In the input space, R and δ min are calculated, respectively, by where 1 is added to x i to make the separating hyperplane pass through the origin in the augmented space. Equation (8) shows that the minimum hypersphere is also assumed to pass through the origin in the augmented space. After some derivations, the linear MCM is derived. The nonlinear MCM is derived as follows: where h is the upper bound of the soft-margin constraints and K ij = K(x i , x j ). Here, the mapping function φ(x) in (1) is [32] φ(x) = (K 11 , . . . , K 1M ) , and w = α. The MCM can be solved by linear programming.
Because the upper bound h in (11) is minimized in (10), the separating hyperplane is determined so that the maximum distance between the training data and the separating hyperplane is minimized. This is a similar idea to that of the LS SVM, LDM, and ULDM.
The MCM is derived based on the minimization of the VC dimension, and thus considers maximizing the minimum margin. However, the MCM does not explicitly include the term related to the margin maximization. This makes the solution non-unique and unbounded.
To make the solution unique under the condition that the extended MCM still is a linear programming problem, in [29] we proposed the minimal complexity LP SVM (MLP SVM), which fuses the MCM and the LP SVM: where C h is the positive parameter and C α = 1. Deleting C h h in (13) and upper bound h in (14), we obtain the LP SVM. Setting C h = 1 and C α = 0 in (13), we obtain the MCM. Compared to the MCM and the LP SVM, the number of hyperparameters increases by 1.
According to the computer experiment, in general, the MLP SVM is better than the MCM and LP SVM in generalization abilities. However, it is inferior to the L1 SVM [29].

Minimal Complexity L1 Support Vector Machines
In this section, we discuss the ML1 SVM, which consists of two optimization subprograms and derive the KKT conditions that the optimal solution of the ML1 SVM must satisfy. Then, we discuss a variant of the ML1 SVM, whose two subprograms are more similar.

Architecture
The LDM and ULDM maximize the average margin and minimize the average margin variance, and the LS SVM makes data condense around the minimum margin. The idea of the MCM to minimize the VC dimension results in minimizing the maximum distance of the data from the separating hyperplane. Therefore, these classifiers have an idea in common: condense data as near as possible to the separating hyperplane.
In [29], we proposed fusing the MCM and the LP SVM. Similar to this idea, here we fuse the MCM given by (10) and (11) and the L1 SVM given by (2) and (3): where ξ = (ξ 1 , . . . , ξ M ) , C h is the parameter to control the volume that the training data occupy, and h is the upper bound of the constraints. The upper bound defined by (11) is redefined by (17) and (18), which exclude ξ i . This makes the KKT conditions for the upper bound simpler. We call (17) the upper bound constraints and the above classifier the minimal complexity L1 SVM (ML1 SVM). The right-hand side of (17) shows the margin with the minimum margin normalized to 1 if the solution is obtained. Therefore, because h is the upper bound of the margins and h is minimized in (15), the ML1 SVM maximizes the minimum margin and minimizes the maximum margin simultaneously. If we use (12), we can directly solve (15) to (18). However, sparsity of the solution will be lost. Therefore, in a way similar to solving the L1 SVM, we solve the dual problem of (15) to (18).
For the solution of (32) to (35), x i associated with positive α i or α M+j is a support vector. However, from (28), the decision function is determined by the support vectors that satisfy α i − α M+i ≠ 0.
We consider decomposing the above optimization problem into two subproblems: (1) optimizing Then the optimization problem given by (32) to (35) is decomposed into the following two subproblems: subject to where α 0 = (α 1 , . . . , α M ) . Subproblem 2: Optimization of α M+i subject to where α M = (α M+1 , . . . , α 2M ) . Here we must notice that is satisfied. This contradicts (42). We solve Subproblems 1 and 2 alternately until the solution converges. Subproblem 1 is very similar to the L1 SVM and can be solved by SMO (sequential minimal optimization). Subproblem 2, which includes the constraint (42), can also be solved by a slight modification of SMO.
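The distinction between being a support vector and contributing to the decision function can be sketched as follows. The sketch assumes, consistently with the statement above, that each training sample enters the decision function through the coefficient $y_i(\alpha_i - \alpha_{M+i})$; the function name, the tolerance, and the toy values are hypothetical.

```python
def effective_coefficients(alpha, alpha_upper, labels, tol=1e-12):
    """Return {i: y_i * (alpha_i - alpha_{M+i})} for samples whose difference is nonzero.

    alpha       : multipliers of the soft-margin constraints (alpha_1..alpha_M)
    alpha_upper : multipliers of the upper-bound constraints (alpha_{M+1}..alpha_{2M})
    Assumption: the decision function depends on the data only through these products.
    """
    coeffs = {}
    for i, (a, a_up, y) in enumerate(zip(alpha, alpha_upper, labels)):
        diff = a - a_up
        if abs(diff) > tol:
            coeffs[i] = y * diff
    return coeffs


# Sample 2 is a support vector (both multipliers are positive) but the
# two multipliers cancel, so it does not contribute to the decision function.
alpha = [0.5, 0.0, 0.25, 0.75]
alpha_upper = [0.0, 0.0, 0.25, 0.25]
labels = [1, -1, 1, -1]
print(effective_coefficients(alpha, alpha_upper, labels))
```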

KKT Conditions
To check the convergence of Subproblems 1 and 2, we use the KKT complementarity conditions (24) to (27). However, variables h and b, which are included in the KKT conditions, are excluded from the dual problem given by (32) to (35). Therefore, as with the L1 SVM [33], to make an accurate convergence test, the exact KKT conditions that do not include h and b need to be derived.
We rewrite (24) as follows: where We can classify the conditions of (45) into the following three cases: Then the KKT conditions for (45) are simplified as follows: To detect the violating variables, we define b low and b up as follows: The bias term is estimated to be where b e is the estimate of the bias term using (24). Likewise, using (46), (25) becomes Then the conditions for (25) are rewritten as follows: we have The KKT conditions for (25) are simplified as follows: To detect the violating variables, we define b − low , b + low , b − up , and b + up as follows: In general, the distributions of Classes 1 and 2 data are different. Therefore, the upper bounds of h for Classes 1 and 2 are different. This may mean that either of exist. However, because of (41), both classes have at least one positive α M+i each, and because of (53), the values of h for both classes can be different. This happens because we separate (33) into two equations as in (36). Then, if the KKT conditions are satisfied, both of the following inequalities hold From the first inequality, the estimate of h, h − e for Class 2, is given by From the second inequality, the estimate of h, h + e for Class 1, is given by

Variant of Minimal Complexity Support Vector Machines
Subproblem 2 of the ML1 SVM is different from Subproblem 1 in that the former includes the inequality constraint given by (42). This makes the solution process more complicated. In this section, we consider making the solution process similar to that of Subproblem 1.
Solving Subproblem 2 results in obtaining h + e and h − e . We consider assigning separate variables h + and h − for Classes 1 and 2 instead of a single variable h. Then the complementarity conditions for h + and h − are where η + and η − are the Lagrange multipliers associated with h + and h − , respectively. To simplify Subproblem 2, we assume that η + = η − = 0. This and (41) make (41) and (42) two equality constraints.
Then the optimization problem given by (40) to (43) becomes Here, (41) is not necessary because of (67). We call the above architecture ML1 v SVM. For the solution of the ML1 SVM, the same solution is obtained by the ML1 v SVM with the C h value given by However, the reverse is not true, namely, the solution of the ML1 v SVM may not be obtained by the ML1 SVM. As the C h value becomes large, the value of η becomes positive for the ML1 SVM, but for the ML1 v SVM, the values of α M+i are forced to become larger.

Training Methods
In this section we extend the training method for the L1 SVM that fuses SMO and Newton's method [31] for training the ML1 SVM. The major part of the training method consists of calculation of corrections by Newton's method and the working set selection method. The training method of the ML1 v SVM is similar to that of the ML1 SVM. Therefore, we only explain the difference of the methods in Section 4.3.

Calculating Corrections by Newton's Method
In this subsection, we discuss the corrections by Newton's method for two subprograms.

Subprogram 1
First we discuss optimization of Subproblem 1. We optimize the variables α i in the working set {α i |i ∈ W, i ∈ {1, . . . , M}}, where W includes working set indices, fixing the remaining variables, by Here We can solve the above optimization problem by the method discussed in [31]. We select α s in the working set and solve (71) for α s : Substituting (73) into (70), we eliminate the equality constraint. Let α W = (. . . , α i , . . .) (i ≠ s, i ∈ W). Now because Q(α W ) is quadratic, we can express the change of Q(α W ), ∆Q(α W ), as a function of the change of α W , ∆α W , by Then, neglecting the bounds, ∆Q(α W ) has the maximum at where Here, the partial derivative of Q with substitution of (73) is calculated by the chain rule without substitution.
We assume that −∂ 2 Q(α)/∂α 2 W is positive definite. If not, we avoid matrix singularity by adding a small value to the diagonal elements.
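The Newton correction for a concave quadratic objective can be sketched in a few lines. The sketch below assumes the correction has the usual form $\Delta\alpha_W = (-H)^{-1} g$, where $g$ and $H$ are the gradient and Hessian of $Q$ with respect to the working-set variables; for simplicity it always adds the small diagonal value, whereas the text adds it only when the matrix is not positive definite. All names are hypothetical and the bounds on the variables are ignored here.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def newton_correction(grad, hess, ridge=1e-8):
    """Unconstrained maximizer of g^T d + 0.5 d^T H d: d = (-H + ridge*I)^{-1} g."""
    n = len(grad)
    neg_h = [[-hess[i][j] + (ridge if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return solve_linear(neg_h, grad)


# Q(d) = -(d1 - 1)^2 - (d2 - 2)^2 expanded at d = 0: gradient (2, 4), Hessian -2 I.
print(newton_correction([2.0, 4.0], [[-2.0, 0.0], [0.0, -2.0]], ridge=0.0))
```

With a singular Hessian such as `[[-1, -1], [-1, -1]]`, the diagonal term keeps the linear system solvable at the cost of a slightly damped step.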
Then from (73) and (75), we obtain the correction of α s : For we delete these variables from the working set and repeat the procedure for the reduced working set. Let ∆α i be the maximum or minimum correction of α i that is within the bounds. Then if α i + ∆α i < 0, where r (0 < r ≤ 1) is the scaling factor. The corrections of the variables in the working set are given by We select α M+s in the working set W and solve (83) for α M+s : Then similar to Subproblem 1, we substitute (86) into (82) and eliminate α M+s . The partial derivatives in (75) are as follows: From (86), we obtain the correction of α M+s : Now we consider the constraint (84). If η = 0, the sum of corrections needs to be zero. This is achieved if Namely, we select the working set from the same class.
we delete these variables from the working set and repeat the procedure for the reduced working set. Let ∆α M+i be the maximum or minimum correction of α M+i that is within the bounds. Then if where r (0 < r ≤ 1) is the scaling factor. If r ∑ i∈W ∆α M+i > η > 0, we further calculate Then the corrections of the variables in the working set are given by where r is replaced by r r if r ∑ i∈W α M+i > η > 0.
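The clipping of the Newton corrections by a common scaling factor can be illustrated as follows. This is a sketch under the assumption that $r$ is the largest factor in $(0, 1]$ for which every corrected variable stays within its bounds $[0, \text{upper}]$; variables that cannot move at all are assumed to have been removed from the working set beforehand, and the additional adjustment for $\eta$ in Subproblem 2 is not covered. The names are hypothetical.

```python
def scaled_corrections(alpha, delta, upper):
    """Scale all corrections by one factor r so that 0 <= alpha_i + r*delta_i <= upper.

    Assumes every alpha_i is feasible and can move a positive distance
    in the direction of delta_i.
    """
    r = 1.0
    for a, d in zip(alpha, delta):
        if a + d < 0.0:
            r = min(r, -a / d)            # step that lands exactly on the lower bound
        elif a + d > upper:
            r = min(r, (upper - a) / d)   # step that lands exactly on the upper bound
    return r, [r * d for d in delta]


r, corr = scaled_corrections(alpha=[0.5, 0.5], delta=[-1.0, 0.25], upper=1.0)
print(r, corr)
```

Using one factor for all variables keeps the direction of the Newton step, so a linear equality constraint that the full step satisfies is still satisfied by the scaled step.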

Working Set Selection
At each iteration of training, we optimize either Subproblem 1 (α i ) or Subproblem 2 (α M+i ). To do so, we define the most violating indices as follows: We consider that training is converged when where τ is a small positive parameter.
If (96) is not satisfied, we correct variables associated with V i where i is determined by (97). According to the conditions of (96) and (97), we determine the variable pair as follows:
1. If V 1 is the maximum in (97), we optimize Subproblem 1 (α i ). Let the variable pair associated with b up and b low be α i min and α i max , respectively.
2. If η = 0 and either V 2 or V 3 is the maximum in (97), or if η ≠ 0 and either V 2 or V 3 exceeds τ but not both, we optimize Subproblem 2 (α M+i belonging to either Class 1 or 2). Let the variable pair associated with b k up and b k low (k is either + or −) be α M+i k min and α M+i k max , respectively.
3. If η ≠ 0 and both V 2 and V 3 exceed τ, we optimize Subproblem 2 (α M+i selected from Classes 1 and 2). Let the variable pair be α M+i − min and α M+i + max . This is to make the selected variables correctable as will be shown in Section 4.4.2.
For the ML1 v SVM, η = 0. Therefore, Case (3) is not necessary. Because the working set selection strategies for α i and α M+i are the same, in the following we only discuss the strategy for α i .
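Before turning to the selection of the variable pair, the case analysis above can be summarized as a small dispatch function. The sketch assumes that training is regarded as converged when none of the three violation measures exceeds τ, and that the two-class case applies only when η is positive; the function name and the returned strings are hypothetical labels.

```python
def choose_subproblem(v1, v2, v3, eta, tau):
    """Pick which subproblem to optimize from the violations V1, V2, V3.

    v1 is the violation for Subproblem 1; v2 and v3 are the violations for
    the two classes of Subproblem 2 (which class is which is left open here).
    """
    if max(v1, v2, v3) <= tau:
        return "converged"  # assumed form of the convergence test
    if v1 >= v2 and v1 >= v3:
        return "subproblem 1"
    if eta > 0.0 and v2 > tau and v3 > tau:
        return "subproblem 2, both classes"
    # One class only: the larger of V2 and V3 is the violating one.
    return "subproblem 2, class of V2" if v2 >= v3 else "subproblem 2, class of V3"


print(choose_subproblem(0.1, 0.5, 0.2, eta=0.0, tau=0.01))
print(choose_subproblem(0.1, 0.5, 0.2, eta=1.0, tau=0.01))
```

For the ML1 v SVM, calling the function with `eta=0.0` never returns the two-class case, in line with the remark that Case (3) is not needed there.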
In the first order SMO, at each iteration step, α i min and α i max that violate the KKT conditions the most are selected for optimization. This guarantees the convergence of the first order SMO [33]. In the second order SMO, the pair of variables that maximize the objective function are selected. However, to reduce computational burden, fixing α i min , the variable that maximizes the objective function is searched [34]: where V KKT is the set of indices that violate the KKT conditions: We call the pair of variables that are determined either by the first or the second order SMO, SMO variables.
Because the second order SMO accelerates training for a large C value [35], in the following we use the second order SMO. However, for a substantially large C value, training speed of the second order SMO slows down because variables need to be updated many times to reach the optimal values. To speed up convergence in such a situation, in addition to the SMO variables, we add variables that are selected in the previous steps as SMO variables, into the working set.
We consider that a loop is detected when at least one of the current SMO variables has already appeared as an SMO variable at a previous step. When a loop is detected, we pick up the loop variables that are the SMO variables in the loop. To avoid obtaining an infeasible solution by adding loop variables to the working set, we restrict loop variables to be unbounded support vectors, i.e., α i ≠ C (This happens only for Subprogram 1).
Because we optimize Subprograms 1 and 2 alternatingly, loop variables may include those for Subprograms 1 and 2. Therefore, in optimizing Subprogram 1, we need to exclude the loop variables for Subproblem 2; and vice versa. In addition, in optimizing Subproblem 2 with η = 0, we need to exclude variables belonging to the unoptimized class, in addition to those for Subproblem 1.
If no loop is detected, the working set includes only the SMO variables. If a loop is detected, the working set consists of the SMO variables and the loop variables. For the detailed procedure, please refer to [31].
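A rough sketch of the loop detection and working set construction is given below. The detailed procedure is described in [31]; here we simply assume that the loop starts at the most recent previous step that shares a variable with the current SMO pair, and that loop variables are kept only if they are strictly inside their bounds. The function name, the history structure, and the toy values are hypothetical, and the exclusion of variables that belong to the other subproblem is omitted.

```python
def build_working_set(smo_pair, history, alpha, upper):
    """Return the working set for the current step and record the pair in history.

    smo_pair : indices of the two SMO variables selected at this step
    history  : list of SMO pairs from previous steps, oldest first (updated in place)
    alpha    : current values of the variables of the subproblem being optimized
    upper    : upper bound of those variables
    """
    working = list(smo_pair)
    current = set(smo_pair)
    loop_start = None
    for step in range(len(history) - 1, -1, -1):
        if current & set(history[step]):
            loop_start = step          # assumed start of the loop
            break
    if loop_start is not None:
        for pair in history[loop_start:]:
            for i in pair:
                # Keep only unbounded support vectors as loop variables.
                if i not in working and 0.0 < alpha[i] < upper:
                    working.append(i)
    history.append(tuple(smo_pair))
    return working


alpha = [0.5, 0.5, 1.0, 0.5, 0.5]
history = [(0, 1), (2, 3)]
print(build_working_set((1, 4), history, alpha, upper=1.0))
```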

Training Procedure of ML1 SVM
In the following we show the training procedure of the ML1 SVM. 1.
(Initialization) Select α i and α j in the opposite classes and set (Corrections) If Pr1 (Program 1), calculate partial derivatives (76) and (77)  In Subproblem 2 of the ML1 v SVM, data for Class 1 and Class 2 can be optimized separately. Therefore, because the data that are optimized belong to the same class, the sum of corrections is zero. Thus, in Step 1, a = 1. In Step 3, if Pr2 is optimized, the variables associated with V 2 or V 3 are optimized, not both.

Convergence Proof
Convergence of the first order SMO for the L1 SVM is proved in [33]. Similarly, we can prove that the training procedure discussed in Section 4.3 converges to the unique solution. In the following we prove the convergence for the ML1 SVM. The proof for the ML1 v SVM is evident from the discussion.
Subprograms 1 and 2 are quadratic programming problems and thus have unique maximum solutions. Therefore, it is sufficient to prove that the objective function increases by the first order SMO. For the second order SMO, the increase of the objective function is also guaranteed because the variable pair that gives the largest increase of the objective function is selected. By combining SMO with Newton's method, if some variables are not correctable, they are deleted from the working set. By this method, in the worst case, only the SMO variables remain in the working set. Therefore, we only need to show that the first order SMO converges for the ML1 SVM.

Convergence Proof for Subprogram 1
From (48), α i max satisfies Moreover, from (49), α i min satisfies Because the KKT conditions are not satisfied, F i max > F i min . Moreover, we set α s = α i min . Then from (76) and (77), where in (103), if the equality holds, we replace zero with a small negative value to avoid zero division in (75). Then from (102), (103), and (75), the signs of the corrections ∆α i min and ∆α i max are given by 1. ∆α i max > 0 and ∆α i min < 0 for y i min = y i max = 1, 2.
From (100) and (101), the above corrections are all possible. For instance, in (1) ∆α i max > 0 and Because the corrections are not zero, the objective function increases.
If both V 2 and V 3 are larger than τ, we select α M+i − min and α M+i + max . Then from (104) and (105), these variables are correctable whether they be increased or decreased.
Because the corrections are not zero, the objective function increases.

Computer Experiments
First we analyze the behavior of the ML1 SVM and ML1 v SVM using a two-dimensional iris dataset and then to examine the superiority of the proposed classifiers over the L1 SVM, we evaluate their generalization abilities and computation time using two-class and multiclass problems. All the programs used in the performance evaluation were coded in Fortran and tested using Windows machines.

Analysis of Behaviors
We analyzed the behaviors of the ML1 SVM and ML1 v SVM using the iris dataset [36], which is frequently used in the literature. The iris dataset consists of 150 data with three classes and four features. We used Classes 2 and 3 and Features 3 and 4. For both classes, the first 25 data were used for training and the remaining data were used for testing. We used linear kernels: $K(x, x') = x^\top x'$.
We evaluated the $h^+$ and $h^-$ values as the $C_h$ value was changed from 0.1 to 2000 with $C$ fixed to 1000. Figure 1 shows the result for the ML1 v SVM. Both the $h^+$ and $h^-$ values are constant for $C_h$ = 0.1 to 100, and they decrease as the $C_h$ value is increased further. For the ML1 SVM, the $h^+$ and $h^-$ values are constant for the change of $C_h$ and are the same as those of the ML1 v SVM with $C_h$ = 0.1 to 100. For $C_h$ = 2000, $\eta$ = 1960.00. Thus, $\sum_{i:\, y_i = 1} \alpha_{M+i} = 20.00$. For $C_h$ = 10,000, $\eta$ = 9732.75. Thus, $\sum_{i:\, y_i = 1} \alpha_{M+i} = 133.63$. This means that the $C_h$ value is too small to obtain a solution comparable to that of the ML1 v SVM. Therefore, the ML1 SVM is insensitive to the $C_h$ value compared to the ML1 v SVM.
Figure 3 shows the decision boundary of the ML1 v SVM for $C$ = 1000 and $C_h$ = 0.1, 1000. For $C_h$ = 0.1, the decision boundary is almost parallel to the $x_1$ axis. However, for $C_h$ = 1000, the decision boundary rotates in the clockwise direction to make the data more condensed to the decision boundary. The accuracy for the test data is 92% for $C_h$ = 0.1 and is increased to 94% for $C_h$ = 1000.
From the above experiment we confirmed that the solutions of the ML1 SVM and the ML1 v SVM are the same for small C h values but for large C h values both are different and in extreme cases the solution of the ML1 v SVM may be infeasible.
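The per-class sums reported for the ML1 SVM can be checked with a short calculation. The check assumes that the multipliers of the upper-bound constraints sum to $C_h - \eta$ and that this total is split evenly between the two classes; both relations are inferred from the reported numbers and the constraints discussed in Section 3, and the function name is hypothetical.

```python
def class_sum_of_upper_duals(c_h, eta):
    """Per-class sum of alpha_{M+i}, assuming a total of C_h - eta shared equally."""
    return (c_h - eta) / 2.0


for c_h, eta in [(2000.0, 1960.00), (10000.0, 9732.75)]:
    print(c_h, eta, class_sum_of_upper_duals(c_h, eta))
```

Under these assumptions the two cases agree with the reported values 20.00 and 133.63 to within rounding, which shows how far the effective value falls below the nominal $C_h$ once $\eta$ becomes positive.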

Performance Comparison
In this section, we compare the generalization performance of the ML1 SVM and ML1 v SVM with the L1 SVM, MLP SVM [29], LS SVM, and ULDM [20] using two-class and multiclass problems. Our main goal is to show that the generalization abilities of the ML1 SVM and ML1 v SVM are better than the generalization ability of the L1 SVM.

Comparison Conditions
We determined the hyperparameter values using the training data by fivefold cross-validation, trained the classifier with the determined hyperparameter values, and evaluated the accuracy for the test data.
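The model selection step can be sketched as a generic grid search with fivefold cross-validation. This is not the Fortran code used in the experiments; the fold assignment, the tie-breaking rule (the first best grid point wins), and the `evaluate` callback, which is expected to train a classifier on the training indices and return the validation accuracy, are all assumptions made for illustration.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k interleaved folds."""
    return [list(range(f, n_samples, k)) for f in range(k)]


def grid_search_cv(n_samples, grid, evaluate, k=5):
    """Return the grid point with the highest mean validation accuracy.

    grid     : dict mapping a hyperparameter name to its candidate values
    evaluate : hypothetical callback evaluate(train_idx, valid_idx, params) -> accuracy
    """
    folds = k_fold_indices(n_samples, k)
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for f in range(k):
            valid = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(evaluate(train, valid, params))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Dummy scoring function standing in for training and validating a classifier.
def dummy_evaluate(train_idx, valid_idx, params):
    return 1.0 - abs(params["C"] - 10) / 100.0


print(grid_search_cv(50, {"C": [1, 10, 100], "C_h": [0.1, 1]}, dummy_evaluate))
```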
We trained the ML1 SVM, ML1 v SVM, and L1 SVM by SMO combined with Newton's method. We trained the MLP SVM by the simplex method and the LS SVM and ULDM by matrix inversion.
We used RBF kernels:

$$K(x, x') = \exp\left( -\gamma \| x - x' \|^2 / m \right),$$

where $\gamma$ is the parameter to control the spread of the radius and $m$ is the number of inputs to normalize the kernel, and polynomial kernels including linear kernels:

$$K(x, x') = (P + x^\top x')^d,$$

where $P = 0$ and $d = 1$ for linear kernels and $P = 1$ and $d = 2, 3, \ldots$ for polynomial kernels. After model selection, we trained the classifier with the determined hyperparameter values and calculated the accuracy for the test data. For two-class problems, which have multiple sets of training and test pairs, we calculated the average accuracies and their standard deviations, and performed Welch's t test with the significance level of 5%.
Table 1 lists the two-class problems used [37], which were generated using the datasets from the UC Irvine Machine Learning Repository [38]. In the table, the numbers of input variables, training data, test data, and training and test data pair sets are listed. The table also includes the maximum average prior probability expressed as an accuracy (%). The numeral in parentheses shows that for the test data. According to the prior probabilities, the class data are relatively well balanced and there is not much difference between the training and test data.
Table 2 shows the evaluation results using RBF kernels. In the table, for each classifier and each classification problem, the average accuracy and the standard deviation are shown. For each problem the best average accuracy is shown in bold and the worst, underlined. The "+" and "−" symbols at the accuracy show that the ML1 SVM is statistically better and worse than the classifier associated with the attached symbol, respectively. For instance, the ML1 SVM is statistically better than the ULDM for the flare-solar problem. The "Average" row shows the average accuracy of the 13 problems for each classifier and "B/S/W" denotes the number of times that the associated classifier shows the best, the second best, and the worst accuracy. The "W/T/L" row denotes the number of times that the ML1 SVM is statistically better than, comparable to, and worse than the associated classifier.
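For concreteness, the two kernel families can be written as short functions. The sketch assumes the RBF kernel is normalized by the number of inputs $m$, i.e., $\exp(-\gamma \|x - x'\|^2 / m)$, and the polynomial kernel has the form $(P + x^\top x')^d$; the function and argument names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel with the squared distance divided by the number of inputs m."""
    m = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist / m)


def polynomial_kernel(x, z, degree, offset):
    """Polynomial kernel (P + x^T z)^d; offset P = 0 and degree d = 1 give the linear kernel."""
    return (offset + sum(a * b for a, b in zip(x, z))) ** degree


x, z = (1.0, 2.0), (3.0, 4.0)
print(rbf_kernel(x, z, gamma=0.5))
print(polynomial_kernel(x, z, degree=1, offset=0))   # linear kernel
print(polynomial_kernel(x, z, degree=2, offset=1))
```

Dividing by $m$ keeps the useful range of $\gamma$ roughly comparable across problems with different numbers of inputs, which simplifies a shared grid for model selection.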

Two-Class Problems
According to the "W/T/L" row, the ML1 SVM is statistically better than the MLP SVM but is slightly inferior to ULDM. It is comparable to the remaining classifiers. From the "Average" measure, the ULDM is the best, the ML1 v SVM, the second, the L1 SVM and the ML1 SVM, the third. From the "B/S/W" measure, the ULDM is the best and the LS SVM is the second best. Table 3 shows the results for polynomial kernels. For each problem the best average accuracy is shown in bold and the worst, underlined. We do not include the MLP SVM for comparison. From the table, the ML1 SVM is statistically comparable to the L1 SVM, slightly better than ML1 v SVM, and better than the LS SVM and ULDM. From the average accuracies, the L1 SVM is the best, ML1 SVM, the second best, and the ULDM, the worst.
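The statistical comparisons summarized in the "W/T/L" rows rest on Welch's t test applied to the accuracies obtained over the training and test pairs. A minimal sketch of the test statistic and the Welch-Satterthwaite degrees of freedom is given below; the accuracy lists are made-up numbers, and the final decision at the 5% level would still require comparing the statistic with the corresponding critical value of the t distribution, which is not computed here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = statistics.variance(sample_a) / n_a   # squared standard error of mean a
    se2_b = statistics.variance(sample_b) / n_b   # squared standard error of mean b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Hypothetical accuracies (%) of two classifiers over five training/test pairs.
acc_a = [91.0, 92.0, 93.0, 94.0, 95.0]
acc_b = [88.0, 90.0, 92.0, 94.0, 96.0]
print(welch_t(acc_a, acc_b))
```

Unlike the pooled two-sample t test, this statistic does not assume equal variances, which suits classifiers whose accuracies fluctuate by different amounts across the data splits.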
The ML1 v SVM is statistically inferior to ML1 SVM for the cancer and splice problems. We study why this happened. For the second file of the breast-cancer problem, the accuracy for the test data is 67.53% by the ML1 v SVM, which is by 7.79% lower than by the ML1 SVM. The parameter values are selected as d = 2, C = 1, and C h = 10. For the C h value higher than or equal to 50, the accuracy is 75.32%, which is the same as that by ML1 SVM. Therefore, model selection did not work properly. For the 17th file of the splice problem, the accuracy for the ML1 v SVM is 83.68% with d = 1, and C = C h = 1. This is caused by C h = 1 when the d and C values are determined by grid search. If C h = 0.1, by model selection d = 2, C = C h = 0.1 are obtained, and the accuracy is 87.36%, which is better than 87.22% by the ML1 SVM. Therefore, in this case also, the low accuracy was obtained by the problem of model selection. For each problem the best average accuracy is shown in bold and the worst underlined. The "+" and "−" symbolsatthe accuracy show that the ML1 SVM is statistically better and worse than the classifier associated with the attached symbol, respectively.
The ULDM performs worse than the ML1 SVM for the banana and thyroid problems. For the banana problem, the average accuracy of the ULDM is 6.66% lower than that of the ML1 SVM. This was caused by a poor choice of the parameter ranges. By setting C = {10^8, 10^10, 10^12, 10^14, 10^16} and d = {6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, the average accuracy and standard deviation become 88.35 ± 1.10%, which is statistically comparable to that of the ML1 SVM. However, for the thyroid problem, changing the parameter ranges does not improve the average accuracy much.

Table 3. Accuracies of the test data for the two-class problems using polynomial kernels. For each problem the best average accuracy is shown in bold and the worst underlined. The "+" and "−" symbols at the accuracy show that the ML1 SVM is statistically better and worse than the classifier associated with the attached symbol, respectively.

According to the experiment for the two-class problems, generally the accuracies using the RBF kernels are better than those using polynomial kernels, but for both RBF and polynomial kernels, the ML1 SVM, ML1 v SVM, and L1 SVM perform well, while the LS SVM and ULDM do not for the polynomial kernels.

Multiclass Problems
We used pairwise (one-vs-one) classification for multiclass problems. To resolve the unclassifiable regions that arise in pairwise classification, we used fuzzy classification, introducing a membership function for each decision function [2]. Table 4 shows the ten multiclass problems used for performance evaluation. Unlike the two-class problems, for each problem there is one training dataset and one test dataset. The numeral problem [2] is to classify numerals in Japanese license plates, and the thyroid problem [38] is a medical diagnosis problem. The blood cell problem [2] classifies white blood cells labeled according to the maturity of the growing stage. The three hiragana problems [2] are to classify hiragana characters in Japanese license plates. The satimage problem [38] classifies land according to satellite images. The USPS [39] and MNIST [40,41] problems treat numeral classification and the letter problem [38] treats alphabetic characters. Because the original training dataset of the MNIST problem is too large, especially for the LS SVM and ULDM, we switched the roles of the training data and the test data.
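As a rough illustration of how fuzzy classification can resolve an unclassifiable region, the following Python sketch assumes one common formulation: the membership of class i against class j is the pairwise decision value D_ij(x) clipped at 1 from above, the class membership is the minimum over all j, and the input is assigned to the class with the largest membership. The clipping rule, the function name, and the toy decision values are assumptions made for illustration; they are not details given in this section.

```python
def fuzzy_pairwise_classify(decision, n_classes):
    """Assign a class from pairwise decision values.

    decision[(i, j)] with i < j holds D_ij(x); D_ji(x) is taken as -D_ij(x).
    Membership rule (assumed): m_ij = min(1, D_ij), m_i = min over j of m_ij.
    """
    memberships = []
    for i in range(n_classes):
        m_i = float("inf")
        for j in range(n_classes):
            if i == j:
                continue
            d_ij = decision[(i, j)] if i < j else -decision[(j, i)]
            m_i = min(m_i, min(1.0, d_ij))
        memberships.append(m_i)
    best = max(range(n_classes), key=lambda i: memberships[i])
    return best, memberships


# Toy case with cyclic votes: class 0 beats 1, class 1 beats 2, class 2 beats 0.
# Plain voting gives one vote to each class, so the input is unclassifiable.
decision = {(0, 1): 0.8, (0, 2): -0.2, (1, 2): 0.5}
label, memberships = fuzzy_pairwise_classify(decision, 3)
print(label, memberships)
```

In this toy case every class receives one vote, so voting alone cannot decide, whereas the membership values still yield a single largest entry.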
Except for the thyroid problem, the class data are relatively well balanced. For the thyroid data, almost all data belong to one class. Consequently, a classification accuracy below 92.41% is meaningless.

Table 5 lists the accuracies using the RBF kernels for the test data. For each problem, the best accuracy is shown in bold, and the worst, underlined. For the MLP SVM, the accuracies for the thyroid, MNIST, and letter problems were not available. The "Average" row shows the average accuracy of the associated classifier for the ten problems and "B/S/W" shows the numbers of times that the best, the second best, and the worst accuracies are obtained. Among the ten problems, the accuracies of the ML1 SVM and ML1 v SVM are better than or equal to those of the L1 SVM for nine and eight problems, respectively. In addition, the best average accuracy is obtained for the ML1 v SVM, the second best, the ML1 SVM, and the third best, the L1 SVM. This is very different from the two-class problems where the ML1 SVM and ML1 v SVM are comparable to the L1 SVM.

Table 6 shows the accuracies of the test data using polynomial kernels. For each problem the best accuracy is shown in bold and the worst, underlined. From the average accuracy, the ML1 v SVM performs best, the LS SVM performs the second best, and the L1 SVM and ULDM the worst. The difference between the LS SVM and ML1 SVM is very small. The improvement of the ML1 SVM and ML1 v SVM over the L1 SVM is larger than that using the RBF kernels. For all ten problems, their accuracies are better than or equal to that of the L1 SVM. However, as seen from the Average rows in Tables 5 and 6, the accuracies using polynomial kernels are in general worse than those using RBF kernels. The reason for this is not clear, but the ranges of parameter values in model selection might not be well tuned for polynomial kernels.
In the previous experiments we evaluated RBF kernels and polynomial kernels separately, but we can also choose the better of the two kernels. If we have cross-validation results for both kernels, we can select the one with the higher accuracy. Table 7 shows the accuracies by cross-validation for RBF and polynomial kernels. For each problem and for each classifier, the better accuracy is shown in bold. The last row of the table shows the average accuracies. The average accuracies for the polynomial kernels are worse than those for the RBF kernels for all the classifiers. For each classifier, the polynomial kernels perform better than or equal to the RBF kernels for only one or two problems. Moreover, if we select RBF kernels when both average accuracies are the same, selecting the kernel with the better average accuracy improves the accuracy for the test dataset, as seen from Tables 5 and 6. However, employing polynomial kernels in addition to RBF kernels does not improve the accuracy significantly.
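The kernel selection rule described above can be sketched in a few lines of Python. The function name and the accuracy values are hypothetical and are not taken from Table 7; the only rule encoded is that the kernel with the higher cross-validation accuracy is chosen and that a tie goes to the RBF kernel.

```python
def select_kernel(cv_acc_rbf, cv_acc_poly):
    """Return the kernel with the higher cross-validation accuracy.

    Ties are resolved in favor of the RBF kernel.
    """
    return "poly" if cv_acc_poly > cv_acc_rbf else "rbf"


# Hypothetical cross-validation accuracies (in %), for illustration only.
choice_a = select_kernel(97.5, 97.1)  # RBF is better
choice_b = select_kernel(96.0, 96.4)  # polynomial is better
choice_c = select_kernel(95.2, 95.2)  # tie, so RBF is kept
print(choice_a, choice_b, choice_c)
```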

Training Time Comparison
First, we examine the complexity of computation for each classifier excluding the MLP SVM. We trained the ML1 SVM, ML1 v SVM, and L1 SVM by SMO combined with Newton's method. Therefore, the complexity of computation for the subproblem with working set size W is O(W^3). Because the ML1 SVM and ML1 v SVM solve two quadratic programming problems, each having the same number of variables, M, the complexity of computation is the same as that of the L1 SVM. Therefore, the three classifiers are considered to be trained in comparable time.
Because the matrices associated with the LS SVM and ULDM are positive definite, they can be solved by iterative methods such as stochastic gradient methods [17]. However, we trained the LS SVM and ULDM by Cholesky factorization to avoid the inaccuracy caused by insufficient convergence. Therefore, the complexity of computation of both methods is O(M^3).
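To make the O(M^3) cost concrete, the following pure-Python sketch solves a small positive definite system by Cholesky factorization followed by forward and backward substitution. The 2 × 2 matrix and right-hand side are toy values chosen for illustration; they are not the coefficient matrices of the LS SVM or ULDM, and a practical implementation would normally rely on a numerical library instead.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L L^T for a positive definite A."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_spd(a, b):
    """Solve A x = b using the Cholesky factor of A."""
    low = cholesky(a)
    n = len(b)
    # Forward substitution: L y = b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    # Backward substitution: L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


a = [[4.0, 2.0], [2.0, 3.0]]
b = [2.0, 1.0]
x = solve_spd(a, b)
print(x)
```

The three nested loops in the factorization (over i, j, and the summation index k) are what give the cubic growth in the number of variables.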
The purpose of this section is to confirm that the ML1 SVM, ML1 v SVM, and L1 SVM can be trained in comparable time.
Excluding that of the MLP SVM, we compared the time for training and testing a classifier using a Windows machine with 3.2 GHz CPU and 16 GB memory. For the two-class problems, we set the parameter values with the frequently selected values and trained the classifier using a training dataset and tested the trained classifier using the associated test dataset. For the multiclass problems, we trained a classifier with the parameter values obtained by cross-validation and tested the trained classifier with the test dataset.
Because the tendency is similar, we only show the results using RBF kernels. Table 8 shows the parameter values selected for the two-class and multiclass problems. In the table Thyroid (m) denotes the multiclass thyroid problem. For each problem, the γ values in bold, the γ and C values in bold, and γ, C, and C h values in bold show that they appear more than once.
From the table, it is clear that the ML1 SVM, ML1 v SVM, and L1 SVM selected the same γ and C values frequently and the ML1 SVM and ML1 v SVM selected the same γ, C, and C h values 10 times out of 23 problems. Table 9 shows the CPU time for training and testing with the optimized parameter values listed in Table 8. For each problem the shortest CPU time is shown in bold and the longest, underlined. The CPU time for the two-class problems is that per file. For the multiclass problems, we used fuzzy pairwise classification. Therefore, the training time includes that for determining n(n − 1)/2 decision functions, where n is the number of classes. For example, for the letter problem, 325 decision functions need to be determined. The last row of the table shows the numbers of times that the associated classifier is the fastest (B), the second fastest (S), and the slowest (W).
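The count n(n − 1)/2 can be checked with a short Python sketch. The class counts used here are assumptions that are not stated in this section: 26 classes for the letter problem (one per letter of the alphabet) and 10 classes for a numeral problem with digits 0 to 9.

```python
def n_pairwise_functions(n_classes):
    """Number of decision functions in pairwise (one-vs-one) classification."""
    return n_classes * (n_classes - 1) // 2


letter_functions = n_pairwise_functions(26)  # assumed 26 classes
digit_functions = n_pairwise_functions(10)   # assumed 10 classes
print(letter_functions, digit_functions)
```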
From the table, the ML1 SVM and ML1 v SVM show comparable computation time. Moreover, except for the hiragana-50, hiragana-13, hiragana-105, USPS, MNIST, and letter problems, the computation time for the ML1 SVM and ML1 v SVM is comparable to that of the L1 SVM. For these problems, the ML1 SVM and ML1 v SVM are much slower than the L1 SVM. Analyzing the convergence process, we found that for these problems, monotonicity of the objective function value was sometimes violated. To improve convergence, we need to clarify why this happens and to find a way to speed up training in such a situation. However, we leave this problem for future study.

Discussions
As discussed in Section 3.3, the ML1 SVM and ML1 v SVM are equivalent for a small C h value but for a large C h value they are different. The computer experiments in Section 5.1 also revealed that the ML1 SVM is insensitive to the change of a C h value. However, the difference of the generalization performance between the ML1 SVM and ML1 v SVM is not large for the two-class and multiclass problems.
The execution time for the ML1 SVM and ML1 v SVM was sometimes longer than that for the L1 SVM. This causes a problem in model selection. While we leave the discussion of speeding up training to future work, for model selection this problem may be alleviated for the ML1 v SVM. If h + < 1 or h − < 1 is satisfied, the solution is infeasible. Therefore, we can skip cross-validation at the current C h value and the larger ones.
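The skipping rule can be sketched as an early exit from the loop over C h candidates. In the Python sketch below, the function names, the candidate values, and the returned triples are all hypothetical; `evaluate` stands in for whatever routine trains the classifier at a given C h value and reports h +, h −, and a cross-validation accuracy.

```python
def select_c_h(candidates, evaluate):
    """Scan C_h candidates in ascending order and stop at infeasibility.

    evaluate(c_h) is assumed to return (h_plus, h_minus, cv_accuracy).
    """
    best_c_h, best_acc = None, float("-inf")
    for c_h in sorted(candidates):
        h_plus, h_minus, acc = evaluate(c_h)
        if h_plus < 1 or h_minus < 1:
            # Infeasible: skip this C_h value and all larger ones.
            break
        if acc > best_acc:
            best_c_h, best_acc = c_h, acc
    return best_c_h, best_acc


calls = []


def toy_evaluate(c_h):
    """Hypothetical stand-in for training plus cross-validation."""
    calls.append(c_h)
    table = {
        1: (1.8, 1.6, 0.80),
        10: (1.3, 1.2, 0.84),
        50: (1.1, 0.9, 0.00),   # h_minus < 1, so infeasible
        100: (0.8, 0.7, 0.00),
    }
    return table[c_h]


best = select_c_h([100, 1, 50, 10], toy_evaluate)
print(best, calls)
```

In the toy run the largest candidate is never evaluated, which is the saving the rule is meant to provide.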
We used line search to speed up cross-validation. If grid search were used, in fivefold cross-validation we would need to train the ML1 SVM or ML1 v SVM 3520 (11 × 8 × 8 × 5) times, instead of 480 ((11 × 8 + 8) × 5) times. The speedup ratio is thus estimated to be 7.3. We evaluated the difference between grid search and line search for the heart problem. We measured the execution time of cross-validation, training the classifier with the determined parameter values, and testing the classifier using the test data. For the ML1 SVM, the speedup ratio by line search was 35.7, and the average accuracy with the standard deviation by grid search was 82.78 ± 3.45%, which was slightly lower than that by line search. For the ML1 v SVM, the speedup ratio was 40.0, and the average accuracy with the standard deviation was 82.85 ± 3.31%, which was also lower than that by line search. Therefore, because grid search slowed down model selection considerably and did not improve the average accuracy, at least for the heart problem, grid search is not a good choice.
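The training counts quoted above can be reproduced with the following Python sketch. The factors 11, 8, 8, and 5 come from the text; which of 11 and 8 belongs to the kernel parameter and which to C is an assumption made only to name the variables.

```python
n_kernel = 11  # assumed: number of kernel parameter candidates
n_c = 8        # assumed: number of C candidates
n_c_h = 8      # number of C_h candidates (the term added in line search)
n_folds = 5    # fivefold cross-validation

# Grid search trains at every (kernel parameter, C, C_h) combination.
grid_trainings = n_kernel * n_c * n_c_h * n_folds

# Line search first scans the (kernel parameter, C) grid, then C_h alone.
line_trainings = (n_kernel * n_c + n_c_h) * n_folds

estimated_speedup = grid_trainings / line_trainings
print(grid_trainings, line_trainings, round(estimated_speedup, 1))
```

This estimate counts training runs only and treats every run as equally expensive, so it does not by itself predict the measured ratios.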
To speed up model selection of the ML1 SVM or ML1 v SVM, we may use the L1 SVM, considering that the same values were frequently selected for the γ (or d) and C values (see Table 8 for the γ values). For the multiclass problems, four problems do not have the same values. To check whether the idea works, we performed model selection of the C h values while fixing the values of γ and C determined by model selection of the L1 SVM for the hiragana-13, hiragana-105, USPS, and letter problems. Among the four problems, the ML1 SVM and ML1 v SVM performed worse than the L1 SVM for the letter problem. Moreover, the resulting average accuracies of the ML1 SVM and ML1 v SVM for the ten problems were 97.14% and 97.16%, respectively, which were 0.03% lower than those by the original model selection (see Table 5) but were still better than the accuracy of the L1 SVM. If we switch back the roles of the training and test data for the MNIST problem, the selected parameter values for the L1 SVM are the same. The accuracies for the test data were 98.55%, 98.77%, and 98.78% for the L1 SVM, ML1 SVM, and ML1 v SVM, respectively.
For the polynomial kernels, different kernel parameters were selected for six problems: the numeral (only for the ML1 v SVM), blood cell, hiragana-50, hiragana-13, hiragana-105, and USPS problems. For each problem we determined the C h value by cross-validation while fixing the values of d and C determined by the L1 SVM. The accuracies for the test datasets were all better than or equal to those by the L1 SVM. The average accuracies of the ML1 SVM and ML1 v SVM for all ten problems were 96.53% and 96.64%, respectively, which were lower than those by the original model selection by 0.19% and 0.12%, respectively. For the MNIST problem with the switched training and test data, the selected parameter values for the L1 SVM were the same. Moreover, the accuracies for the test data were 98.17%, 98.23%, and 98.34% for the L1 SVM, ML1 SVM, and ML1 v SVM, respectively.

Conclusions
The minimal complexity machine (MCM) minimizes the VC dimension and generalizes better than the standard support vector machine (L1 SVM). However, according to our previous analysis, the solution of the MCM is non-unique and unbounded.
In this paper, to solve the problem of the MCM and to improve the generalization ability of the L1 SVM, we fused the MCM and the L1 SVM; namely, we introduced minimization of the upper bound of the absolute decision function values into the L1 SVM. This corresponds to minimizing the maximum margin. Converting the original classifier into its dual form, we derived two subproblems: the first subproblem corresponds to the L1 SVM and the second subproblem corresponds to minimizing the upper bound. We further modified the second subproblem by converting the inequality constraint into two equality constraints: one for optimizing the variables associated with the positive class and the other for the negative class. We call this architecture the ML1 v SVM and the original architecture the ML1 SVM.
We derived the exact KKT conditions for the first and second subproblems that exclude the bias term and the upper bound and discussed training the two subproblems alternatingly, fusing sequential minimal optimization (SMO) and Newton's method.
According to computer experiments of the two-class problems using RBF kernels, the average accuracy of the ML1 SVM is statistically comparable to that of the ML1 v SVM and L1 SVM. Using polynomial kernels, the ML1 SVM is statistically comparable to the L1 SVM but is slightly better than ML1 v SVM.
For the multiclass problems using RBF kernels, the ML1 v SVM and ML1 SVM generalize better than the L1 SVM and the ML1 v SVM performs best, and the ML1 SVM, the second, among six classifiers tested. Using polynomial kernels, the ML1 v SVM performs best, the LS SVM the second best, ML1 SVM the third, and the L1 SVM worst.
Therefore, the idea of minimizing the VC dimension for the L1 SVM worked to improve the generalization ability of the L1 SVM.
The execution time for the ML1 SVM and ML1 v SVM is comparable to that for the L1 SVM for most of the problems tested, but for some problems it is much longer. In future work, we would like to clarify the reason and propose a fast training method for such cases. Another topic for study is robustness against outliers using a soft upper bound instead of the hard upper bound.