Kernel-Free Quadratic Surface Minimax Probability Machine for a Binary Classification Problem

Abstract: In this paper, we propose a novel binary classification method called the kernel-free quadratic surface minimax probability machine (QSMPM), which makes use of the kernel-free techniques of the quadratic surface support vector machine (QSSVM) and inherits the parameter-free advantage of the minimax probability machine (MPM). Specifically, it attempts to find a quadratic hypersurface that separates two classes of samples with maximum probability. However, the optimization problem derived directly was too difficult to solve. Therefore, a nonlinear transformation was introduced to change the quadratic function involved into a linear function. Through such processing, our optimization problem finally became a second-order cone programming problem, which was solved efficiently by an alternating iteration method. It should be pointed out that our method is both kernel-free and parameter-free, making it easy to use. In addition, the quadratic hypersurface obtained by our method is allowed to be of any general quadratic form, so it has better interpretability than methods that rely on a kernel function. Finally, in order to demonstrate the geometric interpretation of our QSMPM, experiments on five artificial datasets were conducted, including one showing its ability to recover a linear separating hyperplane. Furthermore, numerical experiments on benchmark datasets confirmed that the proposed method achieved better accuracy and less CPU time than the corresponding methods.


Introduction
Machine learning is an important branch of artificial intelligence and has a wide range of applications across contemporary science [1]. With the development of machine learning, the classification problem has received extensive attention and study in the fields of pattern recognition [2], text classification [3], image processing [4], financial time series prediction [5], skin disease diagnosis [6], intrusion detection systems [7], etc. The classification problem is a vital task in supervised learning: a classification rule is learned from a training set with known labels and then used to assign a new sample to a class.
At present, there are many famous classification methods. Among these existing methods, Lanckriet et al. [8,9] proposed an excellent classifier, called the minimax probability machine (MPM). For a given binary classification problem, the MPM not only deals with it in the linear case, but also in the nonlinear case by the kernel trick. It is worth noting that the MPM does not have any parameters, which is an important advantage. Therefore, it has been widely used in computer vision [10], engineering technology [11,12], agriculture [13], and novelty detection [14]. Moreover, many researchers have proposed a variety of improved versions of the MPM from different perspectives [14][15][16][17][18][19][20][21][22][23][24][25]. The representative works can be briefly reviewed as follows. In [15], Thomas and Gregory proposed MPM regression (MPMR), which transformed the regression problem into a classification problem, and then used the classifier MPM to obtain a regression function. To further exploit the structural information of the training set, Gu et al. [17] proposed the structural MPM (SMPM) by combining the finite mixture models with the MPM. In addition, Yoshiyama et al. [21] proposed the Laplacian MPM (Lap-MPM), which improved the performance of the MPM in semisupervised learning. However, the nonlinear MPM using kernel techniques lacks interpretability and usually depends heavily on the choice of a proper kernel function and the corresponding kernel parameters. Furthermore, choosing the appropriate kernel function and adjusting its parameters may require much computational time and effort. Therefore, it naturally occurs to us that the study of a kernel-free nonlinear MPM is of great significance.
In 2008, Dagher [26] proposed the first kernel-free nonlinear classifier, namely the quadratic surface support vector machine (QSSVM). It was based on the maximum margin idea, and the training points were separated by a quadratic hypersurface without a kernel function, avoiding the time-consuming process of selecting an appropriate kernel function and its corresponding parameters. Furthermore, in order to improve the classification accuracy and robustness, Luo et al. [27] proposed the soft-margin quadratic surface support vector machine (SQSSVM). After that, Bai et al. [28] proposed the quadratic kernel-free least-squares support vector machine for target diseases' classification. Following these leading works, some scholars performed further studies, e.g., see [29][30][31][32][33][34] for the classification problem, [35] for the regression problem, and [36] for the clustering problem. The good performance of these methods demonstrates that the quadratic hypersurface is an effective tool to flexibly capture the nonlinear structure of data. Thus, it is very interesting to study a kernel-free nonlinear MPM using the above kernel-free technique.
In this paper, for the binary classification problem, a new kernel-free nonlinear method is proposed, which is called the kernel-free quadratic surface minimax probability machine (QSMPM). It was constructed on the basis of the MPM by using the kernel-free techniques of the QSSVM. Specifically, it tries to seek a quadratic hypersurface that separates two classes of samples with maximum probability. However, the optimization problem derived directly was too difficult to solve. Therefore, a nonlinear transformation was introduced to change the quadratic function involved into a linear function. Through such processing, our optimization problem finally became a second-order cone programming problem, which was solved efficiently by an alternating iteration method. It is important to point out that our QSMPM addresses the following key issues. First, our method directly generates a nonlinear (quadratic) hypersurface without a kernel function, so there is no need to select an appropriate kernel. Second, our method does not need any parameters to be chosen. Third, the quadratic hypersurface obtained by our method has better interpretability than one obtained by methods with a kernel function. Fourth, it is rather flexible, because the quadratic hypersurface obtained by our method can be of any general quadratic form. In our experiments, the results on five artificial datasets showed that the proposed method can find the general form of the quadratic surface and also has the ability to obtain a linear separating hyperplane. Numerical experiments on 14 benchmark datasets verified that the proposed method was superior to the corresponding methods in both accuracy and CPU time. What is more gratifying is that when the number of samples or the dimension is relatively large, our method can still obtain good classification performance quickly.
In addition, the results of the Friedman test and Nemenyi post-hoc test indicated that our QSMPM was statistically the best one compared to other methods.
The rest of this paper is organized as follows. Section 2 briefly reviews the related works, the QSSVM, and the MPM. Section 3 presents our method QSMPM, gives its algorithm, and analyzes the computational complexity of the QSMPM. In Section 4, we show the interpretability of our method. In Section 5, the results of the numerical experiments on the artificial datasets and benchmark datasets are presented, and a further statistical analysis is performed. Finally, Section 6 gives the conclusion and future work of this paper.
Throughout this paper, we use lower case letters to represent scalars, lower case bold letters to represent vectors, and upper case bold letters to represent matrices. R denotes the set of real numbers. R^d denotes the space of d-dimensional vectors. R^{d×d} denotes the space of d × d matrices. S^d denotes the set of d × d symmetric matrices. S^d_+ denotes the set of d × d symmetric positive semidefinite matrices. I_d denotes the d × d identity matrix. ‖x‖₂ denotes the two-norm of the vector x.

Related Work
In this section, we briefly introduce the QSSVM and the MPM. For a binary classification problem, the training set is given as:

T = {(x_1, y_1), (x_2, y_2), . . . , (x_{m_+ + m_−}, y_{m_+ + m_−})},    (1)

where x_i ∈ R^d is the i-th sample and y_i ∈ {+1, −1} is the corresponding class label, i = 1, 2, . . . , m_+ + m_−. The numbers of samples in class +1 and class −1 are m_+ and m_−, respectively. For the training set (1), we want to find a hyperplane or quadratic hypersurface:

g(x) = 0,    (2)

and then use the decision function:

y = sgn(g(x)),    (3)

to determine whether a new sample x ∈ R^d is assigned to class +1 or class −1.

Quadratic Surface Support Vector Machine
We first briefly outline the quadratic surface support vector machine (QSSVM) [26]. For the given training set (1), the goal of the QSSVM is to seek a quadratic separating hypersurface:

(1/2) x^T A x + b^T x + c = 0,    (4)

where A ∈ S^d, b ∈ R^d, c ∈ R, which separates the samples into two classes with the largest margin. In order to obtain the quadratic hypersurface (4), the QSSVM establishes the following optimization problem:

min_{A, b, c}  Σ_{i=1}^{m_+ + m_−} ‖A x_i + b‖₂²
s.t.  y_i ((1/2) x_i^T A x_i + b^T x_i + c) ≥ 1,  i = 1, 2, . . . , m_+ + m_−.    (5)

The optimization problem (5) is a convex quadratic programming problem. After obtaining the optimal solution A*, b*, and c* of the optimization problem (5), for a given new sample x ∈ R^d, its label is assigned to either class +1 or class −1 by the decision function:

y = sgn((1/2) x^T A* x + (b*)^T x + c*).    (6)

To allow some samples in the training set (1) to be misclassified, Luo et al. further proposed the soft-margin quadratic surface support vector machine (SQSSVM); please refer to [27].
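As a minimal sketch of the quadratic decision rule (6), the sign of g(x) = (1/2)xᵀAx + bᵀx + c assigns the label. The function name and the toy surface below are our own choices, not part of the original method:

```python
import numpy as np

def quadratic_decision(x, A, b, c):
    """Assign a label by the sign of the quadratic score g(x) = 0.5*x^T A x + b^T x + c."""
    score = 0.5 * x @ A @ x + b @ x + c
    return 1 if score >= 0 else -1

# Toy illustration: the unit circle x1^2 + x2^2 = 1 as the separating hypersurface,
# i.e. A = 2*I (so 0.5*x^T A x = ||x||^2), b = 0, c = -1.
A = 2.0 * np.eye(2)
b = np.zeros(2)
c = -1.0
print(quadratic_decision(np.array([0.2, 0.3]), A, b, c))  # inside the circle -> -1
print(quadratic_decision(np.array([2.0, 0.0]), A, b, c))  # outside the circle -> 1
```

The same rule covers the linear case: taking A = 0 reduces (6) to an ordinary hyperplane classifier.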

Minimax Probability Machine
Now, we briefly review the minimax probability machine (MPM) [8,9]. Let us leave the training set (1) aside for a moment and suppose that these samples have some distribution. Specifically, assume that the samples in class +1 are drawn from a distribution with the mean vector µ_+ ∈ R^d and the covariance matrix Σ_+ ∈ S^d_+, without making other specific distributional assumptions. A similar assumption is also made for the samples in class −1, with the mean vector µ_− ∈ R^d and the covariance matrix Σ_− ∈ S^d_+. Denote the two distributions as x_+ ∼ (µ_+, Σ_+) and x_− ∼ (µ_−, Σ_−), respectively. Based on the above assumptions, the MPM attempts to obtain a separating hyperplane:

w^T x = b,    (7)

where w ∈ R^d, b ∈ R, which separates the two classes of samples with maximal probability with respect to all distributions having these mean vectors and covariance matrices. This is expressed as:

max_{w, b, α}  α
s.t.  inf_{x_+ ∼ (µ_+, Σ_+)} Pr{w^T x_+ ≥ b} ≥ α,
      inf_{x_− ∼ (µ_−, Σ_−)} Pr{w^T x_− ≤ b} ≥ α,    (8)

where α ∈ (0, 1) represents the lower bound of the accuracy for future data, namely the worst-case accuracy. The infimum "inf" is taken over all distributions having these mean vectors µ_± ∈ R^d and covariance matrices Σ_± ∈ S^d_+. The constraints of the above optimization problem (8) are probabilistic constraints, which are difficult to handle. In order to convert the probabilistic constraints into tractable ones, the following lemma [9] is given.

Lemma 1. Let x be a d-dimensional random vector with mean vector µ and covariance matrix Σ, where Σ ∈ S^d_+. Given w ∈ R^d, b ∈ R, such that w^T x ≤ b and α ∈ (0, 1), the condition:

inf_{x ∼ (µ, Σ)} Pr{w^T x ≤ b} ≥ α    (9)

holds if and only if:

b − w^T µ ≥ κ(α) √(w^T Σ w),    (10)

where κ(α) = √(α/(1 − α)).
Using the above Lemma 1, the optimization problem (8) is equivalent to:

max_{w, b, α}  α
s.t.  w^T µ_+ − b ≥ κ(α) √(w^T Σ_+ w),
      b − w^T µ_− ≥ κ(α) √(w^T Σ_− w).    (11)

Then, through a series of algebraic operations (see Theorem 2 in [9] for the details), the above optimization problem (11) leads to:

min_w  ‖Σ_+^{1/2} w‖₂ + ‖Σ_−^{1/2} w‖₂
s.t.  w^T (µ_+ − µ_−) = 1.    (12)

When its optimal solution w* is obtained, for the optimization problem (11), the optimal solution with respect to b is given by:

b* = (w*)^T µ_+ − κ(α*) √((w*)^T Σ_+ w*),    (13)

where κ(α*) = 1/(‖Σ_+^{1/2} w*‖₂ + ‖Σ_−^{1/2} w*‖₂). Now, let us return to the training set (1). It is easy to see that the required mean vectors µ_± ∈ R^d and covariance matrices Σ_± ∈ S^d_+ can be estimated from the training set (1) as follows:

µ̂_± = (1/m_±) Σ_{y_i = ±1} x_i,  Σ̂_± = (1/m_±) Σ_{y_i = ±1} (x_i − µ̂_±)(x_i − µ̂_±)^T.    (14)

Therefore, in practice, the mean vectors µ_± and covariance matrices Σ_± in (12)-(14) should be replaced by µ̂_± and Σ̂_±, and the optimal solutions of w and b thus obtained are denoted as ŵ* and b̂*. Then, for a given new sample x ∈ R^d, its label is assigned to either class +1 or class −1 by the decision function:

y = sgn((ŵ*)^T x − b̂*).    (15)

In addition, for nonlinear cases and more details, please refer to [8,9].
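The reduced problem (12) and the offset (13) can be checked numerically with a generic constrained solver. The sketch below uses our own function name, assumes positive definite covariance matrices so that Cholesky factors exist, and relies on the identity ‖Σ^{1/2}w‖₂ = ‖Lᵀw‖₂ for Σ = LLᵀ:

```python
import numpy as np
from scipy.optimize import minimize

def mpm_linear(mu_p, mu_n, Sig_p, Sig_n):
    """Solve problem (12): min ||Sig_p^{1/2} w|| + ||Sig_n^{1/2} w||
    subject to w^T (mu_p - mu_n) = 1, then recover b and alpha."""
    Lp, Ln = np.linalg.cholesky(Sig_p), np.linalg.cholesky(Sig_n)
    obj = lambda w: np.linalg.norm(Lp.T @ w) + np.linalg.norm(Ln.T @ w)
    cons = {"type": "eq", "fun": lambda w: w @ (mu_p - mu_n) - 1.0}
    # Feasible starting point: the minimum-norm solution of the constraint.
    w0 = (mu_p - mu_n) / np.dot(mu_p - mu_n, mu_p - mu_n)
    w = minimize(obj, w0, constraints=[cons]).x
    kappa = 1.0 / obj(w)                              # optimal kappa(alpha*)
    b = w @ mu_p - kappa * np.sqrt(w @ Sig_p @ w)     # offset, Formula (13)
    alpha = kappa**2 / (1.0 + kappa**2)               # worst-case accuracy
    return w, b, alpha

# Symmetric toy case: both classes isotropic, means at (+1, 0) and (-1, 0).
w, b, alpha = mpm_linear(np.array([1.0, 0.0]), np.array([-1.0, 0.0]),
                         np.eye(2), np.eye(2))
print(np.round(w, 3), round(b, 3), round(alpha, 3))
```

In this symmetric case the separating hyperplane is the vertical axis (b = 0) and the worst-case accuracy is 0.5, matching the intuition that unit-variance classes two units apart cannot be guaranteed more than that under arbitrary distributions.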

Kernel-Free Quadratic Surface Minimax Probability Machine
In this section, we first formulate the kernel-free quadratic surface minimax probability machine (QSMPM). Then, its algorithm is given.

Optimization Problem
For the binary classification problem with the training set (1), we attempt to find a quadratic separating hypersurface:

(1/2) x^T A x + b^T x + c = 0,    (17)

where A ∈ S^d, b ∈ R^d, c ∈ R, which separates the two classes of samples. Inspired by the MPM, we construct the following optimization problem:

max_{A, b, c, α}  α
s.t.  inf_{x_+ ∼ (µ_+, Σ_+)} Pr{(1/2) x_+^T A x_+ + b^T x_+ + c ≥ 0} ≥ α,
      inf_{x_− ∼ (µ_−, Σ_−)} Pr{(1/2) x_−^T A x_− + b^T x_− + c ≤ 0} ≥ α,    (18)

where α ∈ (0, 1) represents the lower bound of the accuracy for future data, namely the worst-case accuracy. The notation x_+ ∼ (µ_+, Σ_+) refers to the class distribution that has the prescribed mean vector µ_+ ∈ R^d and covariance matrix Σ_+ ∈ S^d_+, but is otherwise arbitrary, and likewise for x_−.
The above optimization problem (18) corresponds to the optimization problem (8), which was used to derive the optimization problem (11). Analogously, the optimization problem (18) should be used to derive the required optimization problem. Unfortunately, there is no counterpart of Lemma 1 when the functions in the probabilistic constraints of the optimization problem (18) are quadratic. In order to overcome this difficulty, we change the quadratic functions into linear functions by introducing a nonlinear transformation from R^d to R^{(d²+3d)/2}. By representing the upper triangular entries of the symmetric matrix A = (a_ij) ∈ S^d as a vector:

a = (a_11, a_12, . . . , a_1d, a_22, . . . , a_2d, . . . , a_dd)^T,

and defining:

w = (a^T, b^T)^T,  z = ((1/2)x_1², x_1 x_2, . . . , x_1 x_d, (1/2)x_2², . . . , x_2 x_d, . . . , (1/2)x_d², x_1, . . . , x_d)^T,    (19)

the quadratic function (17) of x in the d-dimensional space yields a linear function of z in the (d²+3d)/2-dimensional space as follows:

(1/2) x^T A x + b^T x + c = w^T z + c.

Following the transformation (19), the training set (1) in the d-dimensional space correspondingly becomes:

T_z = {(z_1, y_1), (z_2, y_2), . . . , (z_{m_+ + m_−}, y_{m_+ + m_−})},    (24)

where z_i is the image of x_i under the transformation (19). For the training set (24), it is naturally assumed that the samples of the two classes are sampled from z_+ ∼ (µ_{z+}, Σ_{z+}) and z_− ∼ (µ_{z−}, Σ_{z−}), respectively, where the mean vectors µ_{z±} ∈ R^{(d²+3d)/2} and covariance matrices Σ_{z±} ∈ S^{(d²+3d)/2}_+ can be estimated as:

µ̂_{z±} = (1/m_±) Σ_{y_i = ±1} z_i,  Σ̂_{z±} = (1/m_±) Σ_{y_i = ±1} (z_i − µ̂_{z±})(z_i − µ̂_{z±})^T.    (25)

Based on the transformation (19), the optimization problem (18) is replaced by:

max_{w, c, α}  α
s.t.  inf_{z_+ ∼ (µ_{z+}, Σ_{z+})} Pr{w^T z_+ + c ≥ 0} ≥ α,
      inf_{z_− ∼ (µ_{z−}, Σ_{z−})} Pr{w^T z_− + c ≤ 0} ≥ α.    (26)

Now, Lemma 1 [9] is applicable to the optimization problem (26). Thus, we have:

max_{w, c, α}  α
s.t.  w^T µ_{z+} + c ≥ κ(α) √(w^T Σ_{z+} w),
      −w^T µ_{z−} − c ≥ κ(α) √(w^T Σ_{z−} w),    (27)

where κ(α) = √(α/(1 − α)). Moreover, a series of algebraic operations shows that the above optimization problem (27) is equivalent to the following second-order cone programming problem:

min_w  ‖Σ_{z+}^{1/2} w‖₂ + ‖Σ_{z−}^{1/2} w‖₂
s.t.  w^T (µ_{z+} − µ_{z−}) = 1.    (28)

When its optimal solution w* is obtained, for the optimization problem (27), the optimal solution with respect to c is given by:

c* = −(w*)^T µ_{z+} + κ(α*) √((w*)^T Σ_{z+} w*),    (29)

or:

c* = −(w*)^T µ_{z−} − κ(α*) √((w*)^T Σ_{z−} w*),    (30)

where κ(α*) = 1/(‖Σ_{z+}^{1/2} w*‖₂ + ‖Σ_{z−}^{1/2} w*‖₂). In the next subsection, we show how to solve the optimization problem (28).
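The nonlinear transformation (19) is easy to sketch in code. The ordering and scaling of the monomials in z below is one consistent choice (the display of the vector a is truncated in the text above), picked so that wᵀz(x) reproduces (1/2)xᵀAx + bᵀx exactly; function names are ours:

```python
import numpy as np
from itertools import combinations_with_replacement

def lift(x):
    """z(x): upper-triangular monomials x_i * x_j followed by x itself.
    dim(z) = d(d+1)/2 + d = (d^2 + 3d)/2."""
    d = len(x)
    return np.array([x[i] * x[j] for i, j in combinations_with_replacement(range(d), 2)]
                    + list(x))

def pack(A, b):
    """w = (a^T, b^T)^T such that w^T z(x) = 0.5 * x^T A x + b^T x.
    Diagonal entries carry the factor 1/2 from the quadratic form; off-diagonal
    entries appear once because x_i x_j and x_j x_i are merged."""
    d = len(b)
    a = [0.5 * A[i, i] if i == j else A[i, j]
         for i, j in combinations_with_replacement(range(d), 2)]
    return np.array(a + list(b))

# Sanity check on random data: the lifted linear function matches the quadratic one.
rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
A = M + M.T                      # a symmetric matrix in S^d
b = rng.standard_normal(d)
x = rng.standard_normal(d)
w = pack(A, b)
assert len(lift(x)) == (d * d + 3 * d) // 2
assert np.isclose(w @ lift(x), 0.5 * x @ A @ x + b @ x)
```

With this lifting, training the QSMPM amounts to running the linear MPM machinery of Section 2 on the lifted samples z_i.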

Algorithm
Now, we present the solving process of the optimization problem (28), which follows [9]. We construct an orthogonal matrix F ∈ R^{((d²+3d)/2) × ((d²+3d−2)/2)} whose columns span the subspace of vectors orthogonal to µ̂_{z+} − µ̂_{z−} ∈ R^{(d²+3d)/2}, and set w_0 = (µ̂_{z+} − µ̂_{z−})/‖µ̂_{z+} − µ̂_{z−}‖₂², so that every feasible point of (28) can be written as w = w_0 + F u. The optimization problem (28) is thus transferred to the unconstrained optimization problem:

min_u  ‖Σ̂_{z+}^{1/2}(w_0 + F u)‖₂ + ‖Σ̂_{z−}^{1/2}(w_0 + F u)‖₂.    (31)

In order to solve the above optimization problem (31), Lanckriet et al. [9] introduced two extra variables β and η and considered the following optimization problem:

min_{u, β, η}  (β/2 + (1/(2β)) ‖Σ̂_{z+}^{1/2}(w_0 + F u)‖₂²) + (η/2 + (1/(2η)) ‖Σ̂_{z−}^{1/2}(w_0 + F u)‖₂²).    (32)

This optimization problem (32) is solved by an alternating iteration. The variables are divided into two sets: one is β and η, and the other is u. At the t-th iteration, first, by fixing β and η and setting the derivative of the objective of (32) with respect to u to zero, we have the following update formula for u_t:

u_t = −(F^T G_{t−1} F)^{−1} F^T G_{t−1} w_0,  where  G_{t−1} = Σ̂_{z+}/β_{t−1} + Σ̂_{z−}/η_{t−1}.    (33)

To ensure stability, the regularization term δ I_{(d²+3d−2)/2} (δ > 0) is added. Therefore, Equation (33) is replaced by:

u_t = −(F^T G_{t−1} F + δ I_{(d²+3d−2)/2})^{−1} F^T G_{t−1} w_0.    (34)

Next, by fixing u and setting the derivatives of the objective of (32) with respect to β and η to zero, we have the following update formulas for β_t and η_t:

β_t = ‖Σ̂_{z+}^{1/2}(w_0 + F u_t)‖₂,  η_t = ‖Σ̂_{z−}^{1/2}(w_0 + F u_t)‖₂.    (35)

When the optimal solution u* is obtained by the above two update Formulas (34) and (35), the optimal solution w* of the optimization problem (28) is w* = w_0 + F u*. Then, we summarize the process of finding the optimal solution A*, b*, c* of the optimization problem (18) in Algorithm 1.
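The alternating iteration above can be sketched as follows. Function names are ours; scipy's `null_space` supplies the matrix F, and δ regularizes the linear solve as in (34):

```python
import numpy as np
from scipy.linalg import null_space

def mpm_iterative(mu_p, mu_n, Sig_p, Sig_n, delta=1e-8, iters=50):
    """Alternating-iteration solver for problem (28)/(31), following Lanckriet et al.
    Parameterizing w = w0 + F u automatically enforces w^T (mu_p - mu_n) = 1."""
    dmu = mu_p - mu_n
    w0 = dmu / (dmu @ dmu)                  # a particular solution of the constraint
    F = null_space(dmu[None, :])            # columns orthogonal to mu_p - mu_n
    u = np.zeros(F.shape[1])
    for _ in range(iters):
        w = w0 + F @ u
        beta = np.sqrt(w @ Sig_p @ w)       # update (35)
        eta = np.sqrt(w @ Sig_n @ w)
        G = Sig_p / max(beta, 1e-12) + Sig_n / max(eta, 1e-12)
        # Regularized least-squares update (34): minimize (w0 + F u)^T G (w0 + F u).
        u = -np.linalg.solve(F.T @ G @ F + delta * np.eye(F.shape[1]),
                             F.T @ G @ w0)
    return w0 + F @ u

# Symmetric toy case: the optimal w is (mu_p - mu_n)/||mu_p - mu_n||^2 = (0.5, 0).
w = mpm_iterative(np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.eye(2), np.eye(2))
```

For the QSMPM, the same routine is applied to the lifted means and covariances µ̂_{z±} and Σ̂_{z±} of (25).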
After obtaining the optimal solution A*, b*, and c* of the optimization problem (18), for a given new sample x ∈ R^d, its label is assigned to either class +1 or class −1 by the decision function:

y = sgn((1/2) x^T A* x + (b*)^T x + c*).    (36)

It should be pointed out that our QSMPM is kernel-free, which avoids the time-consuming task of selecting an appropriate kernel function and its corresponding parameters. What is more, it does not require the choice of any parameter, which makes it simpler and more convenient to use. Furthermore, from the geometric point of view, the quadratic hypersurface (17) determined by our method is allowed to be any general form of quadratic hypersurface, including hyperplanes, hyperparaboloids, hyperspheres, hyperellipsoids, hyperhyperboloids, and so on, which is shown clearly by the five artificial examples in Section 5.

Computational Complexity
Here, we analyze the computational complexity of our QSMPM. Suppose that the number and the dimension of the samples are N and d, respectively. Before reformulating the QSMPM as an SOCP problem, all d-dimensional samples need to be projected into the (d²+3d)/2-dimensional space. Therefore, the total computational complexity of the QSMPM is O(((d²+3d)/2)³ + N((d²+3d)/2)² + Nd²). In addition, we give the computational complexity of the MPM and the SVM. Their complexities are O(d³ + Nd²) [9] and O(N³) [19], respectively. Then, by referencing the computational complexity of the SVM, we obtain that the computational complexity of the QSSVM is O(N³ + Nd²). According to the above analysis, assuming that N is much larger than d, we can see that the computational complexity of the QSMPM is higher than that of the MPM, but lower than that of the SVM and the QSSVM.

The Interpretability
In this section, we discuss the interpretability of our method QSMPM. Suppose we have obtained the optimal solution A*, b*, c* of the optimization problem (18); then, the quadratic hypersurface (17) has the following component form:

Σ_{i=1}^{d} (1/2) a*_ii [x]_i² + Σ_{i=1}^{d} Σ_{j=i+1}^{d} a*_ij [x]_i [x]_j + Σ_{i=1}^{d} b*_i [x]_i + c* = 0,    (37)

where [x]_i is the i-th component of the vector x ∈ R^d, a*_ij is the component in the i-th row and the j-th column of the matrix A* ∈ S^d, and b*_i is the i-th component of the vector b* ∈ R^d. Each component of x contributes a quadratic polynomial function. Specifically, b*_i is the linear effect coefficient of the i-th component, a*_ii is the quadratic effect coefficient of the i-th component, and a*_ij (i ≠ j) is the interaction coefficient between the i-th component and the j-th component. Therefore, for the i-th component of x, the larger |a*_ii| + Σ_{j≠i} |a*_ij| + |b*_i| is, the greater the contribution of the i-th component is. In particular, when |a*_ii| + Σ_{j≠i} |a*_ij| + |b*_i| = 0, the i-th component of x does not work at all. Therefore, compared with the methods with the kernel function, the QSMPM has better interpretability.
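The contribution score described above is straightforward to compute from a fitted A* and b*; the sketch below (our own function name and toy coefficients) scores each feature and flags irrelevant ones:

```python
import numpy as np

def feature_contributions(A, b):
    """Score the i-th feature by |a_ii| + sum_{j != i} |a_ij| + |b_i|;
    a zero score means the feature does not enter the hypersurface at all."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    off_diag = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))  # sum_{j != i} |a_ij|
    return np.abs(np.diag(A)) + off_diag + np.abs(b)

# Toy fitted coefficients: the third feature has no quadratic, interaction,
# or linear term, so its score is exactly zero.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
b = np.array([1.0, 0.0, 0.0])
print(feature_contributions(A, b))  # third feature scores 0 -> irrelevant
```

Such per-feature scores are exactly what a kernelized decision function cannot provide, since its coefficients live in an implicit feature space.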

Numerical Experiments
In this section, we provide some numerical experiments to verify the performance of our QSMPM. We compared it with the MPM with the linear kernel (MPM−L), the polynomial kernel (MPM−P), and the RBF kernel (MPM−R), as well as with the hard-margin SVM (H−SVM) and the soft-margin SVM (S−SVM) with the same kernels. In addition, we also compared it with the QSSVM and the SQSSVM. In all numerical experiments, the penalty parameter C in the S−SVM and the kernel parameter σ of the RBF kernel were selected from {2^{−7}, 2^{−6}, . . . , 2^{7}} by the 10-fold cross-validation method. All numerical experiments were conducted using MATLAB R2016(b) on a computer equipped with a 2.50 GHz (I5-4210U) CPU and 4 GB of available memory.

Example 1. The samples of the two classes were generated with noise terms ξ_i ∼ N(0, 1). Figure 1 illustrates the classification results of the MPM−L, the MPM−P, the MPM−R, and the QSMPM on Example 1, respectively. We can see from Figure 1 that our QSMPM can obtain classification results as good as the other three methods. In addition, the quadratic hypersurface found by our QSMPM is a straight line, that is a linear separating hyperplane.

Example 4. Figure 4 shows the classification results on Example 4. We can observe in Figure 4 that the QSMPM can obtain the same classification performance as the MPM−P and the MPM−R and is better than the MPM−L. Our QSMPM can find an ellipse.

Example 5. We can see from Figure 5 that the classification performance of the QSMPM is better than the MPM−L and is similar to the MPM−P and the MPM−R. In addition, our QSMPM finds a hyperbola.
In summary, from Figure 1 to Figure 5, we can see that our QSMPM can find any general form of the quadratic hypersurface, such as the line, parabola, circle, ellipse, and hyperbola found in sequence in the above numerical experiments. Moreover, our method can achieve as good classification performance as the MPM−P and the MPM−R. In addition, it can be seen from Figure 1d that our method can obtain the linear separating hyperplane.

Benchmark Datasets
To verify the classification performance and computational efficiency of our QSMPM, we performed the following numerical experiments on 14 benchmark datasets. Table 1 summarizes the basic information of the 14 benchmark datasets from the UCI Machine Learning Repository. It can be seen from Table 2 that, compared with the other methods, our QSMPM obtained better accuracy on the first group of benchmark datasets, achieving the best accuracy on four of them. More specifically, except for Haberman and Bupa, the accuracy of our method was the best compared to the QSSVM and the SQSSVM. The accuracy of our QSMPM was the best compared to the three original kernel versions of the MPM except for Bupa. Furthermore, the accuracy of our method was the best compared to the H−SVM and the S−SVM with the three kernel functions except for Heart and Haberman. In addition, we can observe that the QSMPM had a short CPU time.
Then, the classification results on the second group are reported in Table 3. The symbol "−" indicates that the corresponding method could not obtain a classification result, either because it could not choose the optimal parameter in a limited amount of time or because the dimension and the number of samples of the dataset were relatively large, resulting in insufficient memory. From Table 3, we can see that our QSMPM had good classification results on the second group of benchmark datasets. Especially on QSAR and Turkiye, the H−SVM−R, the three kernel versions of the S−SVM, the QSSVM, and the SQSSVM could not obtain the corresponding classification results, but our QSMPM could obtain good classification performance. Here, we mention the reason for this situation. According to the computational complexity of each method, we know that when the sample dimension and the number of samples are relatively large, the SVM and the QSSVM need a larger memory space. In addition, our QSMPM had the fastest running time except for the MPM−L, and it ran quite fast when the number of samples or the dimension was large.

Statistical Analysis
To further compare the performance of the above 12 methods, the Friedman test and a post-hoc test were performed. The ranks of the 12 methods on all benchmark datasets are shown in Table 4.
First, the Friedman test was used to compare the average ranks of the different methods. The null hypothesis states that all methods have the same performance, that is, their average ranks are the same. Based on the average ranks displayed in Table 4, we can calculate the Friedman statistic τ_F by the following formula:

τ_χ² = (12N/(k(k+1))) (Σ_{i=1}^{k} r_i² − k(k+1)²/4),  τ_F = ((N − 1) τ_χ²)/(N(k − 1) − τ_χ²),    (38)

where N and k are the numbers of datasets and methods, respectively, and r_i is the average rank of the i-th method. According to Formula (38), τ_F = 4.1825. For α = 0.05, we can obtain F_α = 1.8526. Since τ_F > F_α, we rejected the null hypothesis. Then, we proceeded with a post-hoc test (the Nemenyi test) to find out which methods differed significantly. To be more specific, the performance of two methods was considered to be significantly different if the difference of their average ranks was larger than the critical difference (CD). The CD can be calculated by:

CD = q_α √(k(k+1)/(6N)).    (39)

For α = 0.05, we know q_α = 3.2680. Thus, we obtained CD = 4.4535 by Formula (39). Figure 6 visually displays the results of the Friedman test and Nemenyi post-hoc test, where the average rank of each method is marked along an axis. The axis is turned so that the lowest (best) ranks are to the right. Groups of methods that are not significantly different are linked by a red line. In Figure 6, we can see that our QSMPM was statistically the best one among the compared methods. Furthermore, there was no significant difference in performance between the QSMPM and the MPM−R.
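Both test statistics are standard and easy to reproduce. The sketch below (our own function names) implements the Friedman statistic (38) and the Nemenyi critical difference (39), and recovers CD = 4.4535 for k = 12 methods and N = 14 datasets:

```python
import math

def friedman_stat(avg_ranks, N):
    """Friedman statistic tau_F (the F-distributed variant) from average ranks,
    as in Formula (38)."""
    k = len(avg_ranks)
    chi2 = 12.0 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                       - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

def nemenyi_cd(q_alpha, k, N):
    """Critical difference of the Nemenyi post-hoc test, Formula (39)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

# With k = 12 methods, N = 14 datasets, and q_0.05 = 3.2680 as in the text:
print(round(nemenyi_cd(3.2680, 12, 14), 4))  # -> 4.4535
```

As a sanity check, feeding identical average ranks (e.g., all equal to (k+1)/2) into `friedman_stat` yields exactly zero, consistent with the null hypothesis of equal performance.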

Conclusions
For the binary classification problem, a new classifier, called the kernel-free quadratic surface minimax probability machine (QSMPM), was proposed by using the kernel-free techniques of the QSSVM and the classification idea of the MPM. Specifically, our goal was to find a quadratic hypersurface that separates two classes of samples with maximum probability. However, the optimization problem derived directly was too difficult to solve. Therefore, a nonlinear transformation was introduced to change the quadratic function involved into a linear function. Through such processing, our optimization problem finally became a second-order cone programming problem, which was solved efficiently by an alternating iteration method. Here, we clarify the main contributions of this paper. Unlike other methods realizing nonlinear separation, our method is kernel-free and has better interpretability. Moreover, our method is easy to use because it does not have any parameters. Furthermore, numerical experiments on five artificial datasets showed that the quadratic hypersurfaces found by our method are rather general, including the ability to obtain a linear separating hyperplane. In addition, numerical experiments on benchmark datasets confirmed that the proposed method was superior to some relevant methods in both accuracy and computational time. Especially when the number of samples or the dimension was relatively large, our method could still quickly obtain good classification performance. Finally, the results of the statistical analysis showed that our QSMPM was statistically the best one compared with the corresponding methods. Our QSMPM focuses on the standard binary classification problem; we will extend it to the multiclass classification problem.
In our future work, there are several issues to address in extending our QSMPM. For example, we need to investigate further how to add appropriate regularization terms to our method. Meanwhile, it will be interesting to consider the case where the worst-case accuracies for the two classes are not the same. Furthermore, we will pay attention to how the QSMPM can achieve the dual purpose of feature selection and classification simultaneously. In addition, we can apply our method to practical problems in many fields in the future, especially image recognition in the medical field.