Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Imbalanced data exist in many classification problems, and classifying them poses remarkable challenges in machine learning. Among the available classifiers, the support vector machine (SVM) and its variants are popular thanks to their flexibility and interpretability. However, the performance of SVMs deteriorates when the data are imbalanced, which is a typical data structure in multi-category classification problems. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances from a multi-class population. We propose a multi-class data-dependent kernel function for the SVM that accounts for class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superior performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of commonly used accuracy measures such as the F-score and G-mean, but it also successfully detects more than 60% of the instances from the rare class in the real data, while the competitors can detect less than 20% of the rare class instances. The proposed method may benefit other scientific research fields, such as multiple region boundary detection.


Introduction
One of the typical problems in data mining and machine learning is to classify new instances on the basis of observed ones. A common classification problem is separating two classes using a decision rule estimated from the training data. However, multi-class situations have been increasingly seen in various scientific areas, including disease diagnosis in medical research [1], artificial intelligence [2], users' preferences in recommendation systems [3], and risk evaluation in social sciences [4]. Accordingly, techniques are either derived from binary classifiers or proposed specifically for multi-category classification problems. One of the most powerful classifiers is the support vector machine (SVM) [5], which shows superior performance in many real applications [6] and is known for its excellent performance in both small and large samples, its robustness to outliers, and its ease of interpretation.
The most popular framework for dealing with multi-category classification problems is to decompose them into a series of binary classifications to which regular binary classifiers can be directly applied. Examples include the well-known one-versus-one [7] and one-versus-all [5] techniques. In particular, for a k-category classification case under the SVM framework, the least squares SVM (LS-SVM) [8] method was extended to the multi-class case [9]. To overcome the drawback of the original LS-SVM that the decision function is constructed from most of the training samples, referred to as the non-sparseness problem, Xia and Li [10] developed a new multi-class LS-SVM algorithm whose solution can be sparse in the weight coefficients of the support vectors. Fung and Mangasarian [11] followed the idea of the proximal SVM (PSVM) in [12] and extended the PSVM to the multi-class case. For each decomposed sub-classification problem, the solution is similar to its binary case: new samples are classified by allocating them to the closer of the two parallel planes. This PSVM method turns out to be closely aligned with the one-versus-all method. Zhang et al. [13] extended the PSVM method to include an adaptive kernel function, which magnifies the resolution on each boundary based on weighting factors that can be obtained from a Chi-square distribution. However, its adaptively scaled kernel depends on a squared distance, which may not be reliable [1], and the decay rate for each class is constant. Following the idea of Crammer and Singer [2], He et al. [14] proposed a simplified multi-class support vector machine with a reduced dual optimization. Their method suffers from a heavy computational burden. He et al. [14] also presented a simplified multi-class SVM that reduces the size of the resulting dual optimization by introducing a relaxed classification error bound, which speeds up the training process without sacrificing classification accuracy.
However, an imbalance issue usually arises in real applications such as cancer research, especially when dealing with multi-category classification. That is, some minority classes may contain only very few instances in the training sample, either because the data have two naturally unequal categories or because the one-versus-all strategy is used in multi-class cases. Learning from imbalanced data turns out to be remarkably challenging in the field of data mining with big data [6]. Many fields have recognized the need for accurate classifiers for imbalanced data [15], including the detection of rare but serious diseases such as cancers in medical science, fraud detection in accounting [16], and risk evaluation in economics [4]. Many commonly used binary classifiers may show only limited predictive power for the minority class when severe imbalances exist [17]. Indeed, this issue corresponds to the unequal distribution of the sample data across classes, where a majority of the instances belong to one specific class while the rest belong to the others. Chawla et al. [18] and Tang et al. [19] have discussed the issue and found that the SVM for multiple classes with imbalanced data can be prone to generating a classifier with a strong estimation bias towards the majority class, giving rather poor performance. Wang and Shen [20] proposed a method that can avoid the difficulties of the one-versus-all strategy by dealing with multiple classes in a joint manner. Consequently, an accurate classifier is always desired when a specific class is extremely small compared to the other classes in the training data, as in the one-versus-all case of multi-class classification.
To overcome the effect of imbalance on classifications, Liu and He [1] proposed a new method to enhance the performance of the SVM for imbalanced data by adaptively scaling the kernel function obtained from a standard SVM so that the separation between two categories can be effectively enlarged. The method also takes into account the location of the support vectors in the feature space, which makes it more appealing when the responses are from multiple classes. In this paper, we propose a new data-adaptive SVM technique for multi-class problems. A new data-adaptive kernel function is proposed for the multi-class SVM in a way that the decay rate of the scaling magnitude is more robust and can vary along with the density of the samples in the neighborhood. Not only does the method take the imbalance of data from a multi-class response into consideration, but it involves spatial association of local data instances as well. By using this adaptive kernel function, the constructed classifier shows excellent predictive power, especially for imbalanced data, with a competitive cost of time consumption. Numerical investigations demonstrate the superior performance of the proposed method, and a real image dataset is employed as an illustration.
The remainder of the paper is organized as follows. Section 2 introduces the proposed methodology for multi-category classification with class imbalance taken into account. Numerical investigation is presented in Section 3 to demonstrate the superb prediction accuracy of the proposed method compared with its competitors. Concluding remarks and discussion are described in the final section.

SVM Framework and Notation
As a general method for classification proposed by Vapnik [5], the support vector machine essentially uses a kernel function that maps the original input data space into a high-dimensional feature space so that the instances from two classes are as far apart as possible, preferably separable by a linear boundary in the feature space.
To start with, we consider the binary case. Given a sample $\{x_i, y_i\}$, $i = 1, \ldots, n$, where $x_i$ is a vector of predictors in the input space $I = \mathbb{R}^p$ and $y_i \in \{+1, -1\}$ represents the class index, a nonlinear support vector machine maps the input data $x = \{x_1, \ldots, x_n\}$ into a high-dimensional feature space $F = \mathbb{R}^l$ using a nonlinear mapping function $s: \mathbb{R}^p \to \mathbb{R}^l$, and finds a linear boundary in the feature space $F$ by maximizing the smallest distance of the instances to this boundary. Mathematically, the idea is equivalent to solving

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1)$$

subject to

$$y_i\left(w^T s(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$

where $C$ is the so-called soft margin parameter that determines the trade-off between the margin and the classification error, and $\xi = (\xi_1, \ldots, \xi_n)^T$ is a non-negative slack variable vector that controls misclassification. The dual of (1) is to solve

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n,$$

where the $\alpha_i$'s are the dual variables and the scalar function $K(\cdot, \cdot)$ is called a kernel function, defined as $K(x_i, x_j) = \langle s(x_i), s(x_j) \rangle$ with $\langle \cdot, \cdot \rangle$ being the inner product operator. Denote by $SV$ the index set of the support vectors, $\{j \mid \alpha_j > 0, \ j = 1, 2, \ldots, n\}$. With all the observations $x_i$, $i \in SV$, the kernel form of the SVM boundary can be written as

$$\sum_{i \in SV} \alpha_i y_i K(x_i, x) + b = 0.$$

Consequently, the label of an instance $x$ is assigned by $\mathrm{sign}(D(x))$, with

$$D(x) = \sum_{i \in SV} \hat{\alpha}_i y_i K(x_i, x) + \hat{b},$$

where $\hat{a}$ represents the estimated value of $a$. Theoretically, the bias term $b_j$ is proved to be identical for all instances in $SV$ [21]. Practically, the bias term $\hat{b}$ is determined as the average of the estimated $\hat{b}_j$'s over all the support vectors, where $\hat{b}_j$ is obtained by using the j-th support vector $x_j$:

$$\hat{b}_j = y_j - \sum_{i \in SV} \hat{\alpha}_i y_i K(x_i, x_j).$$
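The kernel form of the decision function can be illustrated with a short sketch. It assumes a Gaussian kernel and uses hand-picked dual coefficients (`sv_alpha`) that are purely hypothetical, not the output of an actual quadratic programming solver; the function names are ours. It shows how the bias is averaged over the support vectors and how a label is assigned by the sign of the decision value.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def estimate_bias(sv_x, sv_y, sv_alpha, kernel):
    """Average of b_j = y_j - sum_i alpha_i y_i K(x_i, x_j) over the support vectors."""
    b_values = []
    for xj, yj in zip(sv_x, sv_y):
        fitted = sum(a * y * kernel(xi, xj) for xi, y, a in zip(sv_x, sv_y, sv_alpha))
        b_values.append(yj - fitted)
    return sum(b_values) / len(b_values)

def decision_value(x, sv_x, sv_y, sv_alpha, bias, kernel):
    """D(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * y * kernel(xi, x) for xi, y, a in zip(sv_x, sv_y, sv_alpha)) + bias

def predict_label(x, sv_x, sv_y, sv_alpha, bias, kernel):
    """Assign +1 or -1 by the sign of D(x)."""
    return 1 if decision_value(x, sv_x, sv_y, sv_alpha, bias, kernel) >= 0 else -1

# Hypothetical support vectors and dual coefficients (illustration only).
sv_x = [(0.0, 0.0), (2.0, 0.0)]
sv_y = [-1, 1]
sv_alpha = [1.0, 1.0]  # satisfies sum_i alpha_i y_i = 0

bias = estimate_bias(sv_x, sv_y, sv_alpha, rbf_kernel)
label_right = predict_label((2.5, 0.0), sv_x, sv_y, sv_alpha, bias, rbf_kernel)
label_left = predict_label((-0.5, 0.0), sv_x, sv_y, sv_alpha, bias, rbf_kernel)
```

In this symmetric toy configuration the two per-support-vector bias estimates cancel, so the averaged bias is zero and points are labeled according to the nearer support vector.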
A k-category classification problem with the class label $y_i$ taking values in $\{1, \ldots, k\}$ can generally be decomposed into a sequence of binary classification problems using the one-versus-all strategy. Specifically, the m-th binary classification, $m = 1, \ldots, k$, is set up for the sample $\{x_i, y_i^{m}\}$ with

$$y_i^{m} = 2\, I(y_i = m) - 1,$$

where $I(\cdot)$ is the indicator function. Hence, by applying the SVM procedure for binary classification, $k$ classifiers can be constructed with $k$ kernels $K_1, \ldots, K_k$, and the m-th kernel form of the SVM boundary between the m-th class and the remaining $(k - 1)$ classes can be written as

$$\sum_{i \in SV_m} \alpha_i^{m} y_i^{m} K_m(x_i, x) + b_m = 0.$$

With the estimated decision functions from all $k$ binary classifications, the final class label of an instance can be assigned using a majority voting procedure.
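The one-versus-all decomposition can be sketched as follows. The relabeling follows the text; the way ties are resolved in `assign_class` (falling back to the largest decision value when no class, or more than one class, claims the instance) is our assumption, since the voting rule is not spelled out in detail here. Function names are illustrative.

```python
def one_vs_all_labels(y, m):
    """Relabel a k-class response for the m-th binary problem: +1 if y_i == m, else -1."""
    return [1 if yi == m else -1 for yi in y]

def assign_class(decision_values):
    """Pick a class from the k one-versus-all decision values.

    decision_values maps class label m -> D_m(x). A class "votes" for the
    instance when D_m(x) > 0. If exactly one class votes, it wins; otherwise
    (no vote or several votes) we fall back to the largest D_m(x), which is
    an assumed tie-breaking rule.
    """
    voters = [m for m, d in decision_values.items() if d > 0]
    if len(voters) == 1:
        return voters[0]
    return max(decision_values, key=decision_values.get)

# Example with k = 3 classes.
binary_labels = one_vs_all_labels([1, 2, 3, 2], m=2)
clear_winner = assign_class({1: -0.5, 2: 0.3, 3: -1.2})
no_votes = assign_class({1: -0.5, 2: -0.1, 3: -1.2})
two_votes = assign_class({1: 0.4, 2: 0.9, 3: -1.0})
```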
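Quite a few typical kernels are available for the SVM procedure. One is the radial kernel $K(x, x') = f(-\|x - x'\|^2/2)$, such as the Gaussian radial basis function (RBF) kernel,

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right).$$

Another type of kernel takes the form of a function of the inner product, $K(x, x') = f(\langle x, x' \rangle)$, such as the polynomial kernel with degree $d$,

$$K(x, x') = \left(1 + \langle x, x' \rangle\right)^d.$$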
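For concreteness, the two kernel families can be written in a few lines. This is a minimal sketch assuming the common parameterizations (bandwidth `sigma` for the Gaussian kernel, offset 1 for the polynomial kernel); the helper `gram_matrix` is ours.

```python
import math

def gaussian_rbf(x, z, sigma=1.0):
    """Radial kernel: depends on x and z only through ||x - z||^2."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial(x, z, degree=2, offset=1.0):
    """Inner-product kernel: depends on x and z only through <x, z>."""
    inner = sum(a * b for a, b in zip(x, z))
    return (offset + inner) ** degree

def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 4.0)]
gram = gram_matrix(points, gaussian_rbf)
poly_value = polynomial((1.0, 2.0), (3.0, 4.0), degree=2)  # (1 + 11)^2
```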

Conformal Transformation and Adaptive Kernel Machine
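From the geometrical point of view, when the feature space $F$ is a Euclidean space, a Riemannian metric is induced in the input space $I$. Taking the two-dimensional case as an example, a small change $dx$ in the input space is mapped to $ds(x)$ in the feature space, where

$$ds(x) = \nabla s \cdot dx = \sum_{i} \frac{\partial s(x)}{\partial x_i}\, dx_i.$$

Thus, the squared length of $ds(x)$ can be written in the quadratic form

$$\|ds(x)\|^2 = \sum_{i, j} s_{ij}(x)\, dx_i\, dx_j,$$

where $s_{ij}(x) = \left(\frac{\partial s(x)}{\partial x_i}\right)^T \left(\frac{\partial s(x)}{\partial x_j}\right)$ is the $(i, j)$ entry of $(\nabla s)^T (\nabla s)$.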

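Lemma 1 ([1]). Suppose $K(p, q)$ is a kernel function, and $s(\cdot)$ is the corresponding mapping in the support vector machine. Then

$$s_{ij}(x) = \left.\frac{\partial^2 K(p, q)}{\partial p_i\, \partial q_j}\right|_{p = q = x}.$$

Detailed proof is given in Appendix A.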
Though the parameters of kernel functions can manipulate the geometric characteristics of the feature space $F$ to some degree, a conformal transformation of the original kernel function provides further adaptability. A conformal transformation is a mapping that projects the original input space to a new feature space while locally preserving the angles between vectors [1]. Define

$$\tilde{K}(x, x') = c(x)\, c(x')\, K(x, x');$$

then $\tilde{K}(x, x')$ corresponds to the mapping $\tilde{s}(x) = c(x)\, s(x)$, which may increase the separation for a properly chosen positive scalar function $c(x)$ that has larger values at the support vectors identified using the kernel $K(x, x')$. Furthermore, $\tilde{K}$ can easily be shown to satisfy the Mercer positivity condition, the sufficient condition for being a kernel function. Specifically, we employ the $L_1$-norm adaptive radial basis function (RBF) kernel proposed in [1], in which $M$ can be regarded as the distance between the nearest and the farthest support vectors under the original mapping $s(x)$. In this way, the average in the definition of the scaling function can comprise all the support vectors different from the currently considered instance in the neighborhood of $s(x)$ within the radius of $M$. This takes into account the spatial distribution of the support vectors in the feature space $F$, and hence partially reflects the spatial association of the instances in the training set. This method turns out to be robust and efficient [1].
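The conformal scaling of a kernel is easy to express as a higher-order function. The sketch below is only illustrative: the first-round decision function `decision_value` is a hypothetical linear stand-in, and the scaling function uses a constant decay `rate`, which is a placeholder and not the data-adaptive rate of [1]. The point is the structure: the scaled kernel is the product of the scaling function at both arguments and the original kernel, with the scaling function largest near the boundary where the decision value is zero.

```python
import math

def gaussian_rbf(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def make_conformal_kernel(kernel, c):
    """Return the scaled kernel (x, z) -> c(x) * c(z) * kernel(x, z)."""
    def scaled_kernel(x, z):
        return c(x) * c(z) * kernel(x, z)
    return scaled_kernel

def decision_value(x):
    # Hypothetical first-round decision function with boundary at x[0] = 1.
    return x[0] - 1.0

def scaling(x, rate=0.5):
    # L1-type decay in |D(x)|; the constant rate is an assumption for illustration.
    return math.exp(-rate * abs(decision_value(x)))

scaled = make_conformal_kernel(gaussian_rbf, scaling)

on_boundary = scaled((1.0, 0.0), (1.0, 0.0))    # c = 1 on the boundary
far_from_boundary = scaled((3.0, 0.0), (3.0, 0.0))
```

Because the scaled kernel is the inner product of the rescaled feature maps, it remains symmetric and positive semi-definite whenever the original kernel is.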

Adaptive Kernel Machine for Multi-Class Cases
To apply the adaptive kernel machine to a multi-class classification problem, we first apply the basic SVM to all $k$ classes of a training sample by employing the one-versus-all strategy, and obtain $k$ initial decision boundaries as well as the predicted labels of all instances. We then split the training sample into $k$ datasets, denoted by $S_1, S_2, \ldots, S_k$, using the predicted class labels $\hat{y}_i$ from the initial round of SVM. This step is essential for finding the approximate locations of the support vectors and the initial boundaries. Similar to the idea of conformal transformation in the binary case, the adaptive data-dependent kernel transformation function $c(x)$ in (16) is defined in terms of functions $p_m(x)$, $m = 1, \ldots, k$, of the data, which will be determined to control the decay rates and hence further affect the performance of the classifier.

Specification of Functions p m (x)
In imbalanced data classification, determining appropriate weights for each category is important so that the problem can be transferred back to an approximately balanced case. Generally, there are two requirements for the choice of weights. One is that the data in the majority class should be allocated a smaller weight than those in the minority class so that the classes are somewhat balanced in their contribution to the decision function. The other is the natural restriction that the weights should sum to 1. Essentially, for imbalanced data, the weights can be set according to the reciprocal of the class sizes in the training sample. Let $n_m$ denote the training sample size for the m-th class, $m = 1, \ldots, k$. The weights $w_m$ are then defined as in (17). In this way, the $w_m$'s reflect the sparsity of each category. Note that an $L_2$-norm is adopted when building $w_m$ in (17). Although an $L_p$-norm ($p > 0$) can be applied in general, such as the $L_1$-norm, we found that the $L_2$-norm shows the best empirical performance in real applications.
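The two requirements on the weights can be illustrated with a small sketch. The exact form of (17) is not reproduced here; the function below is an assumed stand-in that weights each class by a power of the reciprocal of its size and then normalizes so that the weights sum to one. The `power` argument is hypothetical and only loosely mirrors the choice of norm discussed above.

```python
def class_weights(sizes, power=1.0):
    """Weights proportional to (1 / n_m) ** power, normalized to sum to 1.

    This is an illustrative form, not the paper's equation (17): it only
    guarantees that smaller classes receive larger weights and that the
    weights sum to one.
    """
    raw = [(1.0 / n) ** power for n in sizes]
    total = sum(raw)
    return [r / total for r in raw]

# Class sizes of the extremely imbalanced simulation scenario.
weights = class_weights([20, 100, 480])
balanced = class_weights([200, 200, 200])
```

With balanced classes every weight equals 1/k, so the weighting has no effect; with imbalanced classes the rare class dominates the weights.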
As the $w_m$'s do not involve information about $x$, we further borrow the idea of constructing $c(x)$ in the binary case to include information from $x$. In the definition (18), $SV_m$ is the support vector set from the initial SVM with the binary SVM procedure in the m-th class, $K_m(\cdot, \cdot)$ is the kernel function adopted in the m-th binary SVM, and $s_m(\cdot)$ is its corresponding mapping function. In practice, we adopt a common kernel function $K_m(\cdot, \cdot)$, $m = 1, \ldots, k$, such as the popular Gaussian kernel function, to simplify the calculation. Consequently, $p_m(x)$ is defined as in (19) so that the influence of the class size is taken into account. Another potential choice of $p_m(x)$ is given in (20), where the tuning parameter $Q_m$ can also be regarded as the distance between the nearest and the farthest support vectors in $SV_m$ from $s_m(x)$ within the same class. When $k$ is small or moderate, this setting can be meaningful. However, when $k$ is large, the computational cost may increase since more tuning parameters need to be determined. To avoid this problem, we propose to use a universal control $Q$ while taking the weights $w_m$ into account. The final version of $p_m(x)$ is constructed as in (21). In this way, the classification can be more robust to extreme cases in the spatial distribution, which may push the classification boundaries towards the majority classes, while the weights balance the training set so that the performance of the classification is enhanced.
Some other techniques are seen in the literature, though they may show drawbacks in different situations for imbalanced data. For example, Wu and Amari [22] made some improvements by introducing different tuning parameters for different classes so that the local density of support vectors can be accommodated. Given the heavy computational cost this brings, the performance in high-dimensional cases is uncertain. Williams et al. [23] also extended their binary scaling SVM technique to the multi-class case; however, its distance tuning parameter, corresponding to the value of $Q \cdot w_m$ in our case, is fixed throughout the whole region. This inflexible setting cannot reflect local information, especially when the density of support vectors is quite high. Also, using the $L_2$-norm of $D(x)$ may lead to unstable classification performance in high-dimensional cases due to a faster decay rate to a constant $e^{-k}$ compared with our proposed method.

Data-Adaptive SVM Algorithm for Multi-Class Case
With $c(x)$ constructed in (16), we conformally transform the $k$ kernels trained from the initial round of multi-class SVM, $K_1, \ldots, K_k$, into

$$\tilde{K}_m(x, x') = c(x)\, c(x')\, K_m(x, x'),$$

where $m = 1, \ldots, k$, and $c(\cdot)$ is defined in (16) with $p_m(x)$ as in (21). $K_m(\cdot, \cdot)$ is usually set as the Gaussian kernel function during the first round of SVM. The performance of using the form in (19) is similar empirically. Based on the updated kernels, the second round of SVM is then conducted and the predictions of labels for all instances are obtained. It is seen that
1. The magnification will be almost constant along the separating surface $D(x) = 0$ for each boundary;
2. The magnification will be largest where the contours are closest locally.
(See more details in the Appendices.) Thus, as long as the parameters $C$ and $\sigma$ in the kernel machine (and the controlling parameter $Q$ if the form of $p_m(x)$ in (21) is adopted) are tuned adaptively with the data, the classifiers can be trained, and hence the subjects' labels can be predicted.
To conclude the section, the whole procedure for the multi-class classification problem is described as follows. In the first stage, a regular SVM classifier is trained with an ordinary Gaussian radial basis kernel function using the one-versus-all technique, and the support vectors are found so that the separating boundaries can be approximately determined. Based on the spatial information of the support vectors, the conformal transformations are constructed, and the original kernel functions are updated. Then a new round of SVM optimization problems is solved with the updated kernel functions so that the boundary in each one-versus-all problem can be found. Consequently, the labels of the subjects can be predicted. The whole procedure is summarized in Algorithm 1.

Algorithm 1. Multi-class data-adaptive kernel scaling support vector machine (SVM).
Input: $y_i$, $x_i$, $i = 1, \ldots, n$; a Gaussian kernel function $K(\cdot, \cdot)$
1: A regular SVM classifier is trained with an ordinary Gaussian radial basis kernel function;
2: Based on the spatial information of these support vectors, the conformal transformation is constructed, and the original kernel function is updated;
3: A new round of SVM optimization problems is conducted with the updated kernel function, and the boundaries for different classes are found;
4: The predicted class labels for instances are determined by majority voting.

Numerical Investigation
In this section, we conduct intensive numerical experiments to evaluate the performance of the proposed classification procedure and compare it with existing competitors. The study is divided into two parts, one for simulated data and the other for a real image dataset. We compare the proposed method with four existing methods: the traditional SVM and the methods of Wu and Amari [22], Williams et al. [23] and Maratea [24].
We assess the performance of the classifiers using various quantitative measures. One of them is the overall accuracy, defined by

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN},$$

where $TP$, $FN$, $FP$ and $TN$ represent the numbers of true positive, false negative, false positive and true negative instances in the test sample, respectively. However, for imbalanced data, the overall accuracy rate may not be sufficient [24]. We further adopt two other measures of classifier performance for imbalanced data, namely the F-score and the G-mean [25]. Specifically, the F-score is defined as

$$F_{score} = \frac{2 \times P_{pre} \times P_{sen}}{P_{pre} + P_{sen}},$$

and the G-mean as

$$G_{mean} = \sqrt{P_{sen} \times P_{spe}},$$

where $P_{pre}$, $P_{sen}$ and $P_{spe}$ are the precision, the sensitivity and the specificity, respectively. They are obtained by

$$P_{pre} = \frac{TP}{TP + FP}, \quad P_{sen} = \frac{TP}{TP + FN}, \quad P_{spe} = \frac{TN}{TN + FP}.$$
Note that the F-score is the harmonic mean of the precision and the sensitivity, while the G-mean is the geometric mean of the sensitivity and the specificity, giving a fairer comparison between the positive and negative classes regardless of their sizes. To further evaluate the numerical performance of the multi-category classification, we employ the multi-class ROC and the AUC measures [25].
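The measures above follow directly from the four confusion counts. The sketch below computes them for a hypothetical imbalanced test sample (the counts are made up for illustration) and shows why the overall accuracy alone can be misleading when the positive class is rare.

```python
import math

def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_measures(tp, fn, fp, tn):
    """Accuracy, precision, sensitivity, specificity, F-score and G-mean."""
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Harmonic mean of precision and sensitivity.
        "f_score": _ratio(2.0 * precision * sensitivity, precision + sensitivity),
        # Geometric mean of sensitivity and specificity.
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical counts: 40 rare-class instances among 1000 test instances.
measures = classification_measures(tp=30, fn=10, fp=20, tn=940)

# A classifier that never predicts the rare class is still 96% accurate,
# but its F-score and G-mean are both zero.
trivial = classification_measures(tp=0, fn=40, fp=0, tn=960)
```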

Simulation Study
First, we conduct simulation studies to evaluate the performance of the proposed method and compare it with the competitors in the literature. Three scenarios are considered. Each of them includes the balanced, moderately imbalanced, and extremely imbalanced cases, respectively. The Gaussian RBF kernel is employed during the first round of classification, if not mentioned elsewhere.
For convenience, the input space is 2-dimensional, and all training data are generated from three classes of bivariate Gaussian distributions with mean vectors (2, 2), (4, 3) and (3, 2), and an identical covariance matrix $\gamma \cdot \Sigma$, where $\gamma$ is a nuisance parameter that controls the overlapping proportion of the classes. In $\Sigma$, the variance of each variable is 1 and a moderate correlation coefficient $\rho = 0.3$ is used for all pairs.
The overall sample size for the training data is set as 600 and is separated into three classes by different weights in three different scenarios. The class size is (200,200,200) in Scenario 1, (100, 200, 300) in Scenario 2, and (20,100,480) in Scenario 3. In each scenario, different combinations of parameters that need to be tuned will be considered. The cost parameter C is chosen from the set {0.1, 0.2, 0.5, 1, 5, 8, 40, 100, 500} and σ takes value from the set {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100}. As Q is the threshold controlling the size of the local neighborhood, it is chosen by a grid search from the set {0.1, 0.2, . . . , 1} times the maximal Euclidean distance between all pairs of data points in the sample. All classifiers are tuned properly with respect to the corresponding measures.
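The simulation design can be sketched with the standard library alone. The generator below draws the three bivariate Gaussian classes with the stated means, unit variances and correlation 0.3, using a Cholesky factor of the scaled covariance matrix; the function name, the seed and the default value of `gamma` are our assumptions. The tuning grid is assembled from the three sets listed above, with the neighborhood threshold expressed as a fraction of the maximal pairwise distance.

```python
import itertools
import math
import random

def simulate_classes(sizes, means, rho=0.3, gamma=1.0, seed=0):
    """Draw bivariate Gaussian classes with covariance gamma * [[1, rho], [rho, 1]]."""
    rng = random.Random(seed)
    scale = math.sqrt(gamma)
    # Cholesky factor of the covariance matrix.
    l11, l21, l22 = scale, scale * rho, scale * math.sqrt(1.0 - rho ** 2)
    points, labels = [], []
    for label, (n, (mu1, mu2)) in enumerate(zip(sizes, means), start=1):
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            points.append((mu1 + l11 * z1, mu2 + l21 * z1 + l22 * z2))
            labels.append(label)
    return points, labels

means = [(2.0, 2.0), (4.0, 3.0), (3.0, 2.0)]
scenarios = {1: (200, 200, 200), 2: (100, 200, 300), 3: (20, 100, 480)}
points, labels = simulate_classes(scenarios[3], means)

# Tuning grid: cost C, kernel bandwidth sigma, and Q as a fraction of the
# maximal pairwise Euclidean distance in the sample.
cost_grid = [0.1, 0.2, 0.5, 1, 5, 8, 40, 100, 500]
sigma_grid = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100]
q_fractions = [j / 10 for j in range(1, 11)]
tuning_grid = list(itertools.product(cost_grid, sigma_grid, q_fractions))
```

Each of the scenarios produces 600 training points; only the split across the three classes changes.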
The classification procedure is as follows. First, we train the classifiers with the traditional SVM using the one-versus-all strategy, and the support vectors are identified approximately. The kernel functions for all the methods are then updated adaptively by conformal transformation with different scalar functions $c(x)$, using $p_m$ defined in (21). A second round of SVM is then conducted, and the estimated class labels for the observations in the test sample are compared with the true labels. Five-fold cross validation is employed to obtain the misclassification rate for each simulated dataset, and the whole process is repeated 1000 times. With the accuracy measures defined above, the performance of all the classifiers is shown in Tables 1-3 and Figures 1-3. Similar results are seen for the proposed method with the other way of defining $p_m$.

Table 1. F-score (F), G-mean (G) and the AUC (A) measures for all five classification methods (SVM, Wu [22], William [23], Maratea [24], Our Method) in Scenario 1 for $n_1 = 200$, $n_2 = 200$ and $n_3 = 200$, respectively. Max margin is 0.02.

Table 2. F-score (F), G-mean (G) and the AUC (A) measures for all five classification methods (SVM, Wu [22], William [23], Maratea [24], Our Method) in Scenario 2 for $n_1 = 100$, $n_2 = 200$ and $n_3 = 300$, respectively. Max margin is 0.04.

Table 3. F-score (F), G-mean (G) and the AUC (A) measures for all five classification methods (SVM, Wu [22], William [23], Maratea [24], Our Method) in Scenario 3 for $n_1 = 20$, $n_2 = 100$ and $n_3 = 480$, respectively. Max margin is 0.05.

It is seen that all the methods considered here improve on the ordinary SVM in almost all scenarios with different combinations of the parameters $C$ and $\sigma$. In general, the proposed method outperforms all the other classifiers considered, especially for imbalanced data. When $\sigma$ gets larger with $C$ fixed, the misclassification rate tends to decrease for all the methods compared. When $\sigma$ is relatively small, the proposed method performs better than the methods of Wu and Amari and of Williams et al., while if $\sigma$ is relatively large, all the methods perform nearly the same. This is because when $\sigma$ is large, the feasible solution set gets large, and all of the methods tend to find the optimal solution. Correspondingly, when $C$ increases, the budget for misclassification gets bigger, which means more tolerance is permitted so that the two classes can be separated. In the balanced scenario, we found that $p_m$ is roughly the same across classes, approximately the reciprocal of $|D|_{max}$. This makes sense because in the balanced-data case, the density of the distributed support vectors is roughly uniform, and hence the averages of the distances in the feature space for each data point are roughly the same.

For the imbalanced scenarios, the performance of all methods is, unsurprisingly, somewhat worse than in the balanced case because the support vectors are not uniformly distributed. The change of the misclassification rate with $C$ and $\sigma$ is similar to that in the balanced-data case. The proposed method performs the best among all the methods.

A Real Prostate Cancer MRI Dataset
In this section, we apply the proposed method to a prostate cancer MR image dataset. The study aims to find statistical methods to classify cancer and non-cancer areas, or grades of cancer, from the data obtained by imaging equipment. In this case, nine common classes are labeled, indicating different levels of severity of cancer. Note that the labels are given at the voxel level. That is, a specific patient is very likely to have different voxels (indicating different positions of the prostate tissue) belonging to different classes. A patient who has G3 + 4-type cancer in some areas is likely to have G3-type cancer as well as OtherProstate-type voxels in other areas. Our objective is to predict class labels at the voxel level. There are several labels associated with G5; however, the whole dataset contains only one patient with a very tiny area of G5 and the associated type of cancer. Therefore, G5 is extremely imbalanced.
In the first phase of the study, 21 patients are involved and more than 400 images are collected. Predictors on each voxel are the three-dimensional intensity measures from MRIs, denoted as T2W intensity, ADC intensity and C-Grade intensity. Other measures such as DCE and DWI are only available to part of the patients and hence are not included in the training process.
To adopt the proposed data-adaptive scaling in this multi-class case, two-stage SVMs are required. During the first stage, a standard SVM with the selected kernel is conducted so that the support vectors from the original dataset can be found. Based on the identified support vectors, the kernel functions are updated. Then, a second-stage SVM is conducted with the updated kernel, and the resulting estimated boundary is used as the rule for classification. To choose appropriate tuning parameters for each method, 7-fold cross validation at the patient level is repeated 500 times.
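Because the labels are given at the voxel level while the cross validation is carried out at the patient level, all voxels of a patient must fall into the same fold. A minimal sketch of such a split is given below; the function name, the seed and the toy data (21 patients with five voxels each) are our assumptions, and the actual fold assignment used in the study is not specified here.

```python
import random

def patient_level_folds(patient_ids, n_folds=7, seed=0):
    """Split voxel indices into folds so that no patient is shared between folds.

    patient_ids[i] is the patient that voxel i belongs to. Patients are
    shuffled and dealt to the folds in turn; each fold then collects the
    indices of all voxels from its patients.
    """
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(unique_patients)}
    folds = [[] for _ in range(n_folds)]
    for voxel_index, pid in enumerate(patient_ids):
        folds[fold_of_patient[pid]].append(voxel_index)
    return folds

# Toy data: 21 patients, 5 voxels per patient.
patient_ids = [pid for pid in range(21) for _ in range(5)]
folds = patient_level_folds(patient_ids, n_folds=7)

# Patients appearing in each fold (3 per fold when 21 patients are dealt into 7 folds).
patients_per_fold = [sorted({patient_ids[i] for i in fold}) for fold in folds]
```

Repeating the split with different seeds gives the repeated cross validation described above.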
To assess the performance, we compare our proposed method with both traditional and data-adaptive multi-category classification methods. Among the traditional methods, we include the indirect one-versus-one (1vs1) and one-versus-all (1vsA) methods, as well as the direct method of Crammer and Singer (CS) and He's simplified SVM (simSVP). Among the data-adaptive methods, we include the adaptive scaling methods of Wu and Amari and of Williams et al. As criteria of classification performance, the misclassification rate, the percentage of support vectors in the whole dataset, and the F-score and G-mean along with their margins are reported. Table 4 presents the assessment measures for all the methods considered. The proposed method performs best or nearly best among all the compared methods. A notable point is that the proposed method has the smallest margins in all performance measures, resulting from the robust decay of the magnification effect of the proposed data-adaptive kernel. In terms of accuracy, the proposed method has a misclassification rate similar to that of the indirect methods, which is significantly smaller than that of the remaining methods. The F-score and G-mean are the largest for the proposed method, much larger than those of the other data-adaptive kernel methods. The percentage of support vectors used for constructing the classifiers is the smallest for the proposed method. It is worth pointing out that among the wrongly predicted labels, G4 + 3 is the dominant class. In other words, the misclassification mostly occurs in G4 + 3 type cancer. This is because this type of cancer is really rare in the training sample, accounting for only 1-2% of all the labels. Such extremely imbalanced data make it very difficult to detect this class with high accuracy. The proposed method can detect around 60% of this type, while the other data-adaptive methods (those of Wu and Amari and of Williams et al.) can find less than 20%. All other methods cannot detect this class. Also, only our method detects the G5 class from the single patient with this label, while all competitor methods fail.

Concluding Remarks
In this paper, we developed a new data-dependent SVM construction technique for the multi-category classification problem. Based on the data-adaptive kernel SVM for the binary case, we proposed a new method to construct the data-dependent kernel for the multi-class setting, especially when the data are imbalanced. The data-dependent kernel functions have a more robust decay rate, which can vary with the density of the samples in the neighborhood. Thus, the kernel can be adapted to a specific dataset. Numerical results from both synthetic and real datasets have shown the excellent performance of the proposed method. Not only does the proposed method outperform the competitors in terms of commonly used accuracy measures such as the F-score and G-mean, but it also successfully detects more than 60% of the instances from the rare class in the real data, while the competitors can detect less than 20%. Possible future work is to select relevant predictors for the multi-class kernel functions and to consider the spatial association between different images. It is worth noting that the misclassification rate is affected by the distance between the mean vectors. For instance, misclassification will not occur if the centers of the three Gaussian distributions are sufficiently far from each other when the covariance matrix is set as the identity. The proposed method may be useful in other scientific research fields, such as detecting the boundaries of multiple regions of interest.