Article

Selection of Support Vector Candidates Using Relative Support Distance for Sustainability in Large-Scale Support Vector Machines

Minho Ryu and Kichun Lee
1 Vision AI Labs, SK Telecom, Seoul 04539, Korea
2 Department of Industrial Engineering, College of Engineering, Hanyang University, Seoul 04763, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(19), 6979; https://doi.org/10.3390/app10196979
Submission received: 8 September 2020 / Revised: 26 September 2020 / Accepted: 29 September 2020 / Published: 6 October 2020
(This article belongs to the Special Issue Applied Machine Learning)

Abstract

Support vector machines (SVMs) are a well-known classifier due to their superior classification performance. They are defined by a hyperplane that separates two classes with the largest margin. Computing the hyperplane, however, requires solving a quadratic programming problem: its storage cost grows with the square of the number of training points, and its time complexity is, in general, proportional to the cube of that number. It is therefore worth studying how to reduce the training time of SVMs without compromising performance, to prepare for sustainability in large-scale SVM problems. In this paper, we propose a novel data reduction method that shortens the training time by combining decision trees with relative support distance. We apply this new concept, relative support distance, to select good support vector candidates in each partition generated by the decision tree. The selected support vector candidates improve the training speed for large-scale SVM problems. In experiments, we demonstrate that our approach significantly reduces the training time while maintaining good classification performance in comparison with existing approaches.

1. Introduction

Support vector machines (SVMs) [1] are a powerful machine learning algorithm for classification problems that recognizes patterns via the kernel trick [2]. Because of their high performance and strong generalization ability compared with other classification methods, SVMs are widely used in bioinformatics, text and image recognition, and finance, to name a few areas. Basically, the method finds a linear boundary (hyperplane) that yields the largest margin between two classes (labels) in the input space [3,4,5,6]. It handles not only linear but also nonlinear separation: kernel functions map the input space to a high-dimensional space, called the feature space, where the optimal separating hyperplane is determined. The hyperplane in the feature space, which achieves a better separation of the training data, translates to a nonlinear boundary in the original space [7,8]. The kernel trick associates the kernel function with the mapping function, bringing forth a nonlinear separation in the input space.
Due to the growing speed of data acquisition in various domains and the continued popularity of SVMs, large-scale SVM problems frequently arise: human detection using histograms of oriented gradients, large-scale image classification, disease classification using mass spectra, and so forth. Even though SVMs show superior classification performance, their computing time and storage requirements increase dramatically with the number of instances, which is a major obstacle [9,10]. Because the goal of SVMs is to find the optimal separating hyperplane that maximizes the margin between two classes, they must solve a quadratic programming problem. In practice, the time complexity of the SVM training phase is at least $O(n^2)$, where $n$ is the number of data samples, depending on the kernel function [11]. Several approaches have been applied to improve the training speed of SVMs, among them sequential minimal optimization (SMO) [12], SVM-light [13], the simple support vector machine (SSVM) [14], and the library of support vector machines (LibSVM) [15]. Basically, they break the problem into a series of small problems that can be solved easily, reducing the required memory size.
Additionally, data reduction or selection methods have been introduced for large-scale SVM problems. Reduced support vector machines (RSVMs) rely on random sampling, which is quite simple and uses only a small portion of the large dataset [16]. However, the sampling needs to be applied several times, and unimportant observations are sampled with equal probability. The method presented by Collobert et al. efficiently parallelizes sub-problems, fitting very large SVM problems [17]. It uses cascades of SVMs in which the data are split into subsets that are optimized separately with multiple SVMs instead of analyzing the whole dataset. A method based on candidate vector selection (CVS) was presented that uses relative pairwise Euclidean distances in the input space to find candidate vectors in advance [18]. Because only the selected samples are used in the training phase, it trains quickly. However, its classification performance is relatively worse than that of the conventional SVM, so the need for selecting good candidate vectors arises.
For large-scale SVM problems, joint approaches that combine SVMs with other machine learning methods have also emerged. Many evolutionary algorithms have been proposed to select training data for SVMs [19,20,21,22]. Although they have shown promising results, these methods need to be executed multiple times to determine proper parameters and training data, which is computationally expensive. Decision tree methods have also been commonly proposed to reduce training data because their training time is proportional to $O(np^2)$, where $p$ is the number of discrete input variables [23], and is thus faster than that of traditional SVMs. The decision tree method recursively decomposes the input dataset into binary subsets through independent variables when the splitting condition is met. In supervised learning, decision trees, which also underpin random forests, are among the most popular models because they are easy to interpret and computationally inexpensive. Taking advantage of decision trees, several studies have combined SVMs with decision trees for large-scale SVM problems. Chang et al. [24] presented a method that uses a binary tree to decompose the input space into several regions and trains an SVM classifier on each of the decomposed regions. Another method using decision trees and Fisher's linear discriminant was proposed for large-scale SVM problems, in which Fisher's linear discriminant is applied to detect 'good' data samples near the support vectors [25]. Cervantes et al. [26] also utilized a decision tree to select candidate support vectors using the support vectors identified by an SVM trained on a small portion of the training data. These approaches, however, are limited in that they cannot properly handle regions that have nonlinear relationships.
The ultimate aim in dealing with large-scale SVM problems is to reduce the training time and memory consumption of SVMs without compromising performance. To this end, it is worth finding good support vector candidates as a data reduction method. Thus, in this paper we present a method that finds support vector candidates based on decision trees and works better than previous methods. We determine the decision hyperplane using support vector candidates chosen from the training dataset. In this approach, we introduce a new concept, the relative support distance, to effectively find candidates using decision trees while taking into account nonlinear relationships between local observations and labels. Decision tree learning decomposes the input space and helps find subspaces of the data whose majority class labels are opposite to each other. The relative support distance measures the degree to which an observation is likely to be a support vector, using a virtual hyperplane that bisects the two class centroids and the nonlinear relationship between the hyperplane and each of the two centroids.
This paper is organized as follows. Section 2 provides an overview of SVMs and decision trees, which are exploited in our algorithm. In Section 3, we introduce the proposed method for selecting support vector candidates using the relative support distance. Then, in Section 4, we provide experimental results comparing the performance of the proposed method with that of existing methods. Lastly, in Section 5, we conclude the paper with future research directions.

2. Preliminaries

In this section, we briefly summarize the concepts of support vector machines and decision trees. These concepts underlie the relative support distance, introduced in Section 3, which measures how likely a training point is to be a support vector.

2.1. Support Vector Machines

Support vector machines (SVMs) [1] are generally used for binary classification. Given $n$ pairs of instances with input vectors $x_1, x_2, \ldots, x_n$ and response variables $y_1, y_2, \ldots, y_n$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$, SVMs provide a decision function based on a hyperplane that optimally separates the two classes:
$$y = \mathrm{sign}(w^{t}x + b), \tag{1}$$
where $w$ is a weight vector and $b$ is a bias term. The margin is the distance between the hyperplane and the training data nearest the hyperplane; the distance from an observation $x$ to the hyperplane is $|w^{t}x + b| / \lVert w \rVert$. To find the hyperplane that maximizes the margin, we transform the problem into its dual by introducing Lagrange multipliers. Namely, in soft-margin SVMs with penalty parameter $C$, we find $w$ by solving the following optimization problem:
$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j), \quad \text{subject to} \ \sum_{i=1}^{n}\alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C/n, \ i = 1, \ldots, n, \tag{2}$$
where $C > 0$ and $\alpha_i$, $i = 1, \ldots, n$, are the dual variables corresponding to $x_i$; all $x_i$ with nonzero $\alpha_i$ are called support vectors. Numerically solving problem (2) for $\alpha_i$ yields $\alpha_i^*$, from which we compute $w^* = \sum_i \alpha_i^* y_i x_i$ and $b^* = y_i - w^{*t}x_i$ for any $i$ with $0 < \alpha_i^* < C/n$. The kernel function $K(x_i, x_j)$ is the inner product of the mapping function: $K(x_i, x_j) = \phi(x_i)^{t}\phi(x_j)$. The mapping function $\phi(x)$ maps the input vectors to a high-dimensional feature space. Well-known kernel functions include polynomial kernels, tangent kernels, and radial basis kernels. In this research, we chose the radial basis function (RBF) kernel with a free parameter $\gamma$, denoted as
$$K(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^2\right). \tag{3}$$
Notice that the radial basis kernel, whose mapping function $\phi(x)$ has an infinite number of dimensions [27], is flexible and the most widely used.
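As a concrete illustration (ours, not part of the original paper), the RBF kernel of Equation (3) can be evaluated directly, and a soft-margin SVM with this kernel can be fitted with scikit-learn's SVC, which wraps LibSVM; the toy data below are made up.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel of Equation (3): exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

# Toy data: two Gaussian blobs standing in for the two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Soft-margin SVM with the RBF kernel; scikit-learn's SVC wraps LibSVM,
# so C and gamma play the roles of the penalty and kernel parameters above.
clf = SVC(kernel="rbf", C=10.0, gamma=0.1).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))
```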

2.2. Decision Tree

A decision tree is a common tool in data mining and machine learning, used as a classification or regression model in which a tree-like graph of decisions is formed. Among well-known algorithms such as CHAID (chi-squared automatic interaction detection) [28], CART (classification and regression tree) [29], C4.5 [30], and QUEST (quick, unbiased, efficient, statistical tree) [31], we use CART, which is very similar to C4.5 in that it applies a binary splitting criterion recursively and leaves no empty leaf. Decision tree learning builds its model by recursively partitioning the training data into pure or homogeneous sub-regions. The prediction process for classification or regression can be expressed as inference rules based on the tree structure of the built model, so the model is easier to interpret and understand than many other methods. The tree-building procedure begins at the root node, which includes all instances of the training data. To find the best possible variable to split the node into two child nodes, we check all possible splitting variables (called splitters), as well as all possible values of each variable used to split the node. This involves $O(pn \log n)$ time complexity, where $p$ is the number of input variables and $n$ is the size of the training dataset [32]. In choosing the best splitter, we can use an impurity metric such as entropy or the Gini impurity, for example, $im(T) = 1 - \sum_{y} p(T = y)^2$, where $p(T = y)$ is the proportion of observations whose class label $T$ is $y$. Next, we define the difference between the impurity of the parent node and the weighted impurities of the two child nodes. Let us denote the impurity measure of the parent node by $im(T)$; the impurity measures of the two child nodes by $im(T_{left})$ and $im(T_{right})$; the number of parent-node instances by $|X_T|$; and the numbers of child-node instances by $|X_{T,left}|$ and $|X_{T,right}|$. We choose the best splitter as the query that decreases the impurity as much as possible:
$$\Delta im(T) = im(T) - \frac{|X_{T,left}|}{|X_T|}\, im(T_{left}) - \frac{|X_{T,right}|}{|X_T|}\, im(T_{right}). \tag{4}$$
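For illustration (our addition, not from the paper), the Gini impurity and the impurity decrease of Equation (4) can be computed for a candidate split as follows.

```python
import numpy as np

def gini(labels):
    """Gini impurity: im(T) = 1 - sum_y p(T = y)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Equation (4): im(T) minus the size-weighted impurities of the children."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

labels = np.array([0, 0, 0, 1, 1, 1])
# A perfect split separates the two classes and removes all impurity (0.5 -> 0).
print(impurity_decrease(labels, labels[:3], labels[3:]))  # 0.5
```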
There are two methods, pre-pruning and post-pruning, to avoid over-fitting in decision trees. Pre-pruning uses stopping conditions to halt node splitting before over-fitting occurs. Post-pruning lets the tree over-fit and then determines an appropriate tree size by backward pruning of the over-fitted tree [33]. Because post-pruning is generally known to be more effective than pre-pruning, we use the post-pruning algorithm.

3. Tree-Based Relative Support Distance

To cope with large-scale SVM problems, we propose a novel selection method for support vector candidates that combines tree decomposition with the relative support distance. We aim to reduce the training time of SVMs, that is, the numerical computation of $\alpha_i$ in (2) that produces $w^*$ and $b^*$ in (1), by selecting in advance good support vector candidates that form a small subset of the training data. To illustrate the idea, we start with a simple example in Figure 1, which shows the distribution of the iris data; for details of the data, refer to Fisher [34]. In short, the iris dataset describes iris plants using four continuous features and contains three classes of 50 instances each: Iris Setosa, Iris Versicolor, and Iris Virginica. We decompose the input space into several regions by decision tree learning. After training an SVM model on the whole dataset, we mark the support vectors with filled shapes. Each region has its own majority class label, and the boundaries lie between the two majority classes. The support vectors are close to the boundaries. In addition, we notice that they are located relatively far away from the center of the data points with the majority class label in a region.
In light of this, we describe our algorithm for finding a subset of support vectors that determine the separating hyperplane. We divide the training dataset into several decomposed regions in the input space by decision tree learning. The tree learning algorithm makes most of the data points in each decomposed region belong to that region's majority class label. Next, we detect adjacent regions whose majority class is opposite to that of each region; we call such a region a distinct adjacent region. Then we calculate a new distance measure, the relative support distance, for the data points in the selected region pairs. The procedure of the algorithm is as follows (a code sketch is given after the list):
  • Decompose the input space by decision tree learning.
  • Find distinct adjacent regions, that is, adjacent regions whose majority class differs from that of the region in question.
  • Calculate the relative support distances for the data points in the found distinct adjacent regions.
  • Select support vector candidates according to the relative support distances.
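A minimal end-to-end sketch of these four steps is given below in Python (the paper's implementation is in R); DecisionTreeClassifier stands in for CART, and find_distinct_adjacent_pairs and top_beta_by_rsd are assumed helper callables corresponding to the sketches in Sections 3.1 and 3.2.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def train_with_candidates(X, y, find_distinct_adjacent_pairs, top_beta_by_rsd,
                          beta=0.2, random_state=0):
    """Sketch of the proposed pipeline: tree decomposition, distinct adjacent
    regions, relative-support-distance scoring, and SVM training on candidates.
    The two helper callables are assumptions, sketched in Sections 3.1 and 3.2."""
    # Step 1: decompose the input space with a (CART-like) decision tree.
    tree = DecisionTreeClassifier(random_state=random_state).fit(X, y)
    leaf_id = tree.apply(X)  # leaf index of every training point

    # Step 2: distinct adjacent regions, i.e., touching leaves whose
    # majority classes are opposite to each other.
    pairs = find_distinct_adjacent_pairs(tree, X, y, leaf_id)

    # Steps 3-4: score points by relative support distance and keep the
    # top-beta fraction per region pair.
    candidates = set()
    for leaf_o, leaf_q in pairs:
        candidates |= set(top_beta_by_rsd(X, y, leaf_id, leaf_o, leaf_q, beta))
    idx = np.array(sorted(candidates))

    # Train the final SVM only on the selected support vector candidates.
    return SVC(kernel="rbf").fit(X[idx], y[idx])
```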

3.1. Distinct Adjacent Regions

After applying decision tree learning to the training data, we detect adjacent regions. The decision tree partitions the input space into several leaves (also called terminal nodes) by reducing an impurity measure such as entropy. Following the approach for detecting adjacent regions introduced by Chau et al. [25], we state mathematical conditions for two regions to be adjacent and relate them to the relative support distance. First, we represent each terminal node of a learned decision tree as follows:
$$L_q = \prod_{j=1}^{p} b_{qj}, \qquad b_{qj} = [\,l_{qj},\, h_{qj}\,], \tag{5}$$
where $L_q$ is the $q$th leaf in the tree structure and $b_{qj}$ is the boundary range for the $j$th variable of the $q$th leaf, with lower bound $l_{qj}$ and upper bound $h_{qj}$. Recall that $p$ is the number of input variables. We check whether each pair of leaves, $L_o$ and $L_q$, meets the following criteria:
$$h_{os} = l_{qs} \quad \text{or} \quad l_{os} = h_{qs}, \tag{6}$$
$$l_{qk} \le l_{ok} \le h_{qk} \quad \text{or} \quad l_{qk} \le h_{ok} \le h_{qk}, \tag{7}$$
where $s$ and $k$ are input variables with $1 \le s \le p$, $1 \le k \le p$, and $s \ne k$. That is to say, if two leaves $L_o$ and $L_q$ are adjacent regions, they have to share one variable, represented by the variable $s$ in Equation (6), and one boundary, induced by the variable $k$ in (7). Among all adjacent regions, we consider only distinct adjacent regions. For example, in Figure 2, the neighbors of $L_1$ are $L_2$, $L_4$, and $L_5$; $\{L_1, L_5\}$, however, does not form an adjacent region pair. $\{L_3, L_5\}$ is an adjacent region pair but not distinct, since those regions have the same majority class. Therefore, the distinct adjacent regions in the example are only $\{L_1, L_2\}$, $\{L_1, L_4\}$, $\{L_2, L_3\}$, $\{L_2, L_5\}$, and $\{L_4, L_5\}$; they are summarized in Table 1. We now apply the relative support distance to select support vector candidates in the distinct adjacent regions of each region.
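A direct check of conditions (6) and (7) on the bounding boxes of two leaves might look as follows (our sketch; it interprets (7) as having to hold for the remaining variables, and assumes each leaf is described by per-variable lower and upper bound arrays).

```python
def is_adjacent(lo_l, lo_h, lq_l, lq_h):
    """Check conditions (6) and (7) for leaves L_o and L_q, each described by
    arrays of lower bounds (l) and upper bounds (h), one entry per variable."""
    p = len(lo_l)
    # Condition (6): the leaves touch along some variable s.
    touching = [s for s in range(p) if lo_h[s] == lq_l[s] or lo_l[s] == lq_h[s]]
    if not touching:
        return False
    s = touching[0]
    # Condition (7): along the other variables, the boundary ranges overlap.
    for k in range(p):
        if k == s:
            continue
        if not (lq_l[k] <= lo_l[k] <= lq_h[k] or lq_l[k] <= lo_h[k] <= lq_h[k]):
            return False
    return True

def is_distinct_adjacent(bounds_o, bounds_q, majority_o, majority_q):
    """Distinct adjacent regions additionally have opposite majority classes."""
    return majority_o != majority_q and is_adjacent(*bounds_o, *bounds_q)
```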

3.2. Relative Support Distance

Support vectors (SVs) play a substantial role in determining the decision hyperplane, in contrast to non-SV data points. We extract the data points in the training data that are most likely to be support vectors, constructing a set of support vector candidates. Given two distinct adjacent regions $L_1$ and $L_2$ from the previous step, assume without loss of generality that the majority class of $L_1$ is class $c = 1$ and that of $L_2$ is class $c = 2$. First, we calculate the centroid $m_c$ of each majority class: for the index set $S_c = \{\, i \mid x_i \in L_c \text{ and the label of } x_i = c \,\}$,
$$m_c = \frac{1}{n_c} \sum_{i \in S_c} x_i, \tag{8}$$
where $c \in \{1, 2\}$ and $n_c$ is the cardinality of the index set $S_c$.
In other words, $m_c$ is the majority-class centroid of the data points in $L_c$ whose labels equal $c$. Next, we create a virtual hyperplane that bisects the line segment from $m_1$ to $m_2$:
$$M = \frac{1}{2}(m_1 + m_2), \qquad W = m_1 - m_2, \tag{9}$$
where $M$ is the midpoint of the two majority-class centroids. The virtual hyperplane is given by $H(x) = 0$, where
$$H(x) = W^{t}(x - M). \tag{10}$$
Lastly, we calculate the distance $r(x)$ between each data point $x$ in $S_c$ and $m_c$, and the distance $h(x)$ between each data point in $S_c$ and the virtual hyperplane $H(x) = 0$:
$$r(x_{c,l}) = \lVert x_{c,l} - m_c \rVert, \qquad h(x_{c,l}) = \frac{|H(x_{c,l})|}{\lVert W \rVert} = \frac{|W^{t}(x_{c,l} - M)|}{\lVert m_1 - m_2 \rVert}, \tag{11}$$
where $x_{c,l}$ is the $l$th data point belonging to $S_c$. Figure 3 shows a conceptual illustration of $r$ and $h$ using the virtual hyperplane in a leaf. After calculating $r(x)$ and $h(x)$, we apply feature scaling to bring all values into the range between 0 and 1. Our observation is that data points lying close to the virtual hyperplane are likely to be support vectors, whereas data points lying close to the centroid are less likely to be support vectors. In light of this, we select data points lying near the hyperplane and far away from the centroid. For this purpose, we define the relative support distance $T(r(x), h(x))$ as follows:
$$T(r(x), h(x)) = \frac{1}{1 + e^{-(r(x) - h(x))}}. \tag{12}$$
The larger $T(r(x), h(x))$ is, the more likely the associated $x$ is to be a support vector.
The relationship between support vectors and the distances $r$ and $h$ is illustrated in Figure 4. We use leaf $L_4$, whose distinct adjacent regions are $L_1$ and $L_2$ in Figure 1. In Figure 4, the observations marked by circles are non-support vectors, while those marked by triangles (in red) are support vectors obtained after training an SVM on all the data. The distances $r$ and $h$ of $L_4$ relative to $L_1$ are shown in Figure 4a, and the corresponding relative support distances in Figure 4c. Likewise, those of $L_4$ relative to $L_2$ are shown in Figure 4b,d. We observe that the observations marked by triangles and surrounded by a red ellipsoid in Figure 1 correspond to the support vectors surrounded by a red ellipsoid in region $L_4$ in Figure 4a, and they have larger values of relative support distance, as shown in Figure 4c. Similarly, the observations marked by triangles and surrounded by a green ellipsoid in Figure 4b correspond to the support vectors surrounded by a green ellipsoid in region $L_4$ in Figure 1, and they also have larger values of relative support distance, as shown in Figure 4d. The support vectors in $L_4$ are obtained by collecting observations with large values of relative support distance, for example by the rule $T(r(x), h(x)) > 0.9$, from both the pair $\{L_4, L_1\}$ and the pair $\{L_4, L_2\}$. The results reveal that observations with a larger distance $r$ and a shorter distance $h$ are likely to be support vectors.
For each region, we calculate the pairwise relative support distance with its distinct adjacent regions and select a fraction of the observations, controlled by a parameter $\beta$, in decreasing order of $T(r(x), h(x))$ as the candidate set of support vectors. That is, for each region, we select the top $\beta$ fraction of the training data based on $T(r(x), h(x))$. The parameter $\beta$, between 0 and 1, is the proportion of selected data points. For example, when $\beta$ is set to 1, all data points are included in the training of the SVM; when $\beta = 0.1$, we exclude 90% of the data points and reduce the training set to 10%. Finally, we combine the relative support distance with random sampling: half of the training candidates are selected based on the proposed distance and the other half by random sampling. Although the proposed distance is quite informative for selecting possible support vectors, it is calculated locally within distinct adjacent regions; random sampling compensates for this by providing information about the whole data distribution.
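The computation in Equations (8)-(12) and the top-$\beta$ selection for one ordered pair of distinct adjacent regions can be sketched as follows (our illustration; X1 holds the majority-class points of the region being scored and X2 those of its distinct adjacent region, and the random-sampling half described above is omitted).

```python
import numpy as np

def scale01(v):
    """Min-max feature scaling to [0, 1]; a constant vector maps to zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def rsd_top_beta(X1, X2, beta=0.2):
    """Relative support distance of the majority-class points X1 of a region
    against the majority-class points X2 of a distinct adjacent region.
    Returns the indices (into X1) of the top-beta candidates."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # centroids, Eq. (8)
    M, W = 0.5 * (m1 + m2), m1 - m2                # midpoint and normal, Eq. (9)
    r = np.linalg.norm(X1 - m1, axis=1)            # distance to own centroid
    h = np.abs((X1 - M) @ W) / np.linalg.norm(W)   # distance to H(x) = 0, Eq. (11)
    r, h = scale01(r), scale01(h)
    T = 1.0 / (1.0 + np.exp(-(r - h)))             # Eq. (12): large r, small h -> large T
    k = max(1, int(np.ceil(beta * len(X1))))
    return np.argsort(-T)[:k]                      # top-beta fraction by T
```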

4. Experimental Results

In the experiments, we compare the proposed method, tree-based relative support distance (denoted by TRSD), with previously suggested methods, specifically SVMs with candidate vector selection, denoted by CVS [18], and SVMs with Fisher's linear discriminant analysis, denoted by FLD [25], as well as standard SVMs, denoted by SVM. For all compared methods, we use LibSVM [15], since it is one of the fastest implementations for training SVMs. The experiments are run on a computer with a Core i5 3.4 GHz processor, 16.0 GB RAM, and the Windows 10 Enterprise operating system. The algorithms are implemented in the R programming language. We use 18 datasets from the UCI Machine Learning Repository [35] and the LibSVM Data Repository [36], except for the checkerboard dataset [37]: a9a, banana, breast cancer, four-class, German credit, IJCNN-1 [38], iris, mushroom, phishing, Cod-RNA, skin segmentation, waveform, and w8a. The iris and waveform datasets are modified into binary classification problems by assigning one class as positive and the others as negative. Table 2 summarizes the datasets used in the experiments, where Size is the number of instances in the dataset and Dim is the number of features.
For testing, we apply three-fold cross validation repeated three times: we shuffle each dataset, divide it into three parts, and use two parts as the training set and the remaining part as the test set, with different seeds. We use the RBF kernel for training SVMs in all tested methods. For each experiment, cross validation and grid search are used to tune the two hyper-parameters: the penalty factor $C$ and the RBF kernel parameter $\gamma$ in Equation (3). The hyper-parameters are searched on a two-dimensional grid with $C \in \{0.1, 1, 10, 100\}$ and $\gamma \in \{0.001, 0.01, 0.1, 1, 1/p\}$, where $p$ is the number of features. Table 3 shows the values used for each dataset in the experiments. Moreover, we vary the fraction of data points selected in each region, $\beta$, from 0.1 to 0.3 in steps of 0.1.
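The tuning protocol described above corresponds to a standard grid search; a sketch with scikit-learn (not the paper's R code) is shown below, where the $1/p$ entry of the $\gamma$ grid depends on the dataset at hand.

```python
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Grid search over C and gamma with repeated three-fold cross validation."""
    p = X.shape[1]
    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [0.001, 0.01, 0.1, 1, 1.0 / p],  # 1/p varies with the dataset
    }
    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
    return search.fit(X, y)

# Example: best = tune_rbf_svm(X_train, y_train); print(best.best_params_)
```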
We compare the performance of SVM, CVS, FLD, and TRSD in terms of classification accuracy and training time (in seconds), summarized in Table 4. We also depict the performance comparison of the proposed TRSD with CVS and FLD on the five largest datasets at $\beta = 0.1$ in Figure 5, using a log-2 scale for the y-axis in Figure 5b. In Table 4, Acc is the accuracy on test data, $\sigma$ is the standard deviation, and Time is the training time in seconds. Even though the accuracy of the proposed algorithm is slightly degraded in a few cases, it is higher than that of CVS and FLD in most cases. In addition, as $\beta$ increases, the accuracy of the proposed algorithm improves substantially. For small datasets, there is no significant improvement in computation time compared with the standard SVM, since those datasets are already small enough. However, the training time of TRSD improves considerably on the large-scale datasets.
For statistical analysis, we also performed the Friedman test to check whether there is a significant difference among the compared methods in terms of accuracy. When the null hypothesis of the Friedman test is rejected, we performed Dunn's test. Table 5 summarizes the Dunn's test results at the significance level $\alpha = 0.05$. In Table 5, the entries (1) TRSD > CVS (FLD), (2) TRSD ≈ CVS (FLD), and (3) TRSD < CVS (FLD) respectively denote that (1) the performance of TRSD is significantly better than that of CVS (FLD); (2) there is no significant difference between the performances of TRSD and CVS (FLD); and (3) the performance of TRSD is significantly worse than that of CVS (FLD). Each number in Table 5 is a count of datasets. At $\beta = 0.1$, our proposed method is significantly better than CVS and FLD in 10 and 9 of the 18 datasets, respectively; these numbers increase to 11 and 10 at $\beta = 0.3$. On the other hand, our proposed method is significantly worse than CVS in only 2, 3, and 3 cases, and worse than FLD in 0, 1, and 0 cases, at $\beta = 0.1, 0.2, 0.3$, respectively. Based on these observations, we conclude that our proposed method effectively reduces the training datasets while producing better performance than the existing data reduction approaches.
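The omnibus comparison can be reproduced with SciPy's Friedman test; the sketch below uses placeholder accuracy values (not the paper's results), and a pairwise follow-up such as Dunn's test is available in the third-party scikit-posthocs package.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Accuracy of each method on the same datasets (one entry per dataset).
# These numbers are placeholders, not the values reported in Table 4.
acc_trsd = np.array([0.90, 0.94, 0.88, 0.99, 0.96, 0.84])
acc_cvs  = np.array([0.85, 0.93, 0.84, 0.97, 0.91, 0.78])
acc_fld  = np.array([0.87, 0.92, 0.86, 0.98, 0.96, 0.81])

stat, pvalue = friedmanchisquare(acc_trsd, acc_cvs, acc_fld)
print("Friedman statistic:", stat, "p-value:", pvalue)
if pvalue < 0.05:
    # The omnibus null is rejected; a pairwise post-hoc test (e.g. Dunn's test)
    # would then be applied to locate the differing method pairs.
    print("Significant difference among the methods; run a post-hoc test.")
```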
Finally, to compare the training time of SVM, CVS, FLD, and TRSD in detail, we split it into two parts, selecting candidate vectors (SC) and training the final SVM model (TS), at $\beta = 0.3$; the results are summarized in Table 6. From Table 6, we notice that TRSD and FLD take longer than CVS to select candidate vectors on the w8a dataset. This is because the time complexity of building a decision tree is $O(pn \log n)$, where $p$ is the number of features and $n$ is the size of the training dataset. However, our proposed method takes less time than FLD, since calculating the relative support distance is computationally cheaper than Fisher's linear discriminant, and it is the fastest overall. The results in Table 6 show that the proposed method efficiently selects support vector candidates while maintaining good classification performance.

5. Discussion and Conclusions

In this study, we have proposed a tree-based data reduction approach for solving large-scale SVM problems. To reduce the time spent training SVM models, we apply a novel support vector selection method that combines tree decomposition with the proposed relative support distance. We introduce this relative distance measure, along with a virtual hyperplane between two distinct adjacent regions, to effectively exclude non-SV data points. The virtual hyperplane, which is easy to obtain, takes advantage of the decomposed tree structure and is shown to be effective in selecting support vector candidates. In computing the relative support distance, we also use the distance from each data point to the centroid of its region and combine the two distances in consideration of the nonlinear characteristics of support vectors. In experiments, we have demonstrated that the proposed method outperforms some existing methods for selecting support vector candidates in terms of computation time and classification performance. In the future, we would like to investigate other large-scale SVM problems such as multi-class classification and support vector regression. We also envision an extension of the proposed method to under-sampling techniques.

Author Contributions

Investigation, M.R.; Methodology, M.R.; Software, M.R.; Writing—original draft, M.R.; Writing—review & editing, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2020R1F1A1076278).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  2. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  3. Cai, Y.D.; Liu, X.J.; Xu, X.B.; Zhou, G.P. Support Vector Machines for predicting protein structural class. BMC Bioinform. 2001, 2, 3. [Google Scholar] [CrossRef] [PubMed]
  4. Belongie, S.; Malik, J.; Puzicha, J. Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 509–522. [Google Scholar] [CrossRef] [Green Version]
  5. Ahn, H.; Lee, K.; Kim, K.j. Global Optimization of Support Vector Machines Using Genetic Algorithms for Bankruptcy Prediction. In Proceedings of the 13th International Conference on Neural Information Processing—Volume Part III; Springer: Berlin, Germany, 2006; pp. 420–429. [Google Scholar]
  6. Bayro-Corrochano, E.J.; Arana-Daniel, N. Clifford Support Vector Machines for Classification, Regression, and Recurrence. IEEE Trans. Neural Netw. 2010, 21, 1731–1746. [Google Scholar] [CrossRef] [PubMed]
  7. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory; Association for Computing Machinery: New York, NY, USA, 1992; pp. 144–152. [Google Scholar]
  8. Friedman, J.; Hastie, T.; Tibshirani, R. The Elements of Statistical Learning; Springer: Berlin, Germany, 2001; Volume 1. [Google Scholar]
  9. Qiu, J.; Wu, Q.; Ding, G.; Xu, Y.; Feng, S. A survey of machine learning for big data processing. EURASIP J. Adv. Signal. Process. 2016, 2016, 1–16. [Google Scholar]
  10. Liu, P.; Choo, K.K.R.; Wang, L.; Huang, F. SVM of Deep Learning? A Comparative Study on Remote Sensing Image Classification. Soft Comput. 2017, 21, 7053–7065. [Google Scholar] [CrossRef]
  11. Chapelle, O. Training a support vector machine in the primal. Neural Comput. 2007, 19, 1155–1178. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Platt, J.C. Fast training of support vector machines using sequential minimal optimization. Adv. Kernel Methods 1999, 185–208. [Google Scholar]
  13. Joachims, T. SVM-Light: Support Vector Machine; University of Dortmund: Dortmund, Germany, 1999. Available online: http://svmlight.joachims.org (accessed on 29 November 2019).
  14. Vishwanathan, S.; Murty, M.N. SSVM: A simple SVM algorithm. In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN’02), Honolulu, HI, USA, 12–17 May 2002. [Google Scholar]
  15. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2011, 2, 27. [Google Scholar] [CrossRef]
  16. Lee, Y.J.; Mangasarian, O.L. RSVM: Reduced Support Vector Machines. In Proceedings of the 2001 SIAM International Conference on Data Mining, Chicago, IL, USA, 5–7 April 2001; pp. 325–361. [Google Scholar]
  17. Collobert, R.; Bengio, S.; Bengio, Y. A parallel mixture of SVMs for very large scale problems. Neural Comput. 2002, 14, 1105–1114. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Li, M.; Chen, F.; Kou, J. Candidate vectors selection for training support vector machines. In Proceedings of the Third International Conference on Natural Computation (ICNC 2007), Haikou, China, 24–27 August 2007; pp. 538–542. [Google Scholar]
  19. Nishida, K.; Kurita, T. RANSAC-SVM for large-scale datasets. In Proceedings of the 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar]
  20. Kawulok, M.; Nalepa, J. Support Vector Machines Training Data Selection Using a Genetic Algorithm. In Proceedings of the 2012 Joint IAPR International Conference on Structural, Syntactic, and Statistical Pattern Recognition; Springer: Berlin, Germany, 2012; pp. 557–565. [Google Scholar]
  21. Nalepa, J.; Kawulok, M. A Memetic Algorithm to Select Training Data for Support Vector Machines. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Association for Computing Machinery, New York, NY, USA, 12–16 July 2014; pp. 573–580. [Google Scholar]
  22. Nalepa, J.; Kawulok, M. Adaptive Memetic Algorithm Enhanced with Data Geometry Analysis to Select Training Data for SVMs. Neurocomputing 2016, 185, 113–132. [Google Scholar] [CrossRef]
  23. Martin, J.K.; Hirschberg, D. On the complexity of learning decision trees. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics, 1996; pp. 112–115. [Google Scholar]
  24. Chang, F.; Guo, C.Y.; Lin, X.R.; Lu, C.J. Tree decomposition for large-scale SVM problems. J. Mach. Learn. Res. 2010, 11, 2935–2972. [Google Scholar]
  25. Chau, A.L.; Li, X.; Yu, W. Support vector machine classification for large datasets using decision tree and fisher linear discriminant. Future Gener. Comput. Syst. 2014, 36, 57–65. [Google Scholar] [CrossRef]
  26. Cervantes, J.; Garcia, F.; Chau, A.L.; Rodriguez-Mazahua, L.; Castilla, J.S.R. Data selection based on decision tree for SVM classification on large data sets. Appl. Soft Comput. 2015, 37, 787–798. [Google Scholar] [CrossRef] [Green Version]
  27. Radial Basis Function Kernel, Wikipedia, Wikipedia Foundation. Available online: https://en.wikipedia.org/wiki/Radial_basis_function_kernel (accessed on 29 November 2019).
  28. Kass, G.V. An Exploratory Technique for Investigating Large Quantities of Categorical Data. J. R. Stat. Soc. 1980, 29, 119–127. [Google Scholar] [CrossRef]
  29. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth and Brooks: Monterey, CA, USA, 1984. [Google Scholar]
  30. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993. [Google Scholar]
  31. Loh, W.Y.; Shih, Y.S. Split Selection Methods for Classification Trees. Stat. Sin. 1997, 7, 815–840. [Google Scholar]
  32. Martin, J.K.; Hirschberg, D.S. The Time Complexity of Decision Tree Induction; University of California: Irvine, CA, USA, 1995. [Google Scholar]
  33. Li, X.B.; Sweigart, J.; Teng, J.; Donohue, J.; Thombs, L. A dynamic programming based pruning method for decision trees. Inf. J. Comput. 2001, 13, 332–344. [Google Scholar] [CrossRef]
  34. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  35. Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019. [Google Scholar]
  36. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27:1–27:27. Available online: https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/binary.html (accessed on 29 November 2019).
  37. Ho, T.K.; Kleinberg, E.M. Checkerboard Data Set. 1996. Available online: https://research.cs.wisc.edu/math-prog/mpml.html (accessed on 29 November 2019).
  38. Prokhorov, D. IJCNN 2001 Neural Network Competition. Slide Presentation at IJCNN’01, Ford Research Laboratory, 2001. Available online: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html (accessed on 29 November 2019).
Figure 1. The construction of a decision tree and SVMs for the iris data shows the boundaries and support vectors (the filled shapes). The support vectors in the regions are located far away from the majority-class centroid.
Figure 2. Distinct adjacent regions are $\{L_1, L_2\}$, $\{L_1, L_4\}$, $\{L_2, L_3\}$, $\{L_2, L_5\}$, and $\{L_4, L_5\}$.
Figure 3. Distances $r$ from the centroid and $h$ from the virtual hyperplane are shown. The two star shapes are the centroids of the two classes.
Figure 4. Illustration of the distances $r$ and $h$ for the distinct adjacent regions of the iris data in Figure 1. (a) Distances for leaf $L_4$ relative to $L_1$; the support vectors captured by leaf $L_4$ relative to leaf $L_1$ are in the (dotted) red ellipsoid. (b) Distances for leaf $L_4$ relative to $L_2$; the support vectors captured by leaf $L_4$ relative to leaf $L_2$ are in the (dashed) green ellipsoid. (c) Relative support distances of the observations $x_i$ in leaf $L_4$ relative to $L_1$; the support vectors captured by leaf $L_4$ relative to leaf $L_1$ are in the (dotted) red ellipsoid. (d) Relative support distances of the observations $x_i$ in leaf $L_4$ relative to $L_2$; the support vectors captured by leaf $L_4$ relative to leaf $L_2$ are in the (dashed) green ellipsoid.
Figure 5. Comparison of the proposed tree-based relative support distance (TRSD) with candidate vectors (CVS) and Fisher linear discriminant analysis (FLD) on the five largest datasets.
Table 1. Partition of input regions and distinct adjacent regions.

Region    Distinct Adjacent Regions
L1        L2, L4
L2        L1, L3, L5
L3        L2
L4        L1, L5
L5        L2, L4
Table 2. Datasets for experiments.

Dataset              Size      Dim   y_i = +1   y_i = −1
Iris-Setosa          150       4     50         100
Iris-Versicolor      150       4     50         100
Iris-Virginia        150       4     50         100
Breast Cancer        683       10    444        239
Four-class           862       2     555        307
Checkerboard         1000      2     514        486
German Credit        1000      24    700        300
Waveform-0           5000      21    1657       3343
Waveform-1           5000      21    1657       3343
Waveform-2           5000      21    1657       3343
Banana               5300      2     2924       2376
Mushroom             8124      112   3916       4208
Phishing             11,055    68    4898       6157
w8a                  45,546    300   44,226     1320
a9a                  48,842    123   37,155     11,687
IJCNN-1              141,691   22    128,126    13,565
Skin Segmentation    245,057   3     50,859     194,198
Cod-RNA              488,565   8     325,710    162,855
Table 3. Hyper-parameter settings for the experiments.

Method   Dataset   Penalty Factor C   RBF Kernel γ   Dataset   Penalty Factor C   RBF Kernel γ
SVM Iris-Setosa100.001Waveform-20.1 1 / p
CVS 1000.001 100.01
FLD 1000.001 1000.001
TRSD 1000.001 1 1 / p
SVM Iris-Versicolor100.1Banana11
CVS 1000.1 1001
FLD 0.10.01 10 1 / p
TRSD 101 11
SVM Iris-Virginia1000.01Mushroom1000.001
CVS 1000.01 100.001
FLD 1000.001 0.10.01
TRSD 1000.1 1000.001
SVM Breast Cancer10.01Phishing10 1 / p
CVS 100.001 10.01
FLD 1000.001 1000.001
TRSD 10.1 1000.001
SVM Four-class101w8a100.001
CVS 1001 100.01
FLD 1001 10.001
TRSD 101 100.001
SVM Checkerboard1001a9a100.001
CVS 1001 100.01
FLD 1001 1000.001
TRSD 1001 100.001
SVM German Credit1000.001IJCNN-1100.1
CVS 1 1 / p 100 1 / p
FLD 0.10.001 1000.01
TRSD 10.01 10 1 / p
SVM Waveform-010.01Skin Segmentation1001
CVS 100.01 1001
FLD 10.01 1000.1
TRSD 10.01 1001
SVM Waveform-11 1 / p Cod-RNA101
CVS 10 1 / p 1000.001
FLD 1 1 / p 1000.01
TRSD 1 1 / p 10 1 / p
Table 4. Comparisons in terms of accuracy and time on the datasets (the results of the standard support vector machine (SVM) are not bolded because it had access to all examples).

Dataset   SVM (Acc, σ, Time)   CVS (Acc, σ, Time)   FLD (Acc, σ, Time)   TRSD (Acc, σ, Time)
β = 0.1
Iris-Setosa10000.0110000.0110000.0110000.01
Iris-Versicolor95.430.980.0178.5412.710.0152.2917.530.0287.712.220.02
Iris-Virginia96.572.240.0187.996.550.0191.714.070.0291.54.720.01
Breast Cancer96.990.890.0195.360.50.0196.30.950.0896.241.040.04
Four-class10000.0193.190.080.0193.283.580.0793.242.40.05
Checkerboard94.30.630.0384.611.470.01---79.343.130.08
German Credit75.70.70.0969.870.750.01700.040.4270.550.970.21
Waveform-089.850.631.0385.50.730.2187.510.650.5889.010.510.34
Waveform-191.260.30.6883.70.870.289.460.380.6690.060.50.35
Waveform-291.760.480.9484.521.230.2189.770.970.7490.310.480.38
Banana90.60.690.5964.132.690.0383.461.770.0988.80.40.06
Mushroom10001.4790.670.622.0884.997.231.1199.830.120.79
Phishing96.690.264.59900.492.2893.280.691.4693.960.260.79
w8a99.110.04114.8597.40.0519.6497.830.3231.8998.380.0951.18
a9a84.70.14373.2478.480.2943.1981.270.7221.9184.240.1612.59
IJCNN-199.270.7224.2997.020.129.8397.550.11298.390.057.71
Skin Segmentation99.94020.2399.930.011.2799.450.113.3999.890.012.51
Cod-RNA96.980.034425.5590.680.08178.396.050.0866.9396.320.0342.85
β = 0.2
Iris-Setosa10000.0110000.0110000.0110000.01
Iris-Versicolor95.430.980.0189.443.920.0151.7217.020.0293.431.480.02
Iris-Virginia96.572.240.0193.432.20.0194.262.170.0290.313.030.02
Breast Cancer96.990.890.0195.551.090.0196.490.910.0796.30.950.06
Four-class10000.0196.421.780.0195.381.630.0798.810.80.05
Checkerboard94.30.630.0393.661.090.01---83.662.480.08
German Credit75.70.70.0970.040.010.02700.040.3570.560.620.25
Waveform-089.850.631.0388.350.370.2788.550.730.7689.280.730.54
Waveform-191.260.30.6886.450.560.2689.890.370.7890.520.490.5
Waveform-291.760.480.9488.210.50.2790.390.540.9990.620.580.51
Banana90.60.690.5979.862.020.185.141.50.189.690.360.07
Mushroom10001.4797.091.122.2297.370.881.3399.910.10.85
Phishing96.690.264.5993.910.22.5894.420.381.6394.620.30.99
w8a99.110.04114.8597.50.0727.2298.120.12233.7898.630.0563.04
a9a84.70.14373.2478.860.2674.2282.190.4239.3784.610.1824.27
IJCNN-199.270.7224.2998.410.0523.5698.070.0918.798.720.0714.45
Skin Segmentation99.94020.2399.940.012.3699.730.064.8299.90.013.64
Cod-RNA96.980.034425.5595.440.03514.996.190.03293.2996.430.03152.98
β = 0.3
Iris-Setosa10000.0110000.0110000.0110000.01
Iris-Versicolor95.430.980.0192.32.440.0159.8714.470.0194.012.570.02
Iris-Virginia96.572.240.0193.982.030.0195.512.530.0194.311.770.02
Breast Cancer96.990.890.0196.420.690.0196.360.550.0896.170.790.05
Four-class10000.0196.621.520.0195.621.60.0799.50.660.05
Checkerboard94.30.630.0393.9610.01---87.182.30.1
German Credit75.70.70.0969.950.220.03700.040.3570.810.870.29
Waveform-089.850.631.0389.10.380.3588.930.710.7189.470.340.41
Waveform-191.260.30.6887.490.330.3490.180.460.7490.80.340.38
Waveform-291.760.480.9489.630.430.3390.710.660.8491.090.630.41
Banana90.60.690.5981.470.740.1786.631.830.1290.090.490.1
Mushroom10001.4797.790.712.1797.870.281.7199.950.070.97
Phishing96.690.264.5994.510.16394.750.31.9994.930.141.29
w8a99.110.04114.8597.710.0849.6398.360.07237.0198.830.0655.09
a9a84.70.14373.2480.350.2127.483.180.2877.1884.680.1343.02
IJCNN-199.270.7224.2998.680.0637.7998.30.0834.7598.860.0625.43
Skin Segmentation99.94020.2399.940.013.8499.750.087.0199.9103.78
Cod-RNA96.980.034425.5595.920.04899.196.270.01598.2396.480.02256.71
Table 5. Dunn's test results for different values of β at the significance level α = 0.05.

β      TRSD > CVS   TRSD ≈ CVS   TRSD < CVS   TRSD > FLD   TRSD ≈ FLD   TRSD < FLD
0.1    10           6            2            9            9            0
0.2    11           4            3            9            8            1
0.3    11           4            3            10           8            0
Total  32           14           8            28           25           1
Table 6. Time comparison in detail. SC and TS denote the computing time for selecting candidate vectors and for training the final SVM model, respectively.

Dataset   SVM (SC, TS)   CVS (SC, TS)   FLD (SC, TS)   TRSD (SC, TS)
Iris-Setosa00.0100.0100.0100.01
Iris-Versicolor00.0100.0100.0100.01
Iris-Virginia00.0100.0100.0100.01
Breast Cancer00.0100.010.070.010.040.01
Four-class00.0100.010.060.010.040.01
Checkerboard00.0300.01--0.090.01
German Credit00.090.010.020.330.020.270.02
Waveform-001.030.180.170.580.130.30.11
Waveform-100.680.170.170.650.090.30.08
Waveform-200.940.170.160.730.110.340.07
Banana00.5900.170.080.040.050.05
Mushroom01.471.90.270.990.720.710.26
Phishing04.592.080.921.30.690.720.57
w8a0114.8517.4832.15228.838.1846.358.74
a9a0373.2444.582.8816.5260.668.8434.18
IJCNN-10224.296.0631.7310.0824.675.7719.66
Skin Segmentation020.230.33.543.223.792.31.48
Cod-RNA04425.550.7898.415.46582.7711.24245.47
