Capped Linex Metric Twin Support Vector Machine for Robust Classification

In this paper, a novel robust loss function is designed, namely, the capped linex loss function L_aε. We also establish several desirable properties of L_aε, such as boundedness, nonconvexity and robustness. Furthermore, by introducing L_aε, a new binary classification method is proposed, called the capped linex twin support vector machine (Linex-TSVM). Linex-TSVM not only reduces the influence of outliers on Linex-SVM, but also improves its classification performance and robustness. Moreover, two regularization terms are introduced to implement the structural risk minimization principle, which further reduces the effect of outliers on the model. Finally, a simple and efficient iterative algorithm is designed to solve the non-convex optimization problem of Linex-TSVM; we analyze its time complexity and prove that the model satisfies the Bayes rule. Experimental results on multiple datasets demonstrate that the proposed Linex-TSVM competes with existing methods in terms of robustness and feasibility.


Introduction
Data collection and reasonable processing are becoming increasingly crucial as modern computer technology advances. As an excellent machine learning tool, the support vector machine (SVM) [1][2][3][4] has been widely used in financial forecasting, bioinformatics, computer vision, image annotation, data mining and other fields in recent years. The main idea of SVM classification, based on statistical learning theory and optimization theory, is to construct a pair of parallel hyperplanes that maximize the minimum distance between the two classes of samples. Generally speaking, the optimal hyperplane is obtained by solving an optimization problem with inequality constraints. In order to avoid overfitting, scholars extended the SVM to the soft-margin support vector machine (C-SVM) [5], which introduces slack variables to relax the constraints and adds a penalty term on the slack variables to the objective function. However, the loss function adopted by C-SVM is generally the hinge loss, which makes it very sensitive to noise. In subsequent research, C-SVM was extended to the problem of function estimation, and a support vector interpretation of ridge regression [6] was proposed; unlike C-SVM, it uses equality rather than inequality constraints. Similarly, Suykens [7] considered equality constraints in the least-squares sense and proposed the least squares support vector machine (LSSVM). Unlike C-SVM, LSSVM makes full use of the information of all data points, penalizing them symmetrically with the L2 loss. In order to further improve classification performance, researchers need to impose heavier penalties on samples that are misclassified. For this reason, in the literature [8], Ma et al.
adopted the asymmetric linear exponential (Linex) loss to achieve this goal and proposed Linex-SVM to study the binary classification problem. Although SVM, C-SVM, LSSVM and Linex-SVM each have their own advantages, they all have to solve a large-scale quadratic programming problem (QPP), which requires a lot of training time and is not suitable for practical problems. To further improve the computing speed, Jayadeva et al. [9] proposed the twin support vector machine (TSVM) for pattern classification, based on the generalized eigenvalue proximal support vector machine (GEPSVM). Since TSVM solves two smaller QPPs instead of a single large one, it can in theory learn four times faster than a standard SVM. The main goal of TSVM is to find two non-parallel hyperplanes, each of which is as close as possible to the samples of its own class while being as far as possible from the other class. Therefore, TSVM is more suitable for the classification of large-scale data.
It is well known that distance metrics play a crucial role in many machine learning algorithms. Although the above algorithms perform well in pattern classification, it is worth noting that most of them adopt the L2-norm distance metric, whose squaring operation exaggerates the impact of outliers on model performance. To alleviate this effect on the robustness of the algorithm, the L1-norm distance metric, with its bounded derivative, has received extensive attention in many fields of machine learning in recent years [10][11][12]. More recently, researchers have turned to the capped L1-norm, which remedies the unboundedness of the L1-norm. In particular, Wang et al. [13] proposed a new robust TSVM (CTSVM) by applying the capped L1-norm.
Inspired by the successful applications of the capped L1-norm and the linex loss function [14][15][16][17], and noting that, to the best of our knowledge, no scholar has yet extended the Linex loss function to twin support vector machines, we establish a new robust twin support vector machine in this paper. The details and the main contributions of this work are as follows: (1) A novel robust loss function is designed, namely, the capped linex loss function L_aε.
(2) A novel robust twin support vector machine, namely the capped linex twin support vector machine (Linex-TSVM), is proposed.
(3) An efficient iterative algorithm is designed to solve Linex-TSVM, which is not only easy to implement but also theoretically guarantees the existence of a reasonable optimal solution. We analyze the computational complexity of the algorithm and prove that the model satisfies the Bayes rule.
(4) Extensive experiments conducted across multiple datasets demonstrate that the proposed Linex-TSVM is competitive with state-of-the-art methods in terms of robustness and feasibility. Therefore, Linex-TSVM is feasible for practical applications.
The rest of this article is organized as follows. In Section 2, we briefly review Linex-SVM and TSVM. In Section 3, we describe in detail the proposed capped linex loss function and Linex-TSVM, and give the relevant theoretical analysis. After the experimental results on multiple data sets are presented in Section 4, we conclude this paper in Section 5.

Related Work
In this section, we briefly review Linex-SVM and TSVM.

Linex-SVM
The Linex loss function is a typical asymmetric loss function, defined as

L(ξ) = e^{aξ} − aξ − 1,   (1)

where a ≠ 0 is a parameter. If a < 0, the left side of the Linex loss is steeper than the right side, and the opposite holds when a > 0; thus the sign of a determines the shape of the function, while the magnitude |a| determines the degree of asymmetry. When |a| is sufficiently small, the Linex loss reduces approximately to a rescaled squared loss. The Linex loss function is not only asymmetric but also convex and differentiable; thus, it is widely used in statistics.
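As a brief numerical illustration (our own sketch, not code from the paper), the asymmetry and the small-|a| behavior of the Linex loss can be checked directly; for |a| → 0, e^{aξ} − aξ − 1 ≈ a²ξ²/2, i.e., a rescaled squared loss:

```python
import numpy as np

def linex_loss(xi, a):
    """Linex loss L(xi) = e^(a*xi) - a*xi - 1, for a != 0."""
    return np.exp(a * xi) - a * xi - 1.0

# Asymmetry: for a > 0 the loss grows exponentially for positive errors
# and only linearly for negative errors of the same magnitude.
print(linex_loss(2.0, 1.0))    # e^2 - 3, about 4.389
print(linex_loss(-2.0, 1.0))   # e^-2 + 1, about 1.135

# For small |a| the loss behaves like the squared loss a^2 * xi^2 / 2.
print(linex_loss(0.5, 1e-4) / (1e-4 ** 2 / 2))  # close to 0.5^2 = 0.25
```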
For the dichotomy problem in n-dimensional Euclidean space, the training set can be expressed as

T = {(x_i, y_i) | i = 1, 2, ..., m},   (2)

where x_i ∈ R^n is the feature vector of sample i and y_i ∈ {−1, +1} is its label. For the training set (2), the Linex-SVM model can be written as the following convex optimization problem with equality constraints, obtained by introducing the Linex loss function:

min_{ω,b,ξ}  (1/2)‖ω‖² + C Σ_{i=1}^{m} (e^{aξ_i} − aξ_i − 1)
s.t.  y_i(ω^T x_i + b) = 1 − ξ_i,  i = 1, 2, ..., m,   (3)

where ξ = (ξ_1, ξ_2, ..., ξ_m)^T is a slack vector, C is a penalty parameter and a is the parameter of the Linex loss. Furthermore, the Nesterov accelerated gradient (NAG) method can be used to obtain the optimal solution (ω_1, b_1) and construct the decision function f(x) = sgn(ω_1^T x + b_1).

TSVM
The support vector machine is not suitable for dealing with large-scale data; to improve its practical applicability and further shorten the learning time, Jayadeva et al. proposed the twin support vector machine (TSVM) for pattern classification, based on the generalized eigenvalue proximal support vector machine (GEPSVM). The details are as follows. Consider the binary classification problem in n-dimensional Euclidean space R^n with training set T = {(x_i, y_i) | i = 1, 2, ..., m}, x_i ∈ R^n, y_i ∈ {−1, 1}. Let A ∈ R^{m_1×n} collect all positive samples and B ∈ R^{m_2×n} all negative samples. TSVM identifies two non-parallel hyperplanes in the feature space:

f_1(x) = ω_1^T x + b_1 = 0  and  f_2(x) = ω_2^T x + b_2 = 0.

The TSVM classifier is obtained by solving the following pair of QPPs:

min_{ω_1,b_1,ξ_2}  (1/2)‖Aω_1 + e_1 b_1‖² + C_1 e_2^T ξ_2
s.t.  −(Bω_1 + e_2 b_1) + ξ_2 ≥ e_2,  ξ_2 ≥ 0,

min_{ω_2,b_2,ξ_1}  (1/2)‖Bω_2 + e_2 b_2‖² + C_2 e_1^T ξ_1
s.t.  (Aω_2 + e_1 b_2) + ξ_1 ≥ e_1,  ξ_1 ≥ 0,

where C_1 ≥ 0, C_2 ≥ 0 are regularization parameters, e_1, e_2 are vectors of ones of appropriate dimensions, and ξ_1, ξ_2 are slack vectors. The dual problems of TSVM are then obtained by duality theory. Furthermore, by introducing the kernel method, TSVM can be extended to the nonlinear case, and a sample x is assigned to the positive or negative class according to its shortest distance to the two non-parallel hyperplanes. The decision function is

f(x) = arg min_{k=1,2} |ω_k^T x + b_k| / ‖ω_k‖.
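The decision rule above can be sketched as follows, assuming the two hyperplanes (ω_1, b_1) and (ω_2, b_2) have already been obtained (the function name is ours, for illustration only):

```python
import numpy as np

def tsvm_predict(x, w1, b1, w2, b2):
    """TSVM decision rule: return +1 if x is closer to the positive-class
    hyperplane w1.x + b1 = 0 than to the negative-class one, else -1."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)
    return 1 if d1 < d2 else -1

# Toy hyperplanes: x1 = 0 for the positive class, x1 = 4 for the negative.
w1, b1 = np.array([1.0, 0.0]), 0.0
w2, b2 = np.array([1.0, 0.0]), -4.0
print(tsvm_predict(np.array([1.0, 1.0]), w1, b1, w2, b2))   # 1
print(tsvm_predict(np.array([3.5, 0.0]), w1, b1, w2, b2))   # -1
```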

Capped Linex Loss Function
In this section, in order to minimize the influence of outliers on the classification results of the model, we propose a novel robust loss function, the capped linex loss function. The details are as follows. Definition 1. The capped linex loss function is defined as

L_aε(x_i) = min{ e^{a x_i} − a x_i − 1, ε },

where a ≠ 0 is a parameter: when a < 0, the left side of the loss function is steeper than the right side; when a > 0, the right side is steeper than the left side; see Figure 1. ε > 0 is a thresholding parameter, and x_i is a component of x. Figure 1 compares the capped linex loss function with the linex loss function. Clearly, the capped linex loss is bounded above: once the error is large enough, the loss no longer increases, even in the presence of outliers, which improves the robustness of the model.
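The boundedness claimed in Definition 1 is easy to verify numerically (a minimal sketch, not the authors' code):

```python
import numpy as np

def capped_linex_loss(xi, a, eps):
    """Capped linex loss L_ae(xi) = min(e^(a*xi) - a*xi - 1, eps):
    identical to the linex loss for small errors, never above eps."""
    return np.minimum(np.exp(a * xi) - a * xi - 1.0, eps)

# For moderate errors the cap is inactive; for an outlier it saturates at eps.
print(capped_linex_loss(0.5, 1.0, 5.0))   # linex value, below the cap
print(capped_linex_loss(50.0, 1.0, 5.0))  # 5.0, the cap eps
```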

Capped Linex Twin Support Vector Machine
The Linex-SVM model still leaves room for improvement: the linex loss is an unbounded function, and the loss keeps growing as the error increases. In practical applications, however, datasets are often contaminated by noise, and the unboundedness of the linex loss affects the overall performance of the model. In other words, Linex-SVM is a relatively weak method for training sets with outliers. In addition, almost all instances in Linex-SVM contribute to the final optimal hyperplane, which greatly reduces the training speed.
In order to improve the classification performance of Linex-SVM, we first replace the linex loss with the capped linex loss and introduce regularization terms to enhance robustness. Second, we generalize Linex-SVM to the twin support vector machine framework, transforming one large QPP into two smaller QPPs to improve the training speed. Based on these two points, a new twin support vector machine model, named the capped linex twin support vector machine (Linex-TSVM), is obtained:

min_{ω_1,b_1,ξ}  Σ_{i=1}^{m_1} min(|A_i ω_1 + b_1|, ε_1) + C_1 Σ_{j=1}^{m_2} min(e^{aξ_j} − aξ_j − 1, ε_2) + (C_3/2)(‖ω_1‖² + b_1²)
s.t.  −(Bω_1 + e_2 b_1) + ξ = e_2,   (12)

min_{ω_2,b_2,η}  Σ_{j=1}^{m_2} min(|B_j ω_2 + b_2|, ε_1) + C_2 Σ_{i=1}^{m_1} min(e^{aη_i} − aη_i − 1, ε_2) + (C_4/2)(‖ω_2‖² + b_2²)
s.t.  (Aω_2 + e_1 b_2) + η = e_1,   (13)

where C_1, C_2, C_3, C_4 ≥ 0, e_1 ∈ R^{m_1} and e_2 ∈ R^{m_2} are vectors of ones, and ξ, η are slack vectors.
In addition, we notice that it is difficult to solve problems (12) and (13) simply and quickly with traditional convex optimization methods. To simplify the original problems into approximate problems that are easier to solve, we use the re-weighting trick [12,[18][19][20]]. Taking problem (12) as an example, for the distance measurement term we introduce a diagonal matrix F with entries F_ii = 1/|f_i|, where f_i denotes the distance term of the i-th point, so that the capped L1 distance term can be rewritten as a weighted quadratic. Further, to simplify the function e^{aξ} − aξ − 1 into an easily solvable quadratic form ξ^T Qξ, we define Q as a diagonal matrix with i-th diagonal element

q_i = (e^{aξ_i} − aξ_i − 1) / ξ_i²,

and similarly define the diagonal matrices K and U for problem (13). Based on the above discussion and calculation, we obtain the re-weighted optimization problems (16) and (17), where e_1 ∈ R^{m_1} and e_2 ∈ R^{m_2} are vectors of ones and F, Q and K, U are the diagonal matrices defined above.

Remark 1. In the objective functions (16) and (17), we use the diagonal matrices F, Q and K, U, respectively, to reduce the influence of outliers and abnormal noise on the model. Specifically, if points of one class are far away from their own hyperplane, they can be treated as noise and suppressed. The model sets the elements of the diagonal matrices according to the distance from a data point x_i to the hyperplane: for F, if f_i is greater than ε_1, the corresponding entry is set to a small value ('Smallval'), which is almost equivalent to 0. Here, 'Smallval' is a small constant, set to 10^{−8} in the experiments.
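The re-weighting step can be sketched as follows; the weight formulas mirror our reading of F and Q (F_ii = 1/|f_i| and q_i = (e^{aξ_i} − aξ_i − 1)/ξ_i²), while the function names and the exact capping rule are assumptions rather than the authors' implementation:

```python
import numpy as np

SMALLVAL = 1e-8  # the 'Smallval' constant from Remark 1

def distance_weights(f, eps1):
    """Diagonal of F for the capped L1 distance term: F_ii = 1/|f_i|,
    with points whose distance exceeds the cap eps1 down-weighted to
    SMALLVAL so outliers barely influence the next iterate."""
    f = np.abs(f)
    w = 1.0 / np.maximum(f, SMALLVAL)
    w[f > eps1] = SMALLVAL
    return np.diag(w)

def linex_weights(xi, a, eps2):
    """Diagonal of Q so that xi^T Q xi reproduces the linex loss at the
    current iterate: q_i = (e^(a*xi_i) - a*xi_i - 1) / xi_i^2, capped
    to SMALLVAL once the loss exceeds eps2."""
    loss = np.exp(a * xi) - a * xi - 1.0
    q = loss / np.maximum(xi ** 2, SMALLVAL)
    q[loss > eps2] = SMALLVAL
    return np.diag(q)

# Sanity check: within the cap, xi^T Q xi equals the summed linex loss.
xi = np.array([0.5, 1.0])
Q = linex_weights(xi, 1.0, 10.0)
print(xi @ Q @ xi)                    # summed linex loss
print(np.sum(np.exp(xi) - xi - 1.0))  # same value
```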
The Lagrange function corresponding to the optimization problem (16) can be written in the standard way, where α is the vector of Lagrange multipliers; differentiating the Lagrange function with respect to ω_1 and b_1 yields the following Karush-Kuhn-Tucker (KKT) conditions.
From the KKT conditions, the dual problem of Equation (12) follows; similarly, the dual problem of Equation (13) is obtained, where β is the corresponding vector of Lagrange multipliers. Thus, we obtain the augmented vectors Z_1 and Z_2 and can assign a new data point x ∈ R^n to the positive or negative class.
Based on the above discussion, our algorithm is presented in Algorithm 1.

Bayes Rule
We now show that the model proposed in this paper satisfies the Bayes rule. Assume that the samples (x_i, y_i) are drawn independently from the same probability distribution φ defined on X × Y, where X ⊆ R^n and Y = {−1, 1}. Further, assume that the conditional distribution φ(y|x) is binomial, consisting of φ(−1|x) and φ(1|x). As is well known, the ultimate goal of a classification problem is to obtain a classifier C : X → Y with small error. The Bayes classifier [8] is defined as the classifier with the lowest probability of classification error among all classifiers.
For any loss function L, the expected risk of a classifier f : X → R can be defined as the expectation of L(y, f(x)) over (x, y) ∼ φ. Minimizing the expected risk over all measurable classification functions yields the Bayes risk. Based on these definitions of the Bayes rule, we obtain Theorem 1, which proves that the Bayes rule holds for the capped linex loss function. The details of the proof are as follows.

Computational Complexity Analysis
This part analyzes the computational complexity of Algorithm 1. As is well known, the computational complexity comprises the number of iterations and the cost per iteration. The cost of one iteration of Algorithm 1 has two parts: the time complexity of solving each QPP is at most O(m³/4), and that of the matrix inversion is at most O((n + 1)³). Therefore, the total time complexity of solving Linex-TSVM is about O(t · (m³/4 + (n + 1)³)), where t is the number of iterations; the experimental results in this paper show that t = 50 is sufficient. In general, the number of iterations of each algorithm is much smaller than the number of samples. Thus, Linex-TSVM has cubic time complexity in the number of samples.

Experimental Results and Discussions
In this section, we first set the experimental parameters in Section 4.1; in Sections 4.2 and 4.3, we report in detail the experimental results of Linex-TSVM with and without noise. Finally, we present results on several datasets in Section 4.4 to verify the convergence of the objective function.

Evaluation Criteria
In order to evaluate the classification performance of the proposed capped linex twin support vector machine more accurately, we compare it with other mature methods, including SVM, LSSVM, C-SVM, NPSVM, Linex-SVM and TBSVM. For these six support vector machines and Linex-TSVM, the iterative process is stopped when the difference between the objective values of two consecutive iterations is less than 0.001 or the number of iterations reaches 50. To measure the performance of all algorithms, the standard accuracy index (ACC) is used, defined as

ACC = (TP + TN) / (TP + TN + FP + FN),

where TP and TN denote the numbers of correctly classified positive and negative samples, respectively, and FN and FP denote the numbers of wrongly classified positive and negative samples, respectively. For a fair comparison, we use the quadratic programming (QP) toolbox of MATLAB to solve the QP problems in the related algorithms. The experimental environment is a Windows 10 machine with an Intel i7 processor (3.70 GHz) and 8 GB of RAM.
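For concreteness, the ACC index reads in code as (a trivial sketch):

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# 45 true positives and 40 true negatives out of 100 samples -> 85% accuracy.
print(accuracy(45, 40, 5, 10))  # 0.85
```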

Parameters Selection
The performance of a learning algorithm is very sensitive to the parameters involved, so it is necessary to record the parameters of each algorithm in detail, listed as follows.
SVM and LSSVM: the kernel parameter σ. C-SVM: the regularization parameter c and the kernel parameter σ.
The experimental parameters are selected by ten-fold cross-validation, and the test accuracy is the average of the 10 folds' results on each dataset.

Description of the Datasets
To verify the effectiveness of Linex-TSVM, we conduct numerical simulations on different datasets, including eight benchmark datasets from the UCI machine learning repository and two artificial datasets. The datasets are described as follows. Artificial datasets: artificial datasets (a) and (b) each contain 50 positive samples and 50 negative samples, represented by different markers, as shown in Figure 2. Because outliers have a certain impact on classification performance, they also serve as a standard for measuring the stability of an algorithm. Therefore, we introduce four outliers into each artificial dataset to evaluate robustness, two belonging to class +1 and two to class −1.
UCI datasets: Australian, Spect, Pima, German, Vote, CMC, Sonar and the large dataset Codrna. Details of the eight UCI datasets are given in Table 1. These UCI datasets are used to test the performance of our algorithm and the related algorithms. We divide each dataset into ten subsets, nine for training and one for testing, i.e., 10-fold cross-validation; the process is repeated ten times, and the average of the final results is taken as the criterion to measure model performance. At the same time, we normalize the eight datasets so that values lie within [0, 1], which avoids errors caused by different orders of magnitude and units.
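The normalization step described above can be sketched as a per-feature min-max scaling (our own minimal version):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (feature) of X into [0, 1]; constant columns
    are left at 0 to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
print(min_max_normalize(X))  # each column mapped onto [0, 0.5, 1]
```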

Experimental Results on the Employed Datasets without Outliers
Eight UCI datasets are selected, and the results are compared with those of the other six algorithms to verify the superior classification performance of the proposed algorithm.
All experimental results presented in Table 2 are based on optimal parameters. Here, "Time(s)" denotes the average runtime in seconds of each algorithm under its optimal parameters, and "ACC ± S" denotes the average classification accuracy plus or minus the standard deviation. Intuitively, it can be observed from Table 2 that the classification performance of the proposed capped linex loss twin support vector machine is better than that of the other six models. Except for the CMC dataset, Linex-TSVM achieves the best results on all datasets. At the same time, we also observe that the computing time of our model is not the shortest, because the model is more complex. LSSVM, which solves a system of linear equations, is faster; compared with SVM, it shortens the time while retaining accuracy, in line with the relevant theory. It is worth mentioning that the result of Linex-SVM is still good, which shows that the introduction of the linex loss function is meaningful.
Through the detailed analysis of the above experimental results, we can draw an objective and reasonable conclusion: using the capped linex loss function on the basis of TBSVM improves classification performance, and the introduction of the L1-norm distance metric further enhances the robustness of the model; thus, our model is an effective supervised algorithm even in the absence of outliers.

Experimental Results on Artificial Dataset with Outliers
It is well known that outliers tend to affect classification performance, so they also serve as a measure of an algorithm's stability. Therefore, we introduce outliers into artificial datasets (a) and (b), respectively, as displayed in Figure 2. In order to further verify the robustness of the capped linex loss function, we show the classification accuracy of our algorithm on artificial datasets (a) and (b) in Figure 3. From Figure 3, we observe that the proposed Linex-TSVM achieves higher accuracy in the presence of outliers: on artificial datasets (a) and (b), its classification accuracy is 68.06% and 91.97%, respectively, better than the other five algorithms; it deals with outliers well and has stronger robustness and better classification ability.
In summary, the capped L1-norm is robust to different types of outliers in the literature [21][22][23][24][25]. It can suppress the residual error of outliers in the experiments and help the model eliminate their influence. In particular, the capped linex loss function in this model increases the penalty on outliers. In a word, Linex-TSVM can effectively improve the robustness of TBSVM.

Experimental Results on UCI Dataset with Outliers
In order to verify that the model is also suitable for large-scale data with outliers, we add 10% and 25% noise to the eight UCI datasets, respectively. Noise is introduced because, in practical applications, data are diverse and inevitably contain different degrees of noise; introducing different noise levels allows us to verify that the model suits datasets of different fields and sizes. We find that after adding noise, the accuracy fluctuates to a certain extent, but the overall trend is a slow decline, which shows that large noise does affect the model, yet the model in this paper remains more stable. The results in Tables 3 and 4 show that after the introduction of outliers, all seven algorithms exhibit varying degrees of accuracy fluctuation with an overall downward trend, and the classification accuracy of Linex-TSVM is almost always better than that of the other algorithms. This shows that the model proposed in this paper has stronger robustness.
Specifically, in Tables 3 and 4, Linex-TSVM achieves the best accuracy on seven of the eight datasets, while the least squares support vector machine has the shortest computing time under different noise levels. It is worth noting that, compared with SVM, LSSVM, NPSVM, C-SVM and Linex-SVM, the capped linex loss function increases the penalty on outliers and therefore yields better classification accuracy; Linex-TSVM also outperforms Linex-SVM and TBSVM. Table 4 reports the experimental results on the UCI datasets with 25% Gaussian noise. Furthermore, in order to analyze the robustness of the algorithms under different noise more comprehensively, we carried out additional experiments on Australian, Spect, Pima, German, Vote, CMC, Sonar and Codrna, testing the performance of the algorithms under different noise levels. For an original dataset X, we perturb it as X + λX̃, where λ = q‖X‖_F/‖X̃‖_F and q is a noise factor; here, X̃ is a noise matrix whose elements are i.i.d. standard Gaussian variables, and q ∈ {0.1, 0.2, 0.3, 0.4}. From Figure 4, we can observe that under different noise factors, Linex-TSVM shows better classification accuracy and stability, while the other models are relatively more volatile.
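The noise-injection scheme X + λX̃ with λ = q‖X‖_F/‖X̃‖_F guarantees that the Frobenius norm of the perturbation is exactly the fraction q of ‖X‖_F; a minimal sketch (the function name is ours):

```python
import numpy as np

def add_gaussian_noise(X, q, seed=0):
    """Return X + lam * N, where N has i.i.d. standard Gaussian entries
    and lam = q * ||X||_F / ||N||_F, so ||perturbation||_F = q * ||X||_F."""
    rng = np.random.default_rng(seed)
    N = rng.standard_normal(X.shape)
    lam = q * np.linalg.norm(X) / np.linalg.norm(N)
    return X + lam * N

X = np.ones((20, 5))
Xq = add_gaussian_noise(X, 0.3)
print(np.linalg.norm(Xq - X) / np.linalg.norm(X))  # exactly 0.3 up to rounding
```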

Next, we use box plots to verify the superiority of the model from another point of view. In Figure 5, we select six datasets for analysis. The height of a box reflects the fluctuation of the data to a certain extent, i.e., the fluctuation of classification accuracy; the upper and lower whiskers represent the maximum and minimum of the group, and points outside the box can be understood as "outliers" in the data. We can thus directly observe that the classification accuracy of Linex-TSVM is higher than that of the other models.
To sum up, the capped linex loss twin support vector machine proposed in this paper is superior to the other six algorithms in terms of classification accuracy and robustness, indicating that Linex-TSVM is a robust learning algorithm for large-scale data classification with noise.

Analysis for the Convergence
In this section, we show the convergence curves of the proposed algorithm on four datasets to verify directly that its convergence speed reaches the desired level. The result is shown in Figure 6, where the horizontal axis represents the number of iterations and the vertical axis the value of the objective function. The iterative process stops when the difference between the objective values of two consecutive iterations is less than 0.001 or the number of iterations reaches 50.
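The stopping rule above can be captured by a generic outer loop (a sketch; `step` and `obj` stand in for one Linex-TSVM re-weighting update and its objective value, which are placeholders here):

```python
def iterate_until_converged(step, obj, x0, tol=1e-3, max_iter=50):
    """Iterate x <- step(x), recording obj(x); stop when the objective
    changes by less than tol or max_iter iterations are reached."""
    x, history = x0, [obj(x0)]
    for _ in range(max_iter):
        x = step(x)
        history.append(obj(x))
        if abs(history[-1] - history[-2]) < tol:
            break
    return x, history

# Toy monotonically decreasing objective: halving converges quickly.
x, hist = iterate_until_converged(lambda v: v / 2.0, lambda v: v, 8.0)
print(len(hist), hist[-1])  # converged well before 50 iterations
```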
Figure 6 shows that the objective value of Linex-TSVM decreases monotonically as the number of iterations increases, and the algorithm converges quickly, in about 5 iterations, i.e., within a limited number of iterations, yielding satisfactory results consistent with the previous theoretical analysis.

Statistical Analysis
In this section, the Friedman test, a statistical analysis method, is used to compare the differences among the seven algorithms involved. The Friedman test is a statistical test of the homogeneity of multiple (related) samples; it makes full use of all the information in the original data and has many advantages. The null hypothesis is that all algorithms perform identically. When the null hypothesis is rejected, we can perform the Nemenyi post-hoc test [26]. The average rankings and accuracies of the algorithms on the eight datasets are shown in Table 5.
Next, we take the eight UCI datasets with 10% Gaussian noise as examples to compare the seven algorithms. The Friedman statistic is computed as

χ²_F = (12N / (k(k + 1))) (Σ_{i=1}^{k} R_i² − k(k + 1)²/4),

where k is the number of algorithms and N is the number of UCI datasets; in our paper, k = 7 and N = 8, and R_i is the average rank of the i-th algorithm over the eight UCI datasets. In addition, based on the χ²_F distribution with k − 1 degrees of freedom, we can obtain

F_F = ((N − 1)χ²_F) / (N(k − 1) − χ²_F),

where F_F follows the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. In this paper, we choose α = 0.05 and obtain F_α(6, 42) = 2.34. Since F_F > F_α, we reject the null hypothesis. Intuitively, from Table 5 we observe that Linex-TSVM has better classification performance, which means that our algorithm is more effective.
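The Friedman statistic and its F correction can be computed directly from the average ranks (a sketch of the formulas above; the rank values below are illustrative, not those of Table 5):

```python
import numpy as np

def friedman_test(ranks, N):
    """Friedman chi-square over k algorithms with average ranks `ranks`
    on N datasets, plus the Iman-Davenport F_F correction."""
    k = len(ranks)
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(np.square(ranks)) - k * (k + 1) ** 2 / 4.0)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, F

# k = 7 algorithms, N = 8 datasets, as in the paper; illustrative ranks.
chi2, F = friedman_test(np.array([5.5, 5.0, 4.5, 4.0, 3.5, 3.5, 2.0]), N=8)
print(chi2, F)  # reject the null hypothesis if F > F_alpha(6, 42) = 2.34
```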
Next, through the Nemenyi post-hoc test, we can further compare the differences among the algorithms in this paper. If the difference between the average ranks of two algorithms is greater than the critical value, their performance is considered significantly different. Dividing the Studentized range statistic by √2 gives q_α = 2.95. Therefore, we calculate the critical difference (CD) by the following formulation:

CD = q_α √(k(k + 1) / (6N)).

Based on Figure 7, the performance of Linex-TSVM is significantly better than that of SVM, LSSVM, C-SVM, Linex-SVM and TBSVM, but the difference between Linex-SVM and TBSVM is not obvious, because it is smaller than the calculated CD value. Through the above analysis, the Linex-TSVM proposed in this paper has better performance.
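With k = 7 and N = 8 as above, the critical difference evaluates as follows (a quick check of the formula):

```python
import numpy as np

def nemenyi_cd(q_alpha, k, N):
    """Nemenyi critical difference CD = q_alpha * sqrt(k(k+1)/(6N))."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

print(nemenyi_cd(2.95, 7, 8))  # about 3.19 for q_alpha = 2.95
```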

Conclusions
Twin support vector machine classification has become a research hotspot, and twin support vector machine models based on different loss functions have been proposed, such as TPMSVM, TWSVM, SG-TSVM and so on. It remains pressing to propose loss functions with better performance within the support vector machine framework. The summary of this paper is as follows. First, this paper proposes the capped linex loss function, applies it to the twin support vector machine, and thereby obtains a new robust classification model, called the capped linex twin support vector machine. Compared with the Linex-SVM model proposed by Ma et al. [8], it has better classification performance.
Second, we give an efficient iterative algorithm to solve Linex-TSVM. Unlike SVM, which needs to solve one large QP problem, this algorithm solves a pair of small QP problems. Finally, we rigorously analyze the computational complexity of the algorithm and verify that Linex-TSVM satisfies the Bayes rule. Experimental results on multiple datasets demonstrate that our algorithm Linex-TSVM is more feasible and robust than other models in dealing with large-scale datasets with outliers, and the convergence of the algorithm is shown intuitively. In particular, in the absence of noise, the average accuracy of Linex-TSVM is 4.36%, 4.29%, 2.53%, 2.33%, 1.91% and 0.77% higher than that of SVM, LSSVM, C-SVM, NPSVM, Linex-SVM and TBSVM, respectively. Linex-TSVM is more robust and stable against outliers.
In future work, we should focus on finding better models to improve classification results on different data, shorten the computing time while ensuring accuracy, and extend the model of this paper to other settings, such as multi-class classification problems. We can further consider applying the models to practical hot issues, such as face recognition, fingerprint recognition, UAV scheduling and so on. Of course, developing better new algorithms based on our Linex-TSVM is also very important.