Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Support vector machine (SVM) is one of the most successful learning methods for solving classification problems. Despite its popularity, SVM has the serious drawback of being sensitive to outliers in the training samples. The penalty on misclassification is defined by a convex loss called the hinge loss, and the unboundedness of this convex loss causes the sensitivity to outliers. To deal with outliers, robust SVMs have been proposed, in which the convex loss is replaced with a non-convex bounded loss called the ramp loss. In this paper, we study the breakdown point of robust SVMs. The breakdown point is a robustness measure defined as the largest amount of contamination such that the estimated classifier still gives information about the non-contaminated data. The main contribution of this paper is an exact evaluation of the breakdown point of robust SVMs. For learning parameters such as the regularization parameter, we derive a simple formula that guarantees the robustness of the classifier. When the learning parameters are determined by a grid search with cross-validation, our formula can be used to reduce the number of candidate search points. Furthermore, the theoretical findings are confirmed in numerical experiments. We show that the statistical properties of robust SVMs are well explained by the theoretical analysis of the breakdown point.

Support vector machine (SVM) is a highly developed classification method that is widely used in real-world data analysis [

In practical situations, however, SVM has drawbacks. A notable feature of SVM is that the separating hyperplane is determined mainly by misclassified samples. Thus, severely misclassified samples have a large influence on the classifier, meaning that the standard SVM is extremely susceptible to outliers. In

In this paper, we provide a detailed analysis on the robustness of SVMs. In particular, we deal with a robust variant of kernel-based

Robust

In this paper, our purpose is a theoretical investigation of the statistical properties of robust SVMs. In particular, we derive the exact finite-sample breakdown point of robust

In the detailed analysis of the breakdown point, we reveal that the finite-sample breakdown point of robust

Some of the previous studies are related to ours. In particular, the breakdown point was used to assess the robustness of kernel-based estimators in [

The paper is organized as follows. In

First of all, we summarize the notation used throughout this paper. Let

Next, let us introduce the classification problem with an input space

In kernel-based

The representer theorem [
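As a concrete illustration, the finite kernel expansion guaranteed by the representer theorem, f(x) = Σ_i α_i k(x_i, x) + b, can be written in a few lines of code. The Gaussian kernel and the coefficients below are illustrative choices, not values taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_function(x, support_X, alpha, b, kernel=gaussian_kernel):
    """Representer-theorem form of the estimated decision function:
    f(x) = sum_i alpha_i * k(x_i, x) + b."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, support_X)) + b

# illustrative coefficients (not fitted to any data)
support_X = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha = np.array([0.5, -0.5])
value = decision_function(np.array([0.0, 0.0]), support_X, alpha, b=0.1)
```

The point of the representer theorem is exactly this finite sum: although the hypothesis space is infinite-dimensional, the optimal decision function is spanned by kernel functions centered at the training points.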

As pointed out in [

In the literature,

The proof is presented in Theorem 10 of [

In

The hinge loss in (

Here, we introduce robust
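To make the contrast concrete, the following sketch compares the hinge loss with a truncated (ramp) loss. The truncation level s = 2 is an illustrative choice; the paper's exact parameterization of the ramp loss may differ.

```python
import numpy as np

def hinge(margin):
    """Hinge loss max(0, 1 - m): convex but unbounded, so a single
    badly misclassified sample can incur an arbitrarily large penalty."""
    return np.maximum(0.0, 1.0 - margin)

def ramp(margin, s=2.0):
    """Ramp loss min(s, hinge(m)): the hinge loss truncated at level s,
    so the penalty of each sample is bounded."""
    return np.minimum(s, hinge(margin))

margins = np.array([2.0, 0.5, -10.0])  # correct, marginal, gross outlier
losses_hinge = hinge(margins)          # the outlier contributes 11.0
losses_ramp = ramp(margins)            # the outlier is capped at 2.0
```

The bounded penalty is what prevents a single outlier from dominating the empirical risk, at the price of losing convexity.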

The optimal solution,

The representer theorem ensures that the optimal decision function of (

In the subsequent sections, we develop a learning algorithm and investigate its robustness against outliers. In order to avoid technical difficulties in the theoretical analysis of robust

Now, let us show the equivalence of robust

Robust

It is hard to obtain a globally optimal solution of (

Let us show an expression of the objective function in (

For the negative margin

We derive the DCA using the decomposition (
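One standard DC (difference-of-convex) decomposition writes the ramp loss as the difference of two convex functions, the hinge loss minus a shifted hinge. The sketch below checks this identity numerically; the paper's decomposition may be parameterized differently.

```python
import numpy as np

def hinge(m):
    """Hinge loss max(0, 1 - m)."""
    return np.maximum(0.0, 1.0 - m)

def ramp(m, s=2.0):
    """Ramp loss: hinge loss truncated at level s."""
    return np.minimum(s, hinge(m))

def g(m, s=2.0):
    """Convex function subtracted from the hinge loss to obtain
    the ramp loss: g(m) = max(0, 1 - s - m)."""
    return np.maximum(0.0, 1.0 - s - m)

# numeric check of the DC identity ramp(m) = hinge(m) - g(m) on a grid
m = np.linspace(-5.0, 5.0, 1001)
assert np.allclose(ramp(m), hinge(m) - g(m))
```

Since both hinge and g are convex, the DCA can proceed by repeatedly linearizing the concave part −g at the current iterate and solving the resulting convex problem.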

Let us describe the learning algorithm for robust

The learning algorithm is presented in Algorithm 1. Given training samples

The numerical solution given by DCA is modified in Steps 7 and 8. Step 7 of Algorithm 1 is equivalent to solving (

Throughout the learning algorithm, the objective value decreases monotonically. Indeed, the DCA guarantees a monotone decrease of the objective value [

It is straightforward to guarantee the monotone decrease of the objective value even if

Compute the sort

Set

Set
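A simplified version of the DCA for a linear-kernel robust SVM without the bias term can be sketched as follows. The toy data, step sizes, and the inner subgradient solver are illustrative stand-ins for Algorithm 1, not the paper's implementation.

```python
import numpy as np

def ramp(m, s):
    """Ramp loss min(s, max(0, 1 - m))."""
    return np.minimum(s, np.maximum(0.0, 1.0 - m))

def objective(w, X, y, lam, s):
    """Regularized empirical risk with the ramp loss."""
    return 0.5 * lam * (w @ w) + ramp(y * (X @ w), s).sum()

def dca_robust_svm(X, y, lam=0.1, s=2.0, outer=10, inner=200, lr=0.01):
    """DCA outer loop: fix the linearization of the concave part at the
    current iterate, then approximately solve the convexified problem
    by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(outer):
        # samples where the concave part of the DC decomposition is active
        delta = (y * (X @ w) < 1.0 - s).astype(float)
        for _ in range(inner):
            m = y * (X @ w)
            grad = lam * w
            grad -= ((m < 1.0).astype(float) * y) @ X  # hinge subgradient
            grad += (delta * y) @ X                    # linearized concave part
            w -= lr * grad
    return w

# toy data: a linearly separable set plus one mislabeled outlier
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0], [2.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
w = dca_robust_svm(X, y)
```

For samples with margin below 1 − s, the hinge subgradient and the linearized term cancel, so grossly misclassified points stop pulling on the solution, which is how the ramp loss neutralizes outliers.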

Let

Let

Let

The partial dual problem of (

The last equality comes from the optimality condition with respect to the variables

Let us give a geometric interpretation of the above expression. For the training data

The coefficients

Therefore, the dual form of robust

Let us describe how to evaluate the robustness of learning algorithms. There are a number of robustness measures for evaluating the stability of estimators as discussed later in

The breakdown point indicates the largest amount of contamination such that the estimator still gives information about the non-contaminated data [
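The definition can be illustrated with the classical univariate example, which is not taken from the paper: the sample mean has finite-sample breakdown point 1/n, while the median withstands contamination of almost half the sample. The sketch below shows the effect of a single large outlier.

```python
import numpy as np

def contaminate(clean, k, outlier=1e6):
    """Replace k of the clean samples by a single large outlier value."""
    return np.concatenate([clean[k:], np.full(k, outlier)])

clean = np.arange(10.0)                          # 0, 1, ..., 9
mean_clean = np.mean(clean)                      # 4.5
mean_dirty = np.mean(contaminate(clean, 1))      # blown up by one outlier
median_dirty = np.median(contaminate(clean, 1))  # barely moves
```

One contaminated sample drives the mean arbitrarily far (breakdown at ratio 1/n), while the median stays within the range of the clean data; the paper carries out the analogous analysis for the classifier estimated by robust SVMs.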

For simplicity, the dependency of

The parameters of robust

To start with, let us derive a lower bound of the breakdown point for the optimal value of Problem (

In what follows, we assume

The proof of the above theorem is given in

The inequality (

When the contamination ratio in the training dataset is greater than the parameter

The proof is given in

Now, we will show the robustness of the bias term

The proof is given in

The robustness of

The proof is given in

Combining Theorems 1–4, we find that the breakdown point of robust

Bounded kernel: For

Unbounded kernel: For

Let us reconsider the breakdown point of learning methods.

Suppose that the function

For learning methods of the first type, the breakdown point indicates the number of outliers such that the estimator remains in a uniformly bounded region. This is meaningful information about the robustness of the learning method. In this case, a method with a larger breakdown point is regarded as more robust. As shown in Theorems 1 and 2, the robust

The second type implies that the hypothesis space of the learning method is bounded regardless of the dataset. The

In this paper, we focus on the breakdown point of learning algorithms of the first type. Then, the analysis based on the breakdown point suggests proper choices of hyperparameters

Robust statistical inference has been studied for a long time in mathematical statistics, and a number of robust estimators have been proposed for many kinds of statistical problems [

From the standpoint of the breakdown point, however, convex loss functions do not provide robust estimators, as shown in [

The theoretical analysis in

If an upper bound of the outlier ratio is known to be

Robust

When the upper bound

Given contaminated training data

The admissible region of
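In a grid search, such an admissible region simply filters the candidate parameter pairs before cross-validation is run. The sketch below uses a purely illustrative predicate (`lam >= s`) in place of the paper's actual breakdown-point inequality, which is not reproduced here.

```python
import itertools

def admissible_grid(lambdas, esses, admissible):
    """Keep only hyperparameter pairs satisfying a robustness condition;
    `admissible` is a placeholder predicate standing in for the paper's
    breakdown-point inequality."""
    return [(lam, s) for lam, s in itertools.product(lambdas, esses)
            if admissible(lam, s)]

# purely illustrative predicate; the actual condition involves the
# assumed outlier-ratio bound and the kernel
candidates = admissible_grid([0.1, 1.0, 10.0], [0.5, 2.0],
                             lambda lam, s: lam >= s)
```

Cross-validation is then carried out only over `candidates`, which is how the theoretical formula reduces the number of search points.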

The numerical experiments presented in

We conducted numerical experiments on synthetic and benchmark datasets to compare a number of SVMs. Algorithm 1 was used for robust

As has been shown in many studies including [

In binary classification problems, positive (resp. negative) samples were generated from a multivariate normal distribution with mean
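A minimal re-creation of this synthetic setting, with illustrative means and covariance rather than the paper's exact values, might look as follows.

```python
import numpy as np

def make_gaussian_classes(n_pos, n_neg, mu_pos, mu_neg, cov, seed=0):
    """Positive (resp. negative) samples drawn from multivariate normal
    distributions sharing a covariance matrix; the means and covariance
    here are illustrative, not the paper's settings."""
    rng = np.random.default_rng(seed)
    X_pos = rng.multivariate_normal(mu_pos, cov, size=n_pos)
    X_neg = rng.multivariate_normal(mu_neg, cov, size=n_neg)
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

X, y = make_gaussian_classes(50, 50, [1.0, 1.0], [-1.0, -1.0], np.eye(2))
```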

In the numerical experiments, 100 training datasets such that

We conducted numerical experiments to compare the computational cost of robust

Robust

We investigate the validity of Inequality (

In the bottom right panel, the test error becomes large when the inequality

As shown in

We compared the generalization ability of robust

The datasets are presented in

We randomly split the dataset into training and test sets. To evaluate the robustness, the training data were contaminated by outliers. More precisely, we randomly chose positive labeled samples in the training data and changed their labels to negative; i.e., we added outliers by flipping the labels. After that, robust
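The label-flipping contamination described above can be sketched as follows; taking the flipping ratio relative to the number of positive samples is an assumption about the exact protocol.

```python
import numpy as np

def flip_positive_labels(y, ratio, seed=0):
    """Contaminate labels by choosing a fraction of the positive samples
    at random and flipping them to negative, as in the experiment."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    pos = np.flatnonzero(y == 1)
    k = int(round(ratio * len(pos)))
    flipped = rng.choice(pos, size=k, replace=False)
    y[flipped] = -1
    return y

y = np.array([1, 1, 1, 1, -1, -1, -1, -1])
y_noisy = flip_positive_labels(y, 0.5)  # half of the positives flipped
```

Flipped labels of this kind act as outliers for the hinge loss, since the corresponding samples lie deep on the wrong side of the decision boundary.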

The results are presented in

We have investigated the breakdown point of robust variants of SVMs. The theoretical analysis provides inequalities of learning parameters,

In our paper, the dual representation of robust SVMs is applied to the calculation of the breakdown point. Theoretical analysis using the dual representation can be a powerful tool for the detailed analysis of other learning algorithms.

On the theoretical side, it is interesting to establish the relationship between robustness, say the breakdown point, and the convergence speed of learning algorithms, as has been established for parametric inference in mathematical statistics [

This work was supported by JSPS KAKENHI, Grant Numbers 16K00044 and 15K00031.

Takafumi Kanamori and Akiko Takeda developed the theoretical analysis; Takafumi Kanamori and Shuhei Fujiwara performed the experiments; Takafumi Kanamori and Akiko Takeda wrote the paper. All authors have read and approved the final manuscript.

The authors declare no conflict of interest.

The proof is decomposed into two lemmas. Lemma A1 shows that Condition (i) is sufficient for Condition (ii), and Lemma A2 shows that Condition (ii) does not hold if Inequality (

We will show that

We define two points in

Then, we have:

Because

Now, let us prove the inequality,

The above argument leads to:

The upper bound does not depend on the contaminated dataset

We will use the same notation as in the proof of Lemma A1. Without loss of generality, we can assume

Index sets

For a rational number

Let us define

The non-contaminated dataset is denoted as

The estimated bias term

For the non-contaminated data

The existence of

For any non-contaminated sample

On the basis of the argument above, we can prove two propositions:

The function

The function

In addition, for any

Let us prove the first statement. If

for all

there exists an index

For a fixed

For a fixed

For negative (resp. positive) samples, the negative margin is expressed as

In the same manner, we can prove the second statement by using the fact that

Then, we have:

In summary, we obtain:

We will use the same notation as in the proof of Theorem 3 in

Under Inequality (

Distribution of negative margins.

Plot of contaminated dataset of size

Plots of maximum norms and worst-case test errors. The top (bottom) panels show the results for a Gaussian (linear) kernel. Red points indicate the top 50 percent of values; the asterisks (∗) mark points that violate the inequality

Number of times that the numerical solution of difference of convex functions algorithm (DCA) satisfies

Each group of three count columns reports the number of successes for 1, 5, and 10 initial points (#Initial Points), respectively.

Dim. | Err. (%) | 1 | 5 | 10 | 1 | 5 | 10 | 1 | 5 | 10
---|---|---|---|---|---|---|---|---|---|---
2 | 7.9 | 87 | 96 | 97 | 90 | 99 | 99 | 93 | 99 | 99
5 | 1.3 | 98 | 99 | 100 | 100 | 100 | 100 | 99 | 100 | 100
10 | 0.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
2 | 26.4 | 78 | 84 | 88 | 76 | 85 | 90 | 75 | 85 | 86
5 | 24.0 | 46 | 84 | 90 | 53 | 83 | 90 | 66 | 90 | 90
10 | 32.7 | 16 | 59 | 73 | 31 | 72 | 77 | 46 | 85 | 92

Computation time (Time) and ratio of support vectors (SV Ratio) of robust

Linear Kernel | Sonar: Time | SV Ratio | BreastCancer: Time | SV Ratio | Spam: Time | SV Ratio
---|---|---|---|---|---|---
 | 1.10 (0.22) | 0.79 (0.14) | 1.02 (0.17) | 0.21 (0.11) | 13.38 (3.90) | 0.27 (0.22)
 | 0.87 (0.15) | 0.75 (0.20) | 0.73 (0.13) | 0.18 (0.06) | 11.29 (2.41) | 0.64 (0.27)
 | 1.17 (0.19) | 0.57 (0.12) | 0.80 (0.13) | 0.22 (0.07) | 9.65 (2.13) | 0.24 (0.04)
 | 0.81 (0.09) | 0.58 (0.16) | 0.63 (0.07) | 0.28 (0.05) | 8.64 (2.12) | 0.36 (0.21)
 | 1.11 (0.18) | 0.49 (0.10) | 0.83 (0.14) | 0.30 (0.03) | 8.65 (1.25) | 0.30 (0.02)
 | 0.90 (0.15) | 0.62 (0.16) | 0.76 (0.12) | 0.36 (0.02) | 8.72 (1.77) | 0.38 (0.04)
 | 0.12 (0.02) | 0.00 (0.00) | 0.15 (0.02) | 0.00 (0.00) | 1.62 (0.08) | 0.00 (0.00)
1 | 0.61 (0.07) | 0.45 (0.08) | 0.60 (0.16) | 0.04 (0.01) | 7.38 (2.36) | 0.08 (0.01)
 | 1.02 (0.11) | 0.54 (0.13) | 0.68 (0.18) | 0.03 (0.01) | 10.16 (3.31) | 0.11 (0.16)
 | 1.07 (0.13) | 0.47 (0.09) | 0.63 (0.17) | 0.05 (0.06) | 20.98 (5.95) | 0.30 (0.32)

Test error and standard deviation of robust

Data | Outlier | Linear Kernel | | | Gaussian Kernel | |
---|---|---|---|---|---|---|---
Sonar: | 0% | *0.258 (0.032) | 0.270 (0.038) | *0.256 (0.051) | *0.179 (0.038) | **0.188 (0.043) | *0.181 (0.039)
 | 5% | *0.256 (0.039) | 0.273 (0.047) | *0.258 (0.046) | *0.225 (0.042) | **0.229 (0.051) | *0.224 (0.061)
 | 10% | *0.297 (0.060) | 0.306 (0.067) | *0.314 (0.060) | *0.249 (0.059) | **0.230 (0.046) | *0.259 (0.062)
 | 15% | *0.329 (0.061) | 0.339 (0.064) | *0.345 (0.062) | *0.280 (0.053) | **0.280 (0.050) | *0.294 (0.064)
BreastCancer: | 0% | 0.033 (0.010) | *0.035 (0.008) | *0.033 (0.006) | **0.032 (0.008) | *0.035 (0.012) | 0.033 (0.010)
 | 5% | 0.034 (0.009) | *0.034 (0.010) | *0.043 (0.015) | **0.032 (0.005) | *0.033 (0.007) | 0.033 (0.006)
 | 10% | 0.055 (0.015) | *0.051 (0.026) | *0.076 (0.036) | **0.035 (0.008) | *0.043 (0.025) | 0.038 (0.008)
 | 15% | 0.136 (0.058) | *0.120 (0.050) | *0.148 (0.058) | **0.160 (0.083) | *0.145 (0.070) | 0.150 (0.110)
PimaIndiansDiabetes: | 0% | **0.237 (0.018) | *0.232 (0.014) | 0.246 (0.018) | *0.238 (0.021) | *0.240 (0.019) | 0.243 (0.022)
 | 5% | **0.239 (0.019) | *0.237 (0.016) | 0.269 (0.036) | *0.264 (0.025) | *0.267 (0.024) | 0.273 (0.024)
 | 10% | **0.280 (0.046) | *0.299 (0.042) | 0.330 (0.030) | *0.302 (0.039) | *0.293 (0.036) | 0.315 (0.038)
 | 15% | **0.338 (0.042) | *0.349 (0.030) | 0.351 (0.026) | *0.344 (0.028) | *0.344 (0.031) | 0.353 (0.016)
spam: | 0% | **0.083 (0.005) | 0.088 (0.006) | *0.083 (0.005) | **0.081 (0.005) | 0.086 (0.006) | *0.081 (0.006)
 | 5% | **0.094 (0.008) | 0.104 (0.013) | *0.109 (0.010) | **0.095 (0.008) | 0.097 (0.009) | *0.095 (0.008)
 | 10% | **0.129 (0.022) | 0.152 (0.020) | *0.166 (0.067) | **0.129 (0.015) | 0.133 (0.017) | 0.141 (0.030)
 | 15% | **0.201 (0.029) | 0.240 (0.030) | *0.256 (0.091) | **0.206 (0.018) | 0.223 (0.030) | 0.240 (0.055)
Satellite: | 0% | **0.097 (0.004) | *0.096 (0.003) | **0.094 (0.003) | *0.069 (0.031) | 0.067 (0.004) | **0.063 (0.004)
 | 5% | **0.101 (0.003) | *0.100 (0.005) | **0.100 (0.004) | *0.072 (0.015) | 0.078 (0.007) | **0.078 (0.043)
 | 10% | **0.148 (0.020) | *0.161 (0.026) | **0.161 (0.019) | *0.117 (0.034) | 0.126 (0.040) | **0.137 (0.027)