Open Access
Entropy 2017, 19(2), 83; doi:10.3390/e19020083
Article
Breakdown Point of Robust Support Vector Machines
1 Department of Computer Science and Mathematical Informatics, Nagoya University, Nagoya 464-8601, Japan
2 TOPGATE Co. Ltd., Bunkyo-ku, Tokyo 113-0033, Japan
3 Institute of Statistical Mathematics, Tokyo 190-8562, Japan
4 RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan
* Correspondence: Tel.: +81-52-789-4598
Academic Editor: Kevin H. Knuth
Received: 15 January 2017 / Accepted: 16 February 2017 / Published: 21 February 2017
Abstract: Support vector machine (SVM) is one of the most successful learning methods for solving classification problems. Despite its popularity, SVM has the serious drawback that it is sensitive to outliers in training samples. The penalty on misclassification is defined by a convex loss called the hinge loss, and the unboundedness of the convex loss causes the sensitivity to outliers. To deal with outliers, robust SVMs have been proposed by replacing the convex loss with a nonconvex bounded loss called the ramp loss. In this paper, we study the breakdown point of robust SVMs. The breakdown point is a robustness measure defined as the largest amount of contamination such that the estimated classifier still gives information about the noncontaminated data. The main contribution of this paper is an exact evaluation of the breakdown point of robust SVMs. For learning parameters such as the regularization parameter, we derive a simple formula that guarantees the robustness of the classifier. When the learning parameters are determined by a grid search with cross-validation, our formula works to reduce the number of candidate search points. Furthermore, the theoretical findings are confirmed in numerical experiments. We show that the statistical properties of robust SVMs are well explained by a theoretical analysis of the breakdown point.
Keywords: support vector machine; breakdown point; outlier; kernel function

1. Introduction
1.1. Background
Support vector machine (SVM) is a highly developed classification method that is widely used in real-world data analysis [1,2]. The most popular implementation is called C-SVM, which uses the maximum margin criterion with a penalty for misclassification. The positive parameter C tunes the balance between the maximum margin and the penalty, and the resulting classification problem can be formulated as a convex quadratic problem based on training data. A separating hyperplane for classification is obtained from the optimal solution of the problem. Furthermore, complex nonlinear classifiers are obtained by using the reproducing kernel Hilbert space (RKHS) as a statistical model of the classifiers [3]. There are many variants of SVM for solving binary classification problems, such as ν-SVM, Eν-SVM and least-squares SVM [4,5,6]. Moreover, the generalization ability of SVM has been analyzed in many studies [7,8,9].
In practical situations, however, SVM has drawbacks. A remarkable feature of SVM is that the separating hyperplane is determined mainly from misclassified samples. Thus, the most misclassified samples significantly affect the classifier, meaning that the standard SVM is extremely susceptible to outliers. In C-SVM, the penalties of sample points are measured in terms of the hinge loss, which is a convex surrogate of the 0-1 loss for misclassification. The convexity of the hinge loss causes SVM to be unstable in the presence of outliers, since a convex loss is unbounded and puts an extremely large penalty on outliers. One way to remedy the instability is to replace the convex loss with a nonconvex bounded loss that suppresses outliers. Loss clipping is a simple method to obtain a bounded loss from a convex loss [10,11]. For example, clipping the hinge loss leads to the ramp loss [12,13], which is a loss function used in robust SVMs. Yu et al. [11,14] showed that clipping a convex loss yields a nonconvex bounded loss and proposed a convex relaxation of the resulting nonconvex optimization problem to obtain a computationally efficient learning algorithm. The SVM using the ramp loss is regarded as a robust variant of $L_1$-SVM. Recently, Feng et al. [15] also proposed a robust variant of $L_2$-SVM.
1.2. Our Contribution
In this paper, we provide a detailed analysis of the robustness of SVMs. In particular, we deal with a robust variant of kernel-based ν-SVM. The standard ν-SVM [5] has a regularization parameter ν, and it is equivalent to C-SVM; i.e., both methods provide the same classifier for the same training data if the regularization parameters, ν and C, are properly tuned. We generate a robust variant of ν-SVM, called robust $(\nu ,\mu )$-SVM, by clipping the loss function of ν-SVM and introducing another learning parameter $\mu \in [0,1)$. The parameter μ denotes the ratio of samples to be removed from the training dataset as outliers. When the ratio of outliers in the training dataset is bounded above by μ, robust $(\nu ,\mu )$-SVM is expected to provide a robust classifier.
Robust $(\nu ,\mu )$-SVM is closely related to other robust SVMs, such as CVaR-$({\alpha}_{L},{\alpha}_{U})$-SVM [16], the robust outlier detection (ROD) algorithm [17] and extended robust SVM (ER-SVM) [18,19]. In particular, it is equivalent to CVaR-$({\alpha}_{L},{\alpha}_{U})$-SVM. In this paper, the learning algorithm we consider is referred to as robust $(\nu ,\mu )$-SVM to emphasize that it is a robust variant of ν-SVM. On the other hand, ROD is to robust $(\nu ,\mu )$-SVM what C-SVM is to ν-SVM. ER-SVM is another robust extension of ν-SVM, and it includes robust $(\nu ,\mu )$-SVM as a special case. Both ROD and ER-SVM have a parameter corresponding to μ, i.e., the ratio of outliers to be removed from the training samples. The above learning algorithms share almost the same learning model. The main concern of the past studies, however, was to develop computationally efficient learning algorithms and to confirm the robustness property in numerical experiments.
In this paper, our purpose is a theoretical investigation of the statistical properties of robust SVMs. In particular, we derive the exact finite-sample breakdown point of robust $(\nu ,\mu )$-SVM. The finite-sample breakdown point indicates the largest amount of contamination such that the estimator still gives information about the noncontaminated data [20] (Chapter 3.2). In order to investigate the breakdown point, we show that the robustness of the learning method is closely related to the dual representation of the optimization problem in the learning algorithm. Indeed, the dual representation provides an intuitive picture of how each sample affects the estimated classifier. Based on this intuition, we calculate the exact breakdown point. This is a new approach to the theoretical analysis of robust statistics.
In the detailed analysis of the breakdown point, we reveal that the finite-sample breakdown point of robust $(\nu ,\mu )$-SVM is equal to μ if ν and μ satisfy a simple condition. Conversely, we prove that the finite-sample breakdown point is strictly less than μ if the condition is violated. An important point is that our findings provide a way to specify a region of the learning parameters $(\nu ,\mu )$ such that robust $(\nu ,\mu )$-SVM has the desired robustness property. As a result, one can reduce the number of candidate learning parameters $(\nu ,\mu )$ when a grid search over the learning parameters is conducted with cross-validation.
Some previous studies are related to ours. In particular, the breakdown point was used to assess the robustness of kernel-based estimators in [14]. In that paper, the influence of a single outlier is considered for a general class of robust estimators in regression problems. In contrast, we focus on a variant of SVM and provide a detailed analysis of the robustness property based on the breakdown point. Our analysis takes into account an arbitrary number of outliers.
The paper is organized as follows. In Section 2, we introduce the problem setup and briefly review standard SVM learning algorithms. Section 3 introduces the robust variant of ν-SVM. We propose a modified learning algorithm of robust $(\nu ,\mu )$-SVM in order to guarantee the robustness property of local optimal solutions. We show that the dual representation of robust $(\nu ,\mu )$-SVM has an intuitive interpretation that is of great help in evaluating the breakdown point. In Section 4, we introduce the finite-sample breakdown point as a measure of robustness. Then, we evaluate the breakdown point of robust $(\nu ,\mu )$-SVM. The robustness of other SVMs is also considered. In Section 5, we discuss a method of tuning the learning parameters ν and μ on the basis of the robustness analysis in Section 4. Section 6 examines the generalization performance of robust $(\nu ,\mu )$-SVM via numerical experiments. The conclusion is in Section 7. Detailed proofs of the theoretical results are presented in the Appendix.
2. Brief Introduction to Learning Algorithms
First of all, we summarize the notation used throughout this paper. Let $\mathbb{N}$ be the set of positive integers, and let $\left[m\right]$ for $m\in \mathbb{N}$ denote the finite subset of $\mathbb{N}$ defined as $\{1,\dots ,m\}$. The set of all real numbers is denoted as $\mathbb{R}$. The function ${\left[z\right]}_{+}$ is defined as $\max \{z,0\}$ for $z\in \mathbb{R}$. For a finite set A, the size of A is expressed as $|A|$. For a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, the norm on $\mathcal{H}$ is denoted as $\|\cdot \|_{\mathcal{H}}$. See [3] for a description of RKHSs.
Next, let us introduce the classification problem with an input space $\mathcal{X}$ and binary output labels $\{+1,-1\}$. Given i.i.d. training samples $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}\subset \mathcal{X}\times \{+1,-1\}$ drawn from a probability distribution over $\mathcal{X}\times \{+1,-1\}$, a learning algorithm produces a decision function $g:\mathcal{X}\to \mathbb{R}$ such that its sign predicts the output labels for input points in test samples. The decision function $g\left(x\right)$ predicts the correct label on the sample $(x,y)$ if and only if the inequality $yg\left(x\right)>0$ holds. The product $yg\left(x\right)$ is called the margin of the sample $(x,y)$ for the decision function g [21]. To make an accurate decision function, the margins on the training dataset should take large positive values.
In kernel-based ν-SVM [5], an RKHS $\mathcal{H}$ endowed with a kernel function $k:{\mathcal{X}}^{2}\to \mathbb{R}$ is used to estimate the decision function $g\left(x\right)=f\left(x\right)+b$, where $f\in \mathcal{H}$ and $b\in \mathbb{R}$. The misclassification penalty is measured by the hinge loss. More precisely, ν-SVM produces a decision function $f\left(x\right)+b$ as the optimal solution of the convex problem,
where $[\rho -{y}_{i}(f\left({x}_{i}\right)+b)]_{+}$ is the hinge loss of the margin with the threshold ρ. The second term $-\nu \rho $ is the penalty for the threshold ρ. The parameter ν in the interval $(0,1)$ is the regularization parameter. Usually, the range of ν that yields a meaningful classifier is narrower than the interval $(0,1)$, as shown in [5]. The first term in (1) is a regularization term to avoid overfitting to the training data. A large positive margin is preferable for each training sample. The optimal ρ of ν-SVM is nonnegative. Indeed, the optimal solution $f\in \mathcal{H},\ b,\rho \in \mathbb{R}$ satisfies:
$$\begin{array}{c}{\displaystyle \min_{f,b,\rho}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}-\nu \rho +\frac{1}{m}\sum _{i=1}^{m}\left[\rho -{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+}}\\ \mathrm{subject\ to}\ \ f\in \mathcal{H},\ b,\rho \in \mathbb{R},\end{array}$$
$$\begin{array}{cc}-\nu \rho & \le \frac{1}{2}\|f\|_{\mathcal{H}}^{2}-\nu \rho +\frac{1}{m}\sum _{i=1}^{m}\left[\rho -{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+}\\ & \le \frac{1}{2}\|0\|_{\mathcal{H}}^{2}-\nu \cdot 0+\frac{1}{m}\sum _{i=1}^{m}\left[0-{y}_{i}(0+0)\right]_{+}=0.\end{array}$$
The representer theorem [22,23] indicates that the optimal decision function of (1) is of the form,
for ${\alpha}_{j}\in \mathbb{R}$. Thanks to this theorem, even when $\mathcal{H}$ is an infinite-dimensional space, the above optimization problem can be reduced to a finite-dimensional convex quadratic problem. This is the great advantage of using an RKHS for nonparametric statistical inference [5]. The input point ${x}_{j}$ with a nonzero coefficient ${\alpha}_{j}$ is called a support vector. A remarkable property of ν-SVM is that the regularization parameter ν provides a lower bound on the fraction of support vectors.
$$\begin{array}{c}\hfill g\left(x\right)=\sum _{j=1}^{m}{\alpha}_{j}k(x,{x}_{j})+b\end{array}$$
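As a concrete illustration of this kernel expansion, the sketch below evaluates $g(x)=\sum_j \alpha_j k(x,x_j)+b$ with a Gaussian RBF kernel; the kernel choice and the coefficient values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def decision_function(x, X_train, alpha, b, gamma=1.0):
    # g(x) = sum_j alpha_j k(x, x_j) + b, the form guaranteed by the
    # representer theorem; points x_j with alpha_j != 0 are support vectors
    return sum(a * rbf_kernel(x, xj, gamma) for a, xj in zip(alpha, X_train)) + b

# two hypothetical support vectors with hypothetical coefficients
X_train = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alpha = [0.5, -0.5]
b = 0.1
print(decision_function(np.array([0.0, 0.0]), X_train, alpha, b))
```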
As pointed out in [24], ν-SVM is closely related to a financial risk measure called conditional value at risk (CVaR) [25]. Suppose that $\nu m\in \mathbb{N}$ holds for a parameter $\nu \in (0,1)$. Then, the CVaR of samples ${r}_{1},\dots ,{r}_{m}\in \mathbb{R}$ at level ν is defined as the average of its ν-tail, i.e., $\frac{1}{\nu m}{\sum}_{i=1}^{\nu m}{r}_{\sigma \left(i\right)}$, where σ is a permutation on $\left[m\right]$ such that ${r}_{\sigma \left(1\right)}\ge \cdots \ge {r}_{\sigma \left(m\right)}$ holds. The definition of CVaR for general random variables is presented in [25].
In the literature, ${r}_{i}$ is defined as the negative margin ${r}_{i}=-{y}_{i}g\left({x}_{i}\right)$. For a regularization parameter ν satisfying $\nu m\in \mathbb{N}$ and a fixed decision function $g\left(x\right)=f\left(x\right)+b$, the objective function in (1) is expressed as:
$$\min_{\rho \in \mathbb{R}}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}-\nu \rho +\frac{1}{m}\sum _{i=1}^{m}\left[\rho -{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+}=\frac{1}{2}\|f\|_{\mathcal{H}}^{2}+\frac{1}{m}\sum _{i=1}^{\nu m}{r}_{\sigma \left(i\right)}.$$
The proof is presented in Theorem 10 of [25]. Hence, ν-SVM yields a decision function that minimizes the sum of the regularization term and the CVaR of the negative margins at level ν.
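A minimal numerical sketch of this sample CVaR, assuming (as in the text) that $\nu m$ is a positive integer; the sample values below are illustrative:

```python
import numpy as np

def cvar(r, nu):
    # sample CVaR at level nu: the average of the nu*m largest of r_1, ..., r_m
    # (assumes nu * m is a positive integer, as in the text)
    m = len(r)
    k = int(round(nu * m))
    r_desc = np.sort(r)[::-1]          # descending: r_sigma(1) >= ... >= r_sigma(m)
    return r_desc[:k].mean()

r = np.array([-0.9, 1.5, 0.2, -0.3, 0.8])   # hypothetical negative margins
print(cvar(r, nu=0.4))                       # averages the 2 largest values
```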
In C-SVM [1], the decision function is obtained by solving:
in which the hinge loss $[1-{y}_{i}(f\left({x}_{i}\right)+b)]_{+}$ with the fixed threshold $\rho =1$ is used. A positive regularization parameter $C>0$ is used instead of ν. For any training dataset, ν-SVM and C-SVM can be made to provide the same decision function by appropriately tuning ν and C. In this paper, we focus on ν-SVM and its robust variants rather than C-SVM. The parameter ν has the explicit meaning shown above, and this interpretation will be significant when we derive the robustness property of our method.
$$\begin{array}{c}{\displaystyle \min_{f,b}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}+C\sum _{i=1}^{m}\left[1-{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+}}\\ \mathrm{subject\ to}\ \ f\in \mathcal{H},\ b\in \mathbb{R},\end{array}$$
The hinge loss in (4) is replaced with the so-called ramp loss:
in the robust C-SVM proposed in [10,13,17]. By truncating the hinge loss, the influence of outliers is suppressed, and the estimated classifier is expected to be robust against outliers in the training data.
$$\min \left\{1,\ \left[1-{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+}\right\}$$
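The clipping can be sketched as follows; the function names are ours, and the margin values are illustrative:

```python
def hinge_loss(margin):
    # hinge loss [1 - y g(x)]_+ as a function of the margin y g(x);
    # unbounded as the margin goes to -infinity
    return max(0.0, 1.0 - margin)

def ramp_loss(margin):
    # ramp loss min{1, [1 - y g(x)]_+}: the hinge loss clipped at 1,
    # so even an extreme outlier contributes a penalty of at most 1
    return min(1.0, hinge_loss(margin))

for m in (2.0, 0.5, -3.0):
    print(m, hinge_loss(m), ramp_loss(m))
```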
3. Robust Variants of SVM
3.1. Outlier Indicators for Robust Learning Methods
Here, we introduce robust $(\nu ,\mu )$-SVM, which is a robust variant of ν-SVM. To remove the influence of outliers, an outlier indicator, ${\eta}_{i}\in [0,1],\ i\in \left[m\right]$, is assigned to each training sample, where ${\eta}_{i}=0$ is intended to indicate that the sample $({x}_{i},{y}_{i})$ is an outlier. The same idea is used in ROD [17]. Assume that the ratio of outliers is less than or equal to μ. For ν and μ such that $0\le \mu <\nu <1$, robust $(\nu ,\mu )$-SVM can be formalized using an RKHS $\mathcal{H}$ as follows:
$$\begin{array}{c}{\displaystyle \min_{f,b,\rho ,\eta}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}-(\nu -\mu )\rho +\frac{1}{m}\sum _{i=1}^{m}{\eta}_{i}\left[\rho -{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+},}\\ \mathrm{subject\ to}\ \ f\in \mathcal{H},\ b,\rho \in \mathbb{R},\\ \phantom{\mathrm{subject\ to}\ \ }\eta =({\eta}_{1},\dots ,{\eta}_{m})\in {[0,1]}^{m},\ \sum _{i=1}^{m}{\eta}_{i}\ge m(1-\mu ).\end{array}$$
The optimal solution, $f\in \mathcal{H}$ and $b\in \mathbb{R}$, provides the decision function $g\left(x\right)=f\left(x\right)+b$ for classification. The optimal ρ is nonnegative, as in ν-SVM. The influence of samples with large negative margins can be removed by setting ${\eta}_{i}$ to zero.
The representer theorem ensures that the optimal decision function of (5) is represented by (2). Suppose that the decision function $g\left(x\right)=f\left(x\right)+b$ of the form (2), the threshold ρ and the outlier indicator η satisfy the KKT (Karush–Kuhn–Tucker) condition [26] (Chapter 5) of (5). As in the case of the standard ν-SVM, the number of support vectors in $f\left(x\right)$ is bounded below by $(\nu -\mu )m$. In addition, the fraction of margin errors on the training samples with ${\eta}_{i}=1$ is bounded above by $\nu -\mu $; i.e.,
holds.
$$\frac{1}{m}\left|\left\{i\in \left[m\right]:{\eta}_{i}=1,\ {y}_{i}g\left({x}_{i}\right)<\rho \right\}\right|\le \nu -\mu $$
In the subsequent sections, we develop a learning algorithm and investigate its robustness against outliers. In order to avoid technical difficulties in the theoretical analysis of robust $(\nu ,\mu )$-SVM, we assume that $\nu m$ and $\mu m$ are positive integers throughout this paper. This is not a severe limitation unless the sample size is extremely small. This assumption ensures that the optimal solution of η in (5) lies in the binary product set ${\{0,1\}}^{m}$.
Now, let us show the equivalence of robust $(\nu ,\mu )$-SVM and CVaR-$({\alpha}_{L},{\alpha}_{U})$-SVM [16]. Given ν and μ, the optimization problem (5) can be represented as:
where ${r}_{i}=-{y}_{i}(f\left({x}_{i}\right)+b)$ is the negative margin and $\sigma \left(i\right),\ i\in \left[m\right]$ is the permutation such that ${r}_{\sigma \left(1\right)}\ge \cdots \ge {r}_{\sigma \left(m\right)}$, as defined in Section 2. The second term in (6) is the average of the negative margins included in the middle interval presented in Figure 1, and it is expressed by the difference of CVaRs at levels ν and μ. A learning algorithm based on this interpretation is proposed in [16] under the name CVaR-$({\alpha}_{L},{\alpha}_{U})$-SVM with ${\alpha}_{L}=1-\nu $ and ${\alpha}_{U}=1-\mu $.
$$\min_{f\in \mathcal{H},\ b\in \mathbb{R}}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}+(\nu -\mu )\cdot \frac{1}{(\nu -\mu )m}\sum _{i=\mu m+1}^{\nu m}{r}_{\sigma \left(i\right)},$$
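A small sketch of the second term of (6), under the paper's assumption that $\mu m$ and $\nu m$ are integers with $\mu <\nu $; the function name and sample values are ours:

```python
import numpy as np

def trimmed_tail_average(r, nu, mu):
    # second term of (6): (nu - mu) times the average of r_sigma(mu*m + 1)
    # through r_sigma(nu*m); the mu*m largest negative margins are discarded
    # as outliers before the tail average is taken
    m = len(r)
    lo, hi = int(round(mu * m)), int(round(nu * m))
    middle = np.sort(r)[::-1][lo:hi]      # descending order, middle interval
    return (nu - mu) * middle.mean()      # equals (1/m) * middle.sum()

r = np.array([5.0, 1.0, 0.5, -0.2, -1.0])        # r_sigma(1) = 5.0 is an outlier
print(trimmed_tail_average(r, nu=0.6, mu=0.2))   # drops 5.0, averages 1.0 and 0.5
```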
Robust $(\nu ,\mu )$-SVM is also closely related to the robust outlier detection (ROD) algorithm [17], which is a robust variant of C-SVM. In ROD, the classifier is given by the optimal solution of:
where $\lambda >0$ is a regularization parameter and η is an outlier indicator. The linear kernel is used in the original ROD [17]. To obtain a classifier, ROD solves a semidefinite relaxation of (7). In [18], it is proven that a KKT point of (7) with the learning parameter $(\lambda ,\mu )$ corresponds to that of robust $(\nu ,\mu )$-SVM for some parameter ν.
$$\begin{array}{c}{\displaystyle \min_{f,b,\eta}\ \frac{\lambda}{2}\|f\|_{\mathcal{H}}^{2}+\sum _{i=1}^{m}{\eta}_{i}\left[1-{y}_{i}\left(f\left({x}_{i}\right)+b\right)\right]_{+},}\\ \mathrm{subject\ to}\ \ f\in \mathcal{H},\ b\in \mathbb{R},\\ \phantom{\mathrm{subject\ to}\ \ }\eta =({\eta}_{1},\dots ,{\eta}_{m})\in {[0,1]}^{m},\ \sum _{i=1}^{m}{\eta}_{i}\ge m(1-\mu ),\end{array}$$
3.2. Learning Algorithm
It is hard to obtain a global optimal solution of (5), since the objective function is nonconvex. The difference of convex functions algorithm (DCA) [27] and concave-convex programming (CCCP) [28] are popular methods to efficiently obtain practical numerical solutions of nonconvex optimization problems. Indeed, DCA is used in robust C-SVM using the ramp loss [12] and in ER-SVM [18].
Let us show an expression of the objective function in (5) as a difference of convex functions. The set of feasible outlier indicators is denoted as:
$$E_{\mu}=\left\{({\eta}_{1},\dots ,{\eta}_{m})^{T}\in {[0,1]}^{m}\ :\ \sum _{i=1}^{m}{\eta}_{i}\ge m(1-\mu )\right\}.$$
For the negative margin ${r}_{i}=-{y}_{i}(f\left({x}_{i}\right)+b)$, the objective function of robust $(\nu ,\mu )$-SVM is then represented as:
which is derived from (3) and (6).
$$\begin{array}{cl}&\min_{\rho \in \mathbb{R},\ \eta \in {E}_{\mu}}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}-(\nu -\mu )\rho +\frac{1}{m}\sum _{i=1}^{m}{\eta}_{i}\left[\rho +{r}_{i}\right]_{+}\\ =&\min_{\rho \in \mathbb{R}}\ \frac{1}{2}\|f\|_{\mathcal{H}}^{2}-\nu \rho +\frac{1}{m}\sum _{i=1}^{m}\left[\rho +{r}_{i}\right]_{+}-\max_{\eta \in {E}_{\mu}}\frac{1}{m}\sum _{i=1}^{m}(1-{\eta}_{i}){r}_{i},\end{array}$$
We derive the DCA using the decomposition (8). The optimization algorithm is a simplified variant of the learning algorithm proposed in [18]. The representer theorem ensures that the optimal decision function is represented by $g\left(x\right)={\sum}_{i=1}^{m}{\alpha}_{i}k(x,{x}_{i})+b$ when the kernel function of the RKHS $\mathcal{H}$ is $k(x,{x}^{\prime})$. From (8), the objective function of robust $(\nu ,\mu )$-SVM is expressed as:
using the convex functions ${\psi}_{0}$ and ${\psi}_{1}$ defined as:
where α is the column vector ${({\alpha}_{1},\dots ,{\alpha}_{m})}^{T}\in {\mathbb{R}}^{m}$ and $K\in {\mathbb{R}}^{m\times m}$ is the Gram matrix defined by ${K}_{ij}=k({x}_{i},{x}_{j}),\phantom{\rule{0.166667em}{0ex}}i,j\in \left[m\right]$. Let ${\alpha}_{t}\in {\mathbb{R}}^{m},\phantom{\rule{0.166667em}{0ex}}{b}_{t},{\rho}_{t}\in \mathbb{R}$ be the solution obtained after t iterations of the DCA. Next, the solution is updated to the optimal solution of:
where $(u,v)\in {\mathbb{R}}^{m+1}$ with $u\in {\mathbb{R}}^{m},v\in \mathbb{R}$ is an element of the subgradient of ${\psi}_{1}$ at $({\alpha}_{t},{b}_{t})$. Let $\mathrm{conv}S$ be the convex hull of the set S, and let $a\circ b$ denote componentwise multiplication of two vectors a and b. Accordingly, the subgradient of ${\psi}_{1}$ can be expressed as:
where ${1}_{m}$ denotes an mdimensional vector of all ones. A parameter $\eta \in {E}_{\mu}$ that meets the condition in the above subgradient is obtained by sorting the negative margin ${r}_{i}$, $i\in \left[m\right]$ at $(\alpha ,b)=({\alpha}_{t},{b}_{t})$.
$$\mathsf{\Phi}(\alpha ,b,\rho )={\psi}_{0}(\alpha ,b,\rho )-{\psi}_{1}(\alpha ,b)$$
$$\begin{array}{cl}{\psi}_{0}(\alpha ,b,\rho )&=\frac{1}{2}{\alpha}^{T}K\alpha -\nu \rho +\frac{1}{m}\sum _{i=1}^{m}\left[\rho +{r}_{i}\right]_{+},\\ {\psi}_{1}(\alpha ,b)&=\max_{\eta \in {E}_{\mu}}\frac{1}{m}\sum _{i=1}^{m}(1-{\eta}_{i}){r}_{i},\end{array}$$
$$\min_{\alpha ,b,\rho}\ {\psi}_{0}(\alpha ,b,\rho )-{u}^{T}\alpha -vb,$$
$$\begin{array}{cl}&\partial {\psi}_{1}({\alpha}_{t},{b}_{t})\\ =&\mathrm{conv}\left\{(u,v)\ :\ u=-\frac{1}{m}K\left(y\circ ({1}_{m}-\eta )\right),\ v=-\frac{1}{m}{y}^{T}({1}_{m}-\eta ),\ \mathrm{where}\ \eta \in {E}_{\mu}\ \mathrm{attains\ the\ maximum\ in}\ {\psi}_{1}({\alpha}_{t},{b}_{t})\right\},\end{array}$$
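A sketch of how one such subgradient element $(u,v)$ can be computed by sorting the negative margins ${r}_{i}=-{y}_{i}((K\alpha )_{i}+b)$; the function name and the toy data below are illustrative, not from the paper:

```python
import numpy as np

def dca_subgradient(K, y, alpha, b, mu):
    # One element (u, v) of the subgradient of psi_1 at (alpha, b):
    # eta_i = 0 for the mu*m samples with the largest negative margins r_i
    # (the suspected outliers), then
    #   u = -(1/m) K (y o (1_m - eta)),  v = -(1/m) y^T (1_m - eta).
    m = len(y)
    r = -y * (K @ alpha + b)                            # negative margins
    eta = np.ones(m)
    eta[np.argsort(r)[::-1][: int(round(mu * m))]] = 0.0
    u = -(1.0 / m) * K @ (y * (1.0 - eta))
    v = -(1.0 / m) * np.dot(y, 1.0 - eta)
    return u, v, eta

# toy example: 4 samples, mu*m = 1 sample flagged as an outlier
K = np.eye(4)
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.1, 0.2, -0.3, 0.4])
u, v, eta = dca_subgradient(K, y, alpha, b=0.0, mu=0.25)
print(eta)   # the sample with the largest r_i gets eta_i = 0
```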
Let us describe the learning algorithm for robust $(\nu ,\mu )$-SVM. We propose a modification of DCA to guarantee the robustness of the local optimal solution. The DCA for robust $(\nu ,\mu )$-SVM based on Expression (9) is used to obtain a good numerical solution of the outlier indicator. Let $\mathrm{Loss}(f,b,\rho ,\eta )$ be the objective function of robust $(\nu ,\mu )$-SVM:
$$\mathrm{Loss}(f,b,\rho ,\eta )=\frac{1}{2}\|f\|_{\mathcal{H}}^{2}-(\nu -\mu )\rho +\frac{1}{m}\sum _{i=1}^{m}{\eta}_{i}\left[\rho -{y}_{i}(f\left({x}_{i}\right)+b)\right]_{+}.$$
The learning algorithm is presented in Algorithm 1. Given training samples $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$, the learning algorithm outputs the decision function ${f}_{D}+{b}_{D}\in \mathcal{H}+\mathbb{R}$. The dual problem of (10) is presented as (11) in Algorithm 1.
The numerical solution given by DCA is modified in Steps 7 and 8. Step 7 of Algorithm 1 is equivalent to solving (5) with the additional equality constraint $\eta =\overline{\eta}\in {E}_{\mu}$. This is almost the same as the standard ν-SVM using the training samples with ${\overline{\eta}}_{i}=1$ and the regularization parameter $(\nu -\mu )/(1-\mu )\in (0,1)$ instead of ν. Hence, the optimal solution ${f}_{D}$ is efficiently obtained. In Step 8, the problem is reduced to the optimization of a one-dimensional piecewise linear function of b. This fact is shown in Appendix C, where we prove the robustness property of ${b}_{D}$ in Section 4.2. Hence, finding a local optimal solution of the problem in Step 8 is tractable.
Throughout the learning algorithm, the objective value monotonically decreases. Indeed, the DCA has the monotone decreasing property of the objective value [27]. Let $\overline{f},\overline{b},\overline{\rho},\overline{\eta}$ be the numerical solution obtained at the last iteration of DCA. Then, we have:
$$\begin{array}{cc}\hfill \mathrm{Loss}(\overline{f},\overline{b},\overline{\rho},\overline{\eta})& \ge \underset{f\in \mathcal{H},b,\rho \in \mathbb{R}}{min}\mathrm{Loss}(f,b,\rho ,\overline{\eta})\hfill \\ & =\underset{b,\rho \in \mathbb{R}}{min}\mathrm{Loss}({f}_{D},b,\rho ,\overline{\eta})\hfill \\ & \ge \underset{b,\rho \in \mathbb{R},\eta \in {E}_{\mu}}{min}\mathrm{Loss}({f}_{D},b,\rho ,\eta )\hfill \\ & =\underset{\rho \in \mathbb{R},\eta \in {E}_{\mu}}{min}\mathrm{Loss}({f}_{D},{b}_{D},\rho ,\eta ).\hfill \end{array}$$
It is straightforward to guarantee the monotone decrease of the objective value even if ${b}_{D}$ is a local optimal solution.
Algorithm 1 Learning Algorithm of Robust $(\nu ,\mu )$-SVM
Input: Training dataset $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$, Gram matrix $K\in {\mathbb{R}}^{m\times m}$ defined as ${K}_{ij}=k({x}_{i},{x}_{j}),\ i,j\in \left[m\right]$, and training labels $y={({y}_{1},\dots ,{y}_{m})}^{T}\in {\{+1,-1\}}^{m}$. The matrix $\tilde{K}\in {\mathbb{R}}^{m\times m}$ is defined as ${\tilde{K}}_{ij}={y}_{i}{y}_{j}{K}_{ij}$. Let $g\left(x\right)=f\left(x\right)+b\in \mathcal{H}+\mathbb{R}$ be the initial decision function.

3.3. Dual Problem and Its Interpretation
The partial dual problem of (5) with a fixed outlier indicator $\eta \in {[0,1]}^{m}$ has an intuitive geometric picture. Some variants of ν-SVM can be geometrically interpreted on the basis of the dual form [29,30,31]. Substituting (2) into the objective function in (5), we obtain the Lagrangian of problem (5) with a fixed $\eta \in {E}_{\mu}$ as:
where nonnegative slack variables ${\xi}_{i},\ i\in \left[m\right]$ are introduced to represent the hinge loss. Here, the parameters ${\beta}_{i}$ and ${\gamma}_{i}$ for $i\in \left[m\right]$ are nonnegative Lagrange multipliers. For a fixed $\eta \in {E}_{\mu}$, the Lagrangian is convex in the parameters $\alpha ,b,\rho $ and ξ and concave in $\beta =({\beta}_{1},\dots ,{\beta}_{m})$ and $\gamma =({\gamma}_{1},\dots ,{\gamma}_{m})$. Hence, the minimax theorem [32] (Proposition 6.4.3) yields:
$$\begin{array}{cl}{L}_{\eta}(\alpha ,b,\rho ,\xi ;\beta ,\gamma )&=\frac{1}{2}\sum _{i,j=1}^{m}{\alpha}_{i}{\alpha}_{j}k({x}_{i},{x}_{j})-(\nu -\mu )\rho +\frac{1}{m}\sum _{i=1}^{m}{\eta}_{i}{\xi}_{i}-\sum _{i=1}^{m}{\beta}_{i}{\xi}_{i}\\ &\phantom{=}+\sum _{i=1}^{m}{\gamma}_{i}\left(\rho -{\xi}_{i}-{y}_{i}\left(\sum _{j}k({x}_{i},{x}_{j}){\alpha}_{j}+b\right)\right),\end{array}$$
$$\begin{array}{cl}&\inf_{\alpha ,b,\rho ,\xi}\ \sup_{\beta ,\gamma \ge 0}\ {L}_{\eta}(\alpha ,b,\rho ,\xi ;\beta ,\gamma )\\ =&\sup_{\beta ,\gamma \ge 0}\ \inf_{\alpha ,b,\rho ,\xi}\ {L}_{\eta}(\alpha ,b,\rho ,\xi ;\beta ,\gamma )\\ =&\sup_{\beta ,\gamma \ge 0}\ \inf_{\alpha ,b,\rho ,\xi}\ \rho \left(\sum _{i}{\gamma}_{i}-(\nu -\mu )\right)+\sum _{i}{\xi}_{i}\left(\frac{{\eta}_{i}}{m}-{\beta}_{i}-{\gamma}_{i}\right)+\frac{1}{2}\sum _{i,j}{\alpha}_{i}{\alpha}_{j}k({x}_{i},{x}_{j})-\sum _{i}{\gamma}_{i}{y}_{i}\sum _{j}k({x}_{i},{x}_{j}){\alpha}_{j}-b\sum _{i}{y}_{i}{\gamma}_{i}\\ =&\max_{\gamma}\left\{-\frac{1}{2}\Big\|\sum _{i}{\gamma}_{i}{y}_{i}k(\cdot ,{x}_{i})\Big\|_{\mathcal{H}}^{2}\ :\ \sum _{i:{y}_{i}=+1}{\gamma}_{i}=\sum _{i:{y}_{i}=-1}{\gamma}_{i}=\frac{\nu -\mu}{2},\ 0\le {\gamma}_{i}\le \frac{{\eta}_{i}}{m}\right\}.\end{array}$$
The last equality comes from the optimality condition with respect to the variables $\alpha ,b,\rho ,\xi $. Given the optimal solution ${\gamma}_{i},\phantom{\rule{0.166667em}{0ex}}i\in \left[m\right]$ of the dual problem, the optimal coefficient ${\alpha}_{i}$ in the primal problem is given by ${\alpha}_{i}={\gamma}_{i}{y}_{i}$, and the bias term b is obtained from the complementary slackness of ${\gamma}_{i}$ such that $0<{\gamma}_{i}<{\eta}_{i}/m$ and ${\eta}_{i}=1$.
Let us give a geometric interpretation of the above expression. For the training data $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$, the convex sets, ${\mathcal{U}}_{\eta}^{+}[\nu ,\mu ;D]$ and ${\mathcal{U}}_{\eta}^{-}[\nu ,\mu ;D]$, are defined as the reduced convex hulls of the data points for each label, i.e.,
$$\mathcal{U}_{\eta}^{\pm}[\nu ,\mu ;D]=\left\{\sum _{i:{y}_{i}=\pm 1}{\gamma}_{i}^{\prime}k(\cdot ,{x}_{i})\in \mathcal{H}:\ \sum _{i:{y}_{i}=\pm 1}{\gamma}_{i}^{\prime}=1,\ 0\le {\gamma}_{i}^{\prime}\le \frac{2{\eta}_{i}}{(\nu -\mu )m}\ \mathrm{for}\ i\ \mathrm{such\ that}\ {y}_{i}=\pm 1\right\}.$$
The coefficients ${\gamma}_{i}^{\prime},\phantom{\rule{0.166667em}{0ex}}i\in \left[m\right]$ in ${\mathcal{U}}_{\eta}^{\pm}[\nu ,\mu ;D]$ are bounded above by a nonnegative real number. Hence, the reduced convex hull is a subset of the convex hull of the data points in the RKHS $\mathcal{H}$. Let ${\mathcal{V}}_{\eta}[\nu ,\mu ;D]$ be the Minkowski difference of the two subsets,
$$\begin{array}{c}\hfill {\mathcal{V}}_{\eta}[\nu ,\mu ;D]={\mathcal{U}}_{\eta}^{+}[\nu ,\mu ;D]\ominus {\mathcal{U}}_{\eta}^{-}[\nu ,\mu ;D],\end{array}$$
where $A\ominus B$ of subsets A and B denotes $\{a-b:a\in A,\phantom{\rule{0.166667em}{0ex}}b\in B\}$. We obtain:
$$\begin{array}{c}\hfill \underset{\alpha ,b,\rho ,\xi}{\inf}\;\underset{\beta ,\gamma \ge 0}{\sup}\;{L}_{\eta}(\alpha ,b,\rho ,\xi ;\beta ,\gamma )\phantom{\rule{0.166667em}{0ex}}=\phantom{\rule{0.166667em}{0ex}}-\frac{{(\nu -\mu )}^{2}}{8}\min \left\{{\parallel f\parallel}_{\mathcal{H}}^{2}\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}f\in {\mathcal{V}}_{\eta}[\nu ,\mu ;D]\right\}\end{array}$$
for each $\eta \in {E}_{\mu}$. As a result, the optimal value of (5) is given as $-{(\nu -\mu )}^{2}/8\times \mathrm{opt}(\nu ,\mu ;D)$, where:
$$\begin{array}{c}\hfill \mathrm{opt}(\nu ,\mu ;D)=\underset{\eta \in {E}_{\mu}}{\max}\;\underset{f\in {\mathcal{V}}_{\eta}[\nu ,\mu ;D]}{\min}\;{\parallel f\parallel}_{\mathcal{H}}^{2}.\end{array}$$
Therefore, the dual form of robust $(\nu ,\mu )$SVM can be expressed as the maximization over η of the minimum distance between the two reduced convex hulls, ${\mathcal{U}}_{\eta}^{+}[\nu ,\mu ;D]$ and ${\mathcal{U}}_{\eta}^{-}[\nu ,\mu ;D]$. The estimated decision function in robust $(\nu ,\mu )$SVM is provided by the optimal solution of (12) up to a scaling factor depending on $\nu -\mu $. Moreover, the optimal value is proportional to the squared RKHS norm of $f\in \mathcal{H}$ in the decision function $g\left(x\right)=f\left(x\right)+b$.
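For intuition, the minimum distance between two reduced convex hulls can be computed numerically in the linear-kernel case, where $\mathcal{H}={\mathbb{R}}^{d}$. The sketch below is our own illustration, not the paper's solver: it runs Frank–Wolfe over capped simplices, where the cap plays the role of $2{\eta}_{i}/((\nu -\mu )m)$ for a fixed η; all function names are hypothetical.

```python
import numpy as np

def lmo_capped_simplex(grad, cap):
    # Linear minimization oracle over {s : sum(s) = 1, 0 <= s_i <= cap}:
    # pour the unit budget into the coordinates with the smallest gradient.
    s, budget = np.zeros(len(grad)), 1.0
    for i in np.argsort(grad):
        s[i] = min(cap, budget)
        budget -= s[i]
        if budget <= 1e-15:
            break
    return s

def reduced_hull_distance(Xp, Xn, cap_p, cap_n, iters=5000):
    # Frank-Wolfe for min ||u - v||, with u, v in the reduced convex hulls
    # of the rows of Xp and Xn (linear kernel, so the RKHS is R^d).
    a = lmo_capped_simplex(np.zeros(len(Xp)), cap_p)
    b = lmo_capped_simplex(np.zeros(len(Xn)), cap_n)
    for t in range(iters):
        w = Xp.T @ a - Xn.T @ b                      # current difference u - v
        sa = lmo_capped_simplex(2 * Xp @ w, cap_p)   # gradient w.r.t. a
        sb = lmo_capped_simplex(-2 * Xn @ w, cap_n)  # gradient w.r.t. b
        step = 2.0 / (t + 2)
        a += step * (sa - a)
        b += step * (sb - b)
    return np.linalg.norm(Xp.T @ a - Xn.T @ b)
```

Shrinking the caps shrinks each hull toward its class mean, which can only increase the minimum distance between the hulls.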
4. Breakdown Point of Robust SVMs
4.1. FiniteSample Breakdown Point
Let us describe how to evaluate the robustness of learning algorithms. There are a number of robustness measures for evaluating the stability of estimators as discussed later in Section 4.3. In this paper, we use the finitesample breakdown point, and it will be referred to as the breakdown point for short. The breakdown point quantifies the degree of impact that the outliers have on the estimators when the contamination ratio is not necessarily infinitesimal [33]. In this section, we present an exact evaluation of the breakdown point of robust SVMs.
The breakdown point indicates the largest amount of contamination such that the estimator still gives information about the noncontaminated data [20] (Chapter 3.2). More precisely, for an estimator ${\theta}_{D}$ based on a dataset D of size m that takes a value in a normed parameter space, the finitesample breakdown point is defined as:
$$\begin{array}{c}\hfill {\epsilon}^{*}=\underset{\kappa =0,1,\dots ,m}{\max}\{\phantom{\rule{0.166667em}{0ex}}\kappa /m\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}{\theta}_{{D}^{\prime}}\phantom{\rule{4.pt}{0ex}}\mathrm{is}\phantom{\rule{4.pt}{0ex}}\mathrm{uniformly}\phantom{\rule{4.pt}{0ex}}\mathrm{bounded}\phantom{\rule{4.pt}{0ex}}\mathrm{for}\phantom{\rule{4.pt}{0ex}}{D}^{\prime}\in {\mathcal{D}}_{\kappa}\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}\},\end{array}$$
where ${\mathcal{D}}_{\kappa}$ is the family of datasets of size m including at least $m-\kappa $ elements in common with the noncontaminated dataset D, i.e.,
$$\begin{array}{c}\hfill {\mathcal{D}}_{\kappa}=\left\{\phantom{\rule{0.166667em}{0ex}}{D}^{\prime}\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}|{D}^{\prime}|=m,\phantom{\rule{0.166667em}{0ex}}|{D}^{\prime}\cap D|\ge m-\kappa \phantom{\rule{0.166667em}{0ex}}\right\}.\end{array}$$
For simplicity, the dependency of ${\mathcal{D}}_{\kappa}$ on the dataset D is dropped. The condition of the breakdown point ${\epsilon}^{*}$ can be rephrased as:
$$\begin{array}{c}\hfill \underset{{D}^{\prime}\in {\mathcal{D}}_{\kappa}}{\sup}\parallel {\theta}_{{D}^{\prime}}\parallel <\infty ,\end{array}$$
where $\parallel \cdot \parallel $ is the norm on the parameter space. In most cases of interest, ${\epsilon}^{*}$ does not depend on the dataset D. For example, the breakdown point of the onedimensional median estimator is ${\epsilon}^{*}=\lfloor (m-1)/2\rfloor /m$.
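The median example can be checked numerically. The snippet below is a textbook illustration in Python, not code from the paper: with $m=9$, the median stays bounded under any $\kappa \le 4=\lfloor (m-1)/2\rfloor$ replacements, however extreme the outliers, and breaks down at $\kappa =5$.

```python
import statistics

def contaminated_median(data, kappa, M):
    # Worst-case contamination of the one-dimensional median: replace
    # kappa of the m points with outliers at +M (an adversarial choice
    # that drops the kappa largest original points).
    kept = sorted(data)[:len(data) - kappa]
    return statistics.median(kept + [M] * kappa)
```

Here the breakdown point is ${\epsilon}^{*}=4/9$: four outliers leave the median inside the original data range, while five carry it off to the outlier location.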
4.2. Breakdown Point of Robust $(\nu ,\mu )$SVM
The parameters of robust $(\nu ,\mu )$SVM have a clear meaning unlike those of robust CSVM and ROD. In fact, $\nu -\mu $ is a lower bound of the ratio of support vectors and an upper bound of the margin error ratio, as mentioned in Section 3.1. In addition, we show that the parameter μ is exactly equal to the breakdown point of the decision function under a mild assumption. Such an intuitive interpretation will be of great help in tuning the parameters in the learning algorithm. Section 5 describes how to tune the learning parameters.
To start with, let us derive a lower bound of the breakdown point for the optimal value of Problem (5) that is expressed as $\mathrm{opt}(\nu ,\mu ;D)$ up to a constant factor. As shown in Section 3.3, the boundedness of $\mathrm{opt}(\nu ,\mu ;D)$ is equivalent to the boundedness of the RKHS norm of $f\in \mathcal{H}$ in the estimated decision function $g\left(x\right)=f\left(x\right)+b$. Given a labeled dataset $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$, let us define the label ratio r as:
$$\begin{array}{c}\hfill r=\frac{1}{m}\min \left\{\phantom{\rule{0.166667em}{0ex}}\left|\{i:{y}_{i}=+1\}\right|,\ \left|\{i:{y}_{i}=-1\}\right|\phantom{\rule{0.166667em}{0ex}}\right\}.\end{array}$$
In what follows, we assume $m\nu ,m\mu \in \mathbb{N}$ to avoid technical difficulty.
Theorem 1.
Let D be a labeled dataset of size m with a label ratio $r>0$. For the parameters $\nu ,\mu $ such that $0\le \mu <\nu <1$ and $\nu m,\mu m\in \mathbb{N}$, we assume $\mu <r/2$. Then, the following two conditions are equivalent:
 (i)
 The inequality$$\begin{array}{c}\hfill \nu -\mu \le 2(r-2\mu )\end{array}$$holds.
 (ii)
 Uniform boundedness,$$\begin{array}{c}\hfill \sup \{\mathrm{opt}(\nu ,\mu ;{D}^{\prime})\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}{D}^{\prime}\in {\mathcal{D}}_{\mu m}\}<\infty ,\end{array}$$holds.
The proof of the above theorem is given in Appendix A. The assumption $\mu <r/2$ has an intuitive interpretation. If $\mu <r/2$ is violated, the majority of, say, the positively labeled samples in the noncontaminated training dataset can be replaced with outliers. In such a situation, the statistical features of the original dataset will not be retained.
Remark 1.
The condition (14) has an intuitive interpretation. Assume that ${m}_{+}<m/2$. After removing some training samples due to the optimal outlier indicator η, there exist at least ${m}_{+}-m\mu -m\mu ={m}_{+}-2m\mu $ positive training samples for any ${D}^{\prime}\in {\mathcal{D}}_{m\mu}$. In the standard νSVM, the condition $\nu \le 2r$ guarantees the boundedness of the optimal value, $\mathrm{opt}(\nu ,0;D)$, for a noncontaminated dataset D [29]. For the robust $(\nu ,\mu )$SVM, ν and r are replaced with $\nu -\mu $ and $({m}_{+}-2m\mu )/m$, respectively. As a result, the inequality (14) is obtained as a sufficient condition of $\mathrm{opt}(\nu ,\mu ;{D}^{\prime})<\infty $ for each ${D}^{\prime}\in {\mathcal{D}}_{m\mu}$. This implies the pointwise boundedness of $\mathrm{opt}(\nu ,\mu ;{D}^{\prime})$. However, this interpretation does not prove the uniform boundedness of $\mathrm{opt}(\nu ,\mu ;{D}^{\prime})$ over ${D}^{\prime}\in {\mathcal{D}}_{m\mu}$. In the proof in Appendix A, we prove the uniform boundedness over ${\mathcal{D}}_{m\mu}$.
The inequality (14) indicates the tradeoff between the ratio of outliers μ and the ratio of support vectors $\nu -\mu $. This result is reasonable. The number of support vectors corresponds to the dimension of the statistical model. When the ratio of outliers is large, a simple statistical model should be used to obtain robust estimators.
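Theorem 1's condition is easy to check programmatically. The following minimal sketch uses our own helper names (not from the paper):

```python
def label_ratio(labels):
    # r = (1/m) * min(#positive labels, #negative labels)
    m = len(labels)
    return min(sum(1 for y in labels if y == +1),
               sum(1 for y in labels if y == -1)) / m

def opt_uniformly_bounded(nu, mu, r):
    # Condition (i) of Theorem 1: nu - mu <= 2 * (r - 2 * mu), which is
    # equivalent to uniform boundedness of opt(nu, mu; D') over D_{mu m},
    # under the assumptions 0 <= mu < nu < 1 and mu < r / 2.
    if not (0 <= mu < nu < 1 and mu < r / 2):
        raise ValueError("parameters outside the theorem's assumptions")
    return nu - mu <= 2 * (r - 2 * mu)
```

For a balanced dataset ($r=1/2$) with $\mu =0.1$, the condition holds for $\nu =0.5$ but fails for $\nu =0.9$.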
When the contamination ratio in the training dataset is greater than the parameter μ of robust $(\nu ,\mu )$SVM, the estimated decision function is not necessarily bounded.
Theorem 2.
Suppose that ν and μ are rational numbers such that $0<\mu <1/4$ and $\mu <\nu <1$. Then, there exists a dataset D of size m with the label ratio r such that $\mu <r/2$ and:
$$\begin{array}{c}\hfill \sup \{\mathrm{opt}(\nu ,\mu ;{D}^{\prime})\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}{D}^{\prime}\in {\mathcal{D}}_{\mu m+1}\}=\infty \end{array}$$
hold, where ${\mathcal{D}}_{\mu m+1}$ is defined from D.
The proof is given in Appendix B. Theorems 1 and 2 provide lower and upper bounds of the breakdown point, respectively. Hence, the breakdown point of the function part $f\in \mathcal{H}$ in the estimated decision function $g=f+b$ is exactly equal to ${\epsilon}^{*}=\mu $, when the learning parameters of robust $(\nu ,\mu )$SVM satisfy $\mu <r/2$ and $\nu -\mu \le 2(r-2\mu )$. Otherwise, the breakdown point of f is strictly less than μ. Note that the results in Theorems 1 and 2 hold for the global optimal solution.
Remark 2.
Let us consider the robustness of the local optimal solution ${f}_{D}$ obtained by robust $(\nu ,\mu )$SVM. Let ${f}_{\mathrm{opt}}$ be the global optimal solution of robust $(\nu ,\mu )$SVM. For the outlier indicator $\eta =\overline{\eta}\in {E}_{\mu}$ in Algorithm 1, we have:
$$\begin{array}{c}\hfill \parallel {f}_{\mathrm{opt}}{\parallel}_{\mathcal{H}}^{2}=\mathrm{opt}(\nu ,\mu ;D)\ge \min \left\{{\parallel f\parallel}_{\mathcal{H}}^{2}\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}f\in {\mathcal{V}}_{\overline{\eta}}[\nu ,\mu ;D]\right\}=\parallel {f}_{D}{\parallel}_{\mathcal{H}}^{2},\end{array}$$
where the last equality is guaranteed by the result in Section 3.3. Therefore, ${f}_{D}$ is less sensitive to contamination than the RKHS element of the global optimal solution.
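Although Algorithm 1 is not reproduced in this section, its outlier-indicator update can be sketched schematically as follows (our own code, not the paper's): between convex subproblems, η assigns 0 to the $\mu m$ samples with the largest margin errors and 1 to the rest.

```python
def update_eta(margin_errors, mu):
    # Schematic 0-1 outlier-indicator update: discard (set eta_i = 0)
    # the k = mu * m samples with the largest margin errors.
    m = len(margin_errors)
    k = int(mu * m)
    worst = sorted(range(m), key=lambda i: margin_errors[i])[-k:] if k else []
    eta = [1] * m
    for i in worst:
        eta[i] = 0
    return eta
```

A weighted convex SVM subproblem is then solved with the updated η, and the two steps alternate until η stops changing.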
Now, we will show the robustness of the bias term b. Let ${b}_{D}$ be the estimated bias parameter obtained by Algorithm 1. We will derive a lower bound of the breakdown point of the bias term. Then, we will show that the breakdown point of robust $(\nu ,\mu )$SVM with a bounded kernel is given by a simple formula.
Theorem 3.
Let D be an arbitrary dataset of size m with a label ratio r that is greater than zero. Suppose that ν and μ satisfy $0<\mu <\nu <1$, $\nu m,\mu m\in \mathbb{N}$, and $\mu <r/2$. For a nonnegative integer ℓ, we assume:
$$\begin{array}{c}\hfill 0\le 2\left(\mu -\frac{\ell}{m}\right)<\nu -\mu <2(r-2\mu ).\end{array}$$
Then, uniform boundedness,
$$\begin{array}{c}\hfill \sup \left\{\phantom{\rule{0.166667em}{0ex}}|{b}_{{D}^{\prime}}|\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}{D}^{\prime}\in {\mathcal{D}}_{\mu m-\ell}\phantom{\rule{0.166667em}{0ex}}\right\}<\infty ,\end{array}$$
holds, where ${\mathcal{D}}_{\mu m-\ell}$ is defined from D.
The proof is given in Appendix C, in which a detailed analysis is needed especially when the kernel function is unbounded. The proof shows that the uniform boundedness holds even if ${b}_{{D}^{\prime}}$ is a local optimal solution in Algorithm 1. Note that the inequality (15) is a sufficient condition of Inequality (14). Theorem 3 guarantees that the breakdown point of the estimated decision function ${f}_{D}+{b}_{D}$ is not less than $\mu -\ell /m$, when (15) holds.
The robustness of ${b}_{D}$ for a bounded kernel is considered in the theorem below.
Theorem 4.
Let D be an arbitrary dataset of size m with a label ratio r that is greater than zero. For parameters such that $0<\mu <\nu <1$ and $\nu m,\mu m\in \mathbb{N}$, suppose that $\mu <r/2$ and $\nu -\mu <2(r-2\mu )$ hold. In addition, assume that the kernel function $k(x,{x}^{\prime})$ of the RKHS $\mathcal{H}$ is bounded, i.e., ${\sup}_{x\in \mathcal{X}}k(x,x)<\infty $. Then, uniform boundedness,
$$\begin{array}{c}\hfill \sup \left\{\phantom{\rule{0.166667em}{0ex}}|{b}_{{D}^{\prime}}|\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}{D}^{\prime}\in {\mathcal{D}}_{\mu m}\phantom{\rule{0.166667em}{0ex}}\right\}<\infty ,\end{array}$$
holds, where ${\mathcal{D}}_{\mu m}$ is defined from D.
The proof is given in Appendix D. Compared with Theorem 3 in which arbitrary kernel functions are treated, Theorem 4 ensures that a tighter lower bound of the breakdown point is obtained for bounded kernels. The above result agrees with those of other studies. The authors of [14] proved that bounded kernels produce robust estimators for regression problems in the sense of bounded response, i.e., robustness against a single outlier.
Combining Theorems 1–4, we find that the breakdown point of robust $(\nu ,\mu )$SVM with $\mu <r/2$ is given as follows.
 Bounded kernel: For $\nu -\mu >2(r-2\mu )$, the breakdown point of ${f}_{D}\in \mathcal{H}$ is less than μ. For $\nu -\mu \le 2(r-2\mu )$, the breakdown point of $({f}_{D},{b}_{D})$ is equal to μ.
 Unbounded kernel: For $\nu -\mu >2(r-2\mu )$, the breakdown point of ${f}_{D}\in \mathcal{H}$ is less than μ. For $2\mu <\nu -\mu \le 2(r-2\mu )$, the breakdown point of $({f}_{D},{b}_{D})$ is equal to μ. When $0<\nu -\mu <\min \{2\mu ,2(r-2\mu )\}$, the breakdown point of ${f}_{D}$ is equal to μ, and the breakdown point of ${b}_{D}$ is bounded from below by $\mu -\ell /m$ and from above by μ, where $\ell \in \mathbb{N}$ depends on ν and μ, as shown in Theorem 3.
Figure 2 shows the breakdown point of robust $(\nu ,\mu )$SVM. The line $\nu -\mu =2(r-2\mu )$ is critical. For unbounded kernels, we only obtain a bound of the breakdown point.
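The case analysis above can be summarized as a small decision function. This is our own summary helper, not from the paper; it assumes $0<\mu <\nu <1$ and $\mu <r/2$, and it assigns the boundary case $\nu -\mu =2\mu $ (left open by the bullets) to the third regime.

```python
def breakdown_regime(nu, mu, r, bounded_kernel):
    # Summary of Section 4.2's case analysis (assumes mu < r/2, 0 < mu < nu < 1).
    if nu - mu > 2 * (r - 2 * mu):
        return "f_D may break down below mu"
    if bounded_kernel or nu - mu > 2 * mu:
        return "(f_D, b_D) has breakdown point mu"
    # unbounded kernel with 0 < nu - mu < min{2 mu, 2 (r - 2 mu)}
    return "f_D: mu; b_D: between mu - l/m and mu"
```

For example, with $r=1/2$ and $\mu =0.1$, a bounded kernel with $\nu =0.9$ falls in the first regime, while $\nu =0.5$ attains the breakdown point μ.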
4.3. Breakdown Point Revisited
Let us reconsider the breakdown point of learning methods.
4.3.1. Effective Case of Breakdown Point
Suppose that the function ${\widehat{f}}_{D}\in \mathcal{H}$ is obtained by a learning method using the dataset D. Learning methods are categorized into two types according to the norm of ${\widehat{f}}_{D}$. The first type consists of the learning methods satisfying ${\sup}_{{D}^{\prime}}{\parallel {\widehat{f}}_{{D}^{\prime}}\parallel}_{\mathcal{H}}=\infty $, and the second type consists of those such that ${\sup}_{{D}^{\prime}}{\parallel {\widehat{f}}_{{D}^{\prime}}\parallel}_{\mathcal{H}}<\infty $, where the supremum is taken over arbitrary datasets of size m, i.e., ${D}^{\prime}\in {\mathcal{D}}_{m}={(\mathcal{X}\times \{+1,-1\})}^{m}$.
For learning methods of the first type, the breakdown point indicates the number of outliers such that the estimator remains in a uniformlybounded region. This is meaningful information about the robustness of the learning method. In this case, the larger breakdown point is regarded as a more robust method. As shown in Theorems 1 and 2, the robust $(\nu ,\mu )$SVM is a learning method of the first type.
The second type implies that the hypothesis space of the learning method is bounded regardless of the dataset. The CSVM, robust CSVM and ROD belong to learning methods of the second type. Indeed, given a labeled dataset $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$, the nonnegativity of the hinge loss in CSVM leads to:
$$\begin{array}{c}\hfill \frac{1}{2}\parallel {\widehat{f}}_{D}{\parallel}_{\mathcal{H}}^{2}\le \frac{1}{2}{\parallel {\widehat{f}}_{D}\parallel}_{\mathcal{H}}^{2}+C\sum _{i=1}^{m}{[1-{y}_{i}({\widehat{f}}_{D}\left({x}_{i}\right)+{\widehat{b}}_{D})]}_{+}\le mC,\end{array}$$
where the last inequality comes from the fact that the objective value at $f=0$ and $b=0$ is greater than or equal to the optimal value. Likewise, one can prove that robust CSVM and ROD have the same property. In this case, the naive definition of the breakdown point shown in Section 4.1 is not adequate, because the boundary effect of the hypothesis set is not taken into account. In the general definition of the breakdown point, the boundary of the hypothesis space is taken into account [20] (Chapter 3.2.5).
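The bound can be checked numerically; the snippet below is our own illustration of the reasoning. Evaluating the CSVM objective at $f=0$, $b=0$ gives $mC$, since every hinge term is ${[1-0]}_{+}=1$, so the minimizer satisfies ${\parallel {\widehat{f}}_{D}\parallel}_{\mathcal{H}}^{2}\le 2mC$ for every dataset.

```python
def hinge(z):
    # hinge loss [1 - z]_+
    return max(0.0, 1.0 - z)

def rkhs_norm_sq_bound(m, C):
    # Objective at f = 0, b = 0: each of the m hinge terms equals 1,
    # so the optimal value is at most C * m, giving ||f_hat||^2 <= 2 m C.
    objective_at_zero = C * sum(hinge(0.0) for _ in range(m))
    return 2.0 * objective_at_zero
```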
In this paper, we focus on the breakdown point of learning algorithms of the first type. Then, the analysis based on the breakdown point suggests proper choices of hyperparameters $(\nu ,\mu )$ as shown in succeeding sections.
4.3.2. Other Robust Estimators
Robust statistical inference has been studied for a long time in mathematical statistics, and a number of robust estimators have been proposed for many kinds of statistical problems [20,34,35]. In mathematical analysis, one needs to quantify the influence of samples on estimators. Here, the influence function, change of variance and breakdown point are often used as measures of robustness. In the machine learning literature, these measures have been used to analyze the theoretical properties of SVM and its robust variants. In [36], the robustness of a learning algorithm using a convex loss function was investigated on the basis of an influence function defined over an RKHS. When the influence function is uniformly bounded on the RKHS, the learning algorithm is regarded as robust against outliers. It was proven that the least squares loss provides a robust learning algorithm for classification problems in this sense [36].
From the standpoint of the breakdown point, however, convex loss functions do not provide robust estimators, as shown in [20] (Chapter 5.16). Yu et al. [14] proved that the breakdown point of a learning algorithm using a clipped loss is greater than or equal to $1/m$ in regression problems. In Section 4.2, we presented a detailed analysis of the breakdown point for robust $(\nu ,\mu )$SVM.
5. Admissible Region for Learning Parameters
The theoretical analysis in Section 4.2 suggests that robust $(\nu ,\mu )$SVM satisfying $0<\nu -\mu <2(r-2\mu )$ is a good choice for obtaining a robust classifier, especially when a bounded kernel is used. Here, r is the label ratio of the noncontaminated original data D, and usually, it is unknown in realworld data analysis. Thus, we need to estimate r from the contaminated dataset ${D}^{\prime}$.
If an upper bound of the outlier ratio is known to be $\tilde{\mu}$, we have ${D}^{\prime}\in {\mathcal{D}}_{\tilde{\mu}m}$, where ${\mathcal{D}}_{\tilde{\mu}m}$ is defined from D. Let ${r}^{\prime}$ be the label ratio of ${D}^{\prime}$. Then, the label ratio of the original dataset D should satisfy ${r}_{\mathrm{low}}\le r\le {r}_{\mathrm{up}}$, where ${r}_{\mathrm{low}}=\max \{{r}^{\prime}-\tilde{\mu},0\}$ and ${r}_{\mathrm{up}}=\min \{{r}^{\prime}+\tilde{\mu},1/2\}$. Let ${\mathsf{\Lambda}}_{\mathrm{low}}$ and ${\mathsf{\Lambda}}_{\mathrm{up}}$ be:
$$\begin{array}{c}\hfill \begin{array}{c}{\displaystyle {\mathsf{\Lambda}}_{\mathrm{low}}=\{(\nu ,\mu )\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}0\le \mu \le \tilde{\mu},\phantom{\rule{0.166667em}{0ex}}0<\nu -\mu <2({r}_{\mathrm{low}}-2\mu )\},}\hfill \\ {\displaystyle \phantom{\rule{4pt}{0ex}}{\mathsf{\Lambda}}_{\mathrm{up}}=\{(\nu ,\mu )\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}0\le \mu \le \tilde{\mu},\phantom{\rule{0.166667em}{0ex}}0<\nu -\mu <2({r}_{\mathrm{up}}-2\mu )\}.}\hfill \end{array}\end{array}$$
Robust $(\nu ,\mu )$SVM with $(\nu ,\mu )\in {\mathsf{\Lambda}}_{\mathrm{low}}$ reaches the breakdown point μ for any noncontaminated dataset D such that ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$ for the given ${D}^{\prime}$. On the other hand, the parameters $(\nu ,\mu )$ outside of ${\mathsf{\Lambda}}_{\mathrm{up}}$ are not necessary. Indeed, for any noncontaminated data D such that ${D}^{\prime}\in {\mathcal{D}}_{\tilde{\mu}m}$ for the given ${D}^{\prime}$, the parameters $(\nu ,\mu )$ satisfying $\nu -\mu >2({r}_{\mathrm{up}}-2\mu )$ do not yield a learning method that reaches the breakdown point μ.
When the upper bound $\tilde{\mu}$ is unknown, we set $\tilde{\mu}=r/2$. As noted in the comments after Theorem 1, an outlier ratio greater than $r/2$ can totally violate the statistical features of the original dataset. In such a case, we need to reconsider the observation process. For $\tilde{\mu}=r/2$, we obtain ${\overline{r}}_{\mathrm{low}}\le r\le {\overline{r}}_{\mathrm{up}}$, where ${\overline{r}}_{\mathrm{low}}=2{r}^{\prime}/3$ and ${\overline{r}}_{\mathrm{up}}=\min \{2{r}^{\prime},1/2\}$. Hence, in the worst case, the admissible set of learning parameters ν and μ is:
$$\begin{array}{c}\hfill \begin{array}{c}{\displaystyle {\overline{\mathsf{\Lambda}}}_{\mathrm{low}}=\{(\nu ,\mu )\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}0<\nu -\mu <2({\overline{r}}_{\mathrm{low}}-2\mu )\},\phantom{\rule{4pt}{0ex}}\mathrm{or}}\hfill \\ {\displaystyle \phantom{\rule{4pt}{0ex}}{\overline{\mathsf{\Lambda}}}_{\mathrm{up}}=\{(\nu ,\mu )\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}0<\nu -\mu <2({\overline{r}}_{\mathrm{up}}-2\mu )\}.}\hfill \end{array}\end{array}$$
Given contaminated training data ${D}^{\prime}$, for any D of size m with a label ratio $r\in [{\overline{r}}_{\mathrm{low}},{\overline{r}}_{\mathrm{up}}]$, such that ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$ with $\mu <{\overline{r}}_{\mathrm{low}}/2$, robust $(\nu ,\mu )$SVM with $(\nu ,\mu )\in {\overline{\mathsf{\Lambda}}}_{\mathrm{low}}$ provides a classifier with the breakdown point μ. A parameter $(\nu ,\mu )$ on the outside of ${\overline{\mathsf{\Lambda}}}_{\mathrm{up}}$ is not necessary, for the same reasons as for ${\mathsf{\Lambda}}_{\mathrm{up}}$.
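As an illustration, grid-search candidates inside the admissible region can be generated along the following lines. This is our own sketch: the helper name and grid sizes are hypothetical, and the worst-case branch uses ${\overline{r}}_{\mathrm{up}}=\min \{2{r}^{\prime},1/2\}$ from the discussion above.

```python
import numpy as np

def admissible_grid(r_prime, mu_tilde=None, n_mu=5, n_nu=5):
    # Candidate (nu, mu) pairs satisfying 0 < nu - mu < 2 * (r_up - 2 * mu).
    if mu_tilde is None:                   # worst case: r_up = min(2 r', 1/2)
        r_up = min(2.0 * r_prime, 0.5)
        mu_max = r_up / 2.0                # keep the admissible width positive
    else:
        r_up = min(r_prime + mu_tilde, 0.5)
        mu_max = mu_tilde
    pairs = []
    for mu in np.linspace(0.0, mu_max, n_mu, endpoint=False):
        width = 2.0 * (r_up - 2.0 * mu)    # nu - mu must lie in (0, width)
        if width <= 0:
            continue
        for t in np.linspace(0.1, 0.9, n_nu):
            pairs.append((mu + t * width, mu))
    return pairs
```

Every returned pair lies strictly inside the admissible region, so the grid never wastes crossvalidation runs on parameters that cannot reach the breakdown point μ.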
The admissible region of $(\nu ,\mu )$ is useful when the parameters are determined by a grid search based on crossvalidation. On the other hand, C of robust CSVM and λ in ROD can take values in a wide range of positive real numbers. Hence, unlike robust $(\nu ,\mu )$SVM, these algorithms need heuristics to determine the region of the grid search for the learning parameters.
The numerical experiments presented in Section 6 applied a grid search to the region ${\overline{\mathsf{\Lambda}}}_{\mathrm{up}}$.
6. Numerical Experiments
We conducted numerical experiments on synthetic and benchmark datasets to compare a number of SVMs. Algorithm 1 was used for robust $(\nu ,\mu )$SVM, and DCA in [12] was used for robust CSVM with the ramp loss. We used CPLEX to solve the convex quadratic problems.
6.1. DCA versus Global Optimization Methods
As has been shown in many studies, including [37], DCA quite often gives global optimal solutions to a wide variety of nonconvex optimization problems. We examined how often DCA produces global optimal solutions to robust $(\nu ,\mu )$SVM with the 0-1 valued outlier indicator. Here, the numerical solution of DCA in robust $(\nu ,\mu )$SVM denotes the output of Step 5 in Algorithm 1. In these numerical experiments, the optimization problem was formulated as a mixed integer programming (MIP) problem, and the CPLEX MIP solver was used to compute the global optimal solution of robust $(\nu ,\mu )$SVM on a relatively small dataset. The numerical solution given by DCA was compared with the global optimal solution.
In binary classification problems, positive (resp. negative) samples were generated from a multivariate normal distribution with mean ${\mu}_{p}={1}_{d}\in {\mathbb{R}}^{d}$ (resp. ${\mu}_{n}=-{1}_{d}\in {\mathbb{R}}^{d}$) and a variancecovariance matrix $cI$, where I is the identity matrix and c is a positive constant. Each class had 20 samples. For such a small dataset, the global optimal solution was obtained by the CPLEX MIP solver. Outliers were added by flipping positive labels randomly, and the outlier ratio was $10\%$. The DCA with the multistart method was used to solve the robust $(\nu ,\mu )$SVM using the linear kernel. In the multistart method, a number of initial points were randomly generated, and for each initial point, a numerical solution was obtained by DCA. Among these numerical solutions, the point that attained the smallest objective value was chosen as the output of the multistart method. $\mathrm{opt}\left(\mathrm{DCA}\right)$ denotes the objective value at the numerical solution of DCA, and $\mathrm{opt}\left(\mathrm{MIP}\right)$ denotes the global optimal value. Note that the optimal value of the problem in robust $(\nu ,\mu )$SVM is nonpositive, i.e., $\mathrm{opt}\left(\mathrm{MIP}\right)\le 0$. In addition, one can find that any numerical solution obtained by DCA satisfies $\mathrm{opt}\left(\mathrm{DCA}\right)\le 0$.
In the numerical experiments, 100 training datasets such that $\mathrm{opt}\left(\mathrm{MIP}\right)<-{10}^{-4}$ were randomly generated, and $\mathrm{opt}\left(\mathrm{DCA}\right)$ was computed for each dataset. Table 1 shows the number of times that $\mathrm{opt}\left(\mathrm{DCA}\right)/\mathrm{opt}\left(\mathrm{MIP}\right)\ge 0.97$ held out of 100 trials. When the achievable lowest test error, i.e., the Bayes error, was large, DCA tended to yield a local optimal solution that was not globally optimal. When the Bayes error was small, DCA produced approximately global optimal solutions in almost all trials. Even when DCA using a single initial point failed to find the global optimal solution, the multistart method with five or 10 initial points greatly improved the quality of the numerical solutions. In these numerical experiments, DCA was more than 50 times more computationally efficient than the MIP solver.
6.2. Computational Cost
We conducted numerical experiments to compare the computational cost of robust $(\nu ,\mu )$SVM with that of robust CSVM. Both learning algorithms employed the DCA. The numerical experiments were conducted on AMD Opteron Processors 6176 (2.3 GHz) with 48 cores, running CentOS Linux Release 6.4. We used three benchmark datasets, Sonar, BreastCancer and spam, which were also used in the experiments in Section 6.5. m training samples were randomly chosen from each dataset, and each dataset was contaminated by outliers. The outlier ratio was $5\%$, and outliers were added by flipping the labels randomly. Robust $(\nu ,\mu )$SVM and robust CSVM with the linear kernel were used to obtain classifiers from the contaminated datasets. This process was repeated 20 times for each dataset. Table 2 presents the average computation time and average ratio of support vectors (SV ratio) together with standard deviations. A support vector was numerically identified as a data point ${x}_{i}$ whose coefficient ${\alpha}_{i}$ is greater than ${10}^{-10}$. Although the SV ratio is bounded below by $\nu -\mu $, the bound was not necessarily tight. A similar tendency is often observed in νSVM. In terms of the computation time, the two learning algorithms were not significantly different, except in the case of robust CSVM with a small C, which induces strong regularization.
6.3. Outlier Detection
Robust $(\nu ,\mu )$SVM uses an outlier indicator to suppress the influence of outliers. Figure 3 shows that the outlier indicator in the robust $(\nu ,\mu )$SVM using the linear kernel is able to detect outliers in a synthetic setting. Similar results have been reported for learning methods using outlier indicators such as ROD and ERSVM. Systematic experiments using a recallprecision criterion were presented in [17,19].
6.4. Breakdown Point
We investigate the validity of Inequality (14) in Theorem 1. In the numerical experiments, the original data D were generated using mlbench.spirals in the mlbench library of the R language [38]. Given an outlier ratio μ, positive samples of size $\mu m$ were randomly chosen from D, and they were replaced with randomly generated outliers to obtain a contaminated dataset ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$. The original data D and an example of the contaminated data ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$ are shown in Figure 4. The decision function $g\left(x\right)=f\left(x\right)+b$ was estimated from ${D}^{\prime}$ by using robust $(\nu ,\mu )$SVM. Here, the true outlier ratio μ was used as the parameter of the learning algorithm. The norms of f and b were then calculated. The above process was repeated 30 times for each pair of parameters $(\nu ,\mu )$, and the maximum values of ${\parallel f\parallel}_{\mathcal{H}}$ and $|b|$ were computed.
Figure 5 shows the results of the numerical experiments. The maximum norm of the estimated decision function is plotted for the parameter $(\mu ,\nu -\mu )$ on the same axes as in Figure 2. The top (bottom) panels show the results for a Gaussian (linear) kernel. The left and middle columns show the maximum norm of f and b, respectively. The maximum test errors are presented in the right column. In all panels, the red points denote the top 50 percent of values, and the asterisks (∗) are the points that violate the inequality $\nu -\mu \le 2(r-2\mu )$. In this example, the numerical results agree with the theoretical analysis in Section 4; i.e., the norm becomes large when the inequality $\nu -\mu \le 2(r-2\mu )$ is violated. Accordingly, the test error gets close to $0.5$; that is, the classifier provides no information for classification. Even when the unbounded linear kernel is used, robustness is confirmed for the parameters in the lower left region in the right panel of Figure 2.
In the bottom right panel, the test error gets large even when the inequality $\nu -\mu \le 2(r-2\mu )$ holds. This result comes from the problem setup. Even with noncontaminated data, the test error of the standard νSVM is approximately $0.5$, because the linear kernel works poorly for spiral data. Thus, the worstcase test error under the target distribution can go beyond $0.5$. For the parameters at which (14) is violated, the test error is always close to $0.5$. Thus, a learning method with such parameters does not provide any useful information for classification.
6.5. Prediction Accuracy
As shown in Section 5, the theoretical analysis of the breakdown point yields an admissible region, such as ${\overline{\mathsf{\Lambda}}}_{\mathrm{up}}$, for the learning parameters in robust $(\nu ,\mu )$SVM. Learning parameters outside the admissible region produce an unstable learning algorithm. Hence, one can reduce the computational cost of tuning the learning parameters by discarding candidates outside the admissible region. In this section, we verify the usefulness of the admissible region.
We compared the generalization ability of robust $(\nu ,\mu )$SVM with νSVM and robust CSVM using the ramp loss. In robust $(\nu ,\mu )$SVM, a grid search of the region ${\overline{\mathsf{\Lambda}}}_{\mathrm{up}}$ is used to choose the learning parameters, ν and μ.
The datasets are presented in Table 3. The datasets are from the mlbench and kernlab libraries of the R language [38]. The number of positive samples in these datasets is less than or equal to the number of negative samples. Before running the learning algorithms, we standardized each input variable to be mean zero and standard deviation one.
We randomly split the dataset into training and test sets. To evaluate the robustness, the training data were contaminated by outliers. More precisely, we randomly chose positively labeled samples in the training data and changed their labels to negative; i.e., we added outliers by flipping the labels. After that, robust $(\nu ,\mu )$SVM, robust CSVM using the ramp loss and the standard νSVM were used to obtain classifiers from the contaminated training dataset. The prediction accuracy of each classifier was evaluated over test data that had no outliers. Linear and Gaussian kernels were employed for each learning algorithm. The learning parameters, such as $\mu ,\nu $ and C, were determined by conducting a grid search based on fivefold crossvalidation over the training data. For robust $(\nu ,\mu )$SVM, the parameter $(\mu ,\nu )$ was selected from the admissible region ${\overline{\mathsf{\Lambda}}}_{\mathrm{up}}$ in (16). For the standard νSVM, the candidates of the regularization parameter ν were selected from the interval $(0,2{r}^{\prime})$, where ${r}^{\prime}$ is the label ratio of the contaminated training data. For robust CSVM, the regularization parameter C was selected from the interval $[{10}^{-7},\phantom{\rule{0.166667em}{0ex}}{10}^{7}]$. In the grid search of the parameters, 24 or 25 candidates were examined for each learning method. Thus, we needed to solve convex or nonconvex optimization problems more than $24\times 5$ times in order to obtain a classifier. The above process was repeated 30 times, and the average test error was calculated.
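The contamination protocol can be sketched as follows. This is our own minimal code, not the paper's experimental script; the function name and the seeding are hypothetical choices.

```python
import random

def flip_positive_labels(y, outlier_ratio, seed=0):
    # Flip the labels of a randomly chosen set of positive training
    # samples to negative; the number of flips is outlier_ratio * m.
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == +1]
    n_out = int(outlier_ratio * len(y))
    flipped = set(rng.sample(pos, n_out))
    return [-1 if i in flipped else yi for i, yi in enumerate(y)]
```

Note that label flipping leaves the inputs ${x}_{i}$ untouched, so the outliers are plausible points with wrong labels, which is exactly the contamination that the ramp loss and the outlier indicator are designed to absorb.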
The results are presented in Table 3. For non-contaminated training data, robust $(\nu ,\mu )$-SVM and robust C-SVM were comparable to the standard ν-SVM. When the outlier ratio is high, robust $(\nu ,\mu )$-SVM and robust C-SVM tend to work better than the standard ν-SVM. In this experiment, the kernel function does not affect the relative prediction performance of these learning methods. On large datasets, such as spam and Satellite, robust $(\nu ,\mu )$-SVM tends to outperform robust C-SVM. When the learning parameters, such as ν, μ and C, are appropriately chosen using a large dataset, the learning algorithms with multiple learning parameters clearly work better than those with a single learning parameter. In addition, choosing the regularization parameter of robust C-SVM is difficult: the parameter C has no clear meaning, so it is not easy to determine its candidates in the grid search. In contrast, ν in ν-SVM and its robust variant has a clear meaning, i.e., a lower bound on the ratio of support vectors and an upper bound on the margin error over the training data [5]. Such a clear meaning is helpful for choosing candidate regularization parameters.
7. Concluding Remarks
We have investigated the breakdown point of robust variants of SVMs. The theoretical analysis provides inequalities on the learning parameters ν and μ in robust $(\nu ,\mu )$-SVM that guarantee the robustness of the learning algorithm. Numerical experiments showed that these inequalities are critical to obtaining a robust classifier. The exact evaluation of the breakdown point of robust $(\nu ,\mu )$-SVM enables us to restrict the range of the learning parameters and thus to increase the chance of finding a robust classifier with good performance at the same computational cost.
In our paper, the dual representation of robust SVMs is applied to the calculation of the breakdown point. Theoretical analysis using the dual representation can be a powerful tool for the detailed analysis of other learning algorithms.
On the theoretical side, it is interesting to establish the relationship between robustness, such as the breakdown point, and the convergence speed of learning algorithms, as presented for parametric inference in mathematical statistics [34] (Chapter 2.4). Furthermore, it is important to determine the optimal choice of $(\nu ,\mu )$ in robust $(\nu ,\mu )$-SVM as an extension of the parameter choice for ν-SVM [39]. Another important issue is to develop efficient optimization algorithms. Although the DC algorithm [12,27] and convex relaxation [14,17] are promising methods, more scalable algorithms will be required to deal with massive datasets, which are often contaminated by outliers. Recently, a computationally efficient algorithm, called iteratively weighted SVM (IWSVM), was developed to solve the optimization problems in robust C-SVM and its variants [40]. Moreover, a fixed point of IWSVM is guaranteed to be a local optimal solution obtained by the DC algorithm. It will be worthwhile to investigate the applicability of IWSVM to robust $(\nu ,\mu )$-SVM.
Acknowledgments
This work was supported by JSPS KAKENHI, Grant Numbers 16K00044 and 15K00031.
Author Contributions
Takafumi Kanamori and Akiko Takeda contributed the theoretical analysis; Takafumi Kanamori and Shuhei Fujiwara performed the experiments; Takafumi Kanamori and Akiko Takeda wrote the paper. All authors have read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proof of Theorem 1
The proof is decomposed into two lemmas. Lemma A1 shows that Condition (i) is sufficient for Condition (ii), and Lemma A2 shows that Condition (ii) does not hold if Inequality (14) is violated. For the dataset $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$, let ${I}_{+}$ and ${I}_{-}$ be the index sets defined by ${I}_{\pm}=\{i:{y}_{i}=\pm 1\}$. When the parameter μ is equal to zero, the theorem holds according to the argument for the standard ν-SVM [29]. Below, we assume $\mu >0$.
Lemma A1.
Under the assumptions of Theorem 1, Condition (i) leads to Condition (ii).
Proof of Lemma A1.
We will show that ${\mathcal{V}}_{\eta}[\nu ,\mu ;{D}^{\prime}]$ is not empty for any ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$. For a contaminated dataset ${D}^{\prime}=\{({x}_{i}^{\prime},{y}_{i}^{\prime}):i\in \left[m\right]\}\in {\mathcal{D}}_{\mu m}$, let us define ${\tilde{I}}_{+}\subset {I}_{+}$ as the index set such that the sample $({x}_{i},{y}_{i})\in D$ for $i\in {\tilde{I}}_{+}$ is replaced with $({x}_{i}^{\prime},{y}_{i}^{\prime})\in {D}^{\prime}$ as an outlier. In the same way, ${\tilde{I}}_{-}\subset {I}_{-}$ is defined for the negative samples in D. Therefore, for any index i in ${I}_{+}\setminus {\tilde{I}}_{+}$ or ${I}_{-}\setminus {\tilde{I}}_{-}$, we have $({x}_{i},{y}_{i})=({x}_{i}^{\prime},{y}_{i}^{\prime})$. The assumptions of the theorem ensure $|{\tilde{I}}_{+}|+|{\tilde{I}}_{-}|\le \mu m$. Let us define ${J}_{\eta ,+}=\{i\in {I}_{+}\setminus {\tilde{I}}_{+}:{\eta}_{i}=1\}$ and ${J}_{\eta ,-}=\{i\in {I}_{-}\setminus {\tilde{I}}_{-}:{\eta}_{i}=1\}$. These sets are not empty. Indeed, we have:
$$\begin{array}{c}\hfill |{J}_{\eta ,+}|\ge {m}_{+}-\mu m-\mu m\ge \frac{(\nu -\mu )m}{2}>0,\end{array}$$
where Condition (i) in Theorem 1 is used in the second inequality. Likewise, we have $|{J}_{\eta ,-}|>0$.
We define two points in $\mathcal{H}$ as:
$$\begin{array}{cc}\hfill {f}_{\eta ,+}& =\frac{1}{|{J}_{\eta ,+}|}\sum _{i\in {J}_{\eta ,+}}k(\cdot ,{x}_{i}^{\prime})=\frac{1}{|{J}_{\eta ,+}|}\sum _{i\in {J}_{\eta ,+}}k(\cdot ,{x}_{i}),\hfill \\ \hfill {f}_{\eta ,-}& =\frac{1}{|{J}_{\eta ,-}|}\sum _{i\in {J}_{\eta ,-}}k(\cdot ,{x}_{i}^{\prime})=\frac{1}{|{J}_{\eta ,-}|}\sum _{i\in {J}_{\eta ,-}}k(\cdot ,{x}_{i}).\hfill \end{array}$$
Then, we have:
$$\begin{array}{cc}\hfill {f}_{\eta ,+}& \in {\mathcal{U}}_{\eta}^{+}[\nu ,\mu ;{D}^{\prime}]\cap \mathrm{conv}\{k(\cdot ,{x}_{i})\,:\,i\in {I}_{+}\},\hfill \\ \hfill {f}_{\eta ,-}& \in {\mathcal{U}}_{\eta}^{-}[\nu ,\mu ;{D}^{\prime}]\cap \mathrm{conv}\{k(\cdot ,{x}_{i})\,:\,i\in {I}_{-}\}.\hfill \end{array}$$
Indeed, $1/|{J}_{\eta ,+}|$ and $1/|{J}_{\eta ,-}|$ are both less than or equal to $\frac{2}{(\nu -\mu )m}$ due to (A1), and ${\eta}_{i}=1$ holds for all $i\in {J}_{\eta ,+}\cup {J}_{\eta ,-}$.
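The counting bound behind (A1) is elementary and can be checked numerically. Below is a small sanity check with illustrative values of $m$, $r$, $\nu$ and $\mu$ (not taken from the paper): at most $\mu m$ positive samples are replaced by outliers and at most $\mu m$ indices have ${\eta}_{i}=0$, so at least ${m}_{+}-2\mu m$ indices remain in ${J}_{\eta ,+}$.

```python
# Numerical check of the counting bound (A1) with illustrative values.
m, r = 200, 0.3
m_plus = int(r * m)             # 60 positive samples (minority class)
nu, mu = 0.3, 0.05              # Condition (i): nu - mu = 0.25 <= 2*(r - 2*mu) = 0.4
lower = m_plus - 2 * mu * m     # worst-case size of J_{eta,+}
# 1/|J_{eta,+}| <= 2/((nu - mu) m) then follows, as used in the proof
print(lower, (nu - mu) * m / 2)
```

Running the check confirms that the worst-case size of ${J}_{\eta ,+}$ dominates $(\nu -\mu )m/2$, so the coefficient bound $1/|{J}_{\eta ,+}|\le 2/((\nu -\mu )m)$ holds.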
Now, let us prove the inequality,
$$\begin{array}{c}\hfill \underset{{D}^{\prime}\in {\mathcal{D}}_{\mu m}}{sup}\underset{\eta \in {E}_{\mu}}{max}\underset{f\in {\mathcal{V}}_{\eta}[\nu ,\mu ;{D}^{\prime}]}{inf}{\parallel f\parallel}_{\mathcal{H}}^{2}<\infty .\end{array}$$
The above argument leads to:
$$\begin{array}{c}\hfill \underset{f\in {\mathcal{V}}_{\eta}[\nu ,\mu ;{D}^{\prime}]}{min}{\parallel f\parallel}_{\mathcal{H}}^{2}\le {\parallel {f}_{\eta ,+}-{f}_{\eta ,-}\parallel}_{\mathcal{H}}^{2}\end{array}$$
for any $\eta \in {E}_{\mu}$. Let us define:
$$\begin{array}{c}\hfill \mathcal{C}\left[D\right]=\mathrm{conv}\{k(\cdot ,{x}_{i})\,:\,i\in {I}_{+}\}\ominus \mathrm{conv}\{k(\cdot ,{x}_{i})\,:\,i\in {I}_{-}\}\end{array}$$
for the original dataset D. Then, we obtain:
$$\begin{array}{cc}\hfill \mathrm{opt}(\nu ,\mu ;{D}^{\prime})& =\underset{\eta \in {E}_{\mu}}{max}\underset{f\in {\mathcal{V}}_{\eta}[\nu ,\mu ;{D}^{\prime}]}{min}{\parallel f\parallel}_{\mathcal{H}}^{2}\hfill \\ & \le \underset{\eta \in {E}_{\mu}}{max}{\parallel {f}_{\eta ,+}-{f}_{\eta ,-}\parallel}_{\mathcal{H}}^{2}\hfill \\ & \le \underset{\eta \in {E}_{\mu}}{max}\underset{f\in \mathcal{C}\left[D\right]}{max}{\parallel f\parallel}_{\mathcal{H}}^{2}\hfill \\ & =\underset{f\in \mathcal{C}\left[D\right]}{max}{\parallel f\parallel}_{\mathcal{H}}^{2}\hfill \\ & \le 4\underset{i\in \left[m\right]}{max}k({x}_{i},{x}_{i})<\infty .\phantom{\rule{2.em}{0ex}}\left(\mathrm{triangle}\phantom{\rule{4.pt}{0ex}}\mathrm{inequality}\right)\hfill \end{array}$$
The upper bound does not depend on the contaminated dataset ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$. Thus, the inequality (A2) holds. ☐
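The final bound ${\parallel {f}_{\eta ,+}-{f}_{\eta ,-}\parallel}_{\mathcal{H}}^{2}\le 4\,{max}_{i}\,k({x}_{i},{x}_{i})$ can be verified numerically through the kernel Gram matrix, since the squared RKHS norm of a kernel expansion is a quadratic form in its coefficients. The data and Gaussian kernel below are illustrative:

```python
import numpy as np

def gauss_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel Gram matrix between the rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
Xp = rng.normal(1.0, 1.0, (30, 2))      # positive-class points (illustrative)
Xn = rng.normal(-1.0, 1.0, (40, 2))     # negative-class points
X = np.vstack([Xp, Xn])
K = gauss_kernel(X, X)

# coefficient vector of f_{eta,+} - f_{eta,-}: uniform weights on each class
a = np.concatenate([np.full(30, 1 / 30), np.full(40, -1 / 40)])
sq_norm = a @ K @ a                      # ||f_{eta,+} - f_{eta,-}||_H^2
bound = 4 * K.diagonal().max()           # 4 * max_i k(x_i, x_i) = 4 here
print(sq_norm, bound)
```

Because each of ${f}_{\eta ,\pm}$ is a convex combination of unit-norm features (for the Gaussian kernel, $k(x,x)=1$), the triangle inequality gives ${\parallel {f}_{\eta ,+}-{f}_{\eta ,-}\parallel}_{\mathcal{H}}\le 2$, which the computed value respects.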
Lemma A2.
Under the conditions of Theorem 1, we assume $\nu -\mu >2(r-2\mu )$. Then, we have:
$$\begin{array}{c}\hfill sup\{\mathrm{opt}(\nu ,\mu ;{D}^{\prime}):{D}^{\prime}\in {\mathcal{D}}_{\mu m}\}=\infty .\end{array}$$
Proof of Lemma A2.
We will use the same notation as in the proof of Lemma A1. Without loss of generality, we can assume $r=|{I}_{-}|/m$. We will prove that there exist a feasible parameter ${\eta}^{\prime}\in {E}_{\mu}$ and a contaminated training set ${D}^{\prime}=\{({x}_{i}^{\prime},{y}_{i}^{\prime}):i\in \left[m\right]\}\in {\mathcal{D}}_{\mu m}$ such that ${\mathcal{U}}_{{\eta}^{\prime}}^{-}[\nu ,\mu ;{D}^{\prime}]$ becomes empty. The construction of the dataset ${D}^{\prime}$ is illustrated in Figure A1. Suppose that $|{\tilde{I}}_{+}|=0$ and $|{\tilde{I}}_{-}|=\mu m$ and that ${y}_{i}^{\prime}=+1$ holds for all $i\in {\tilde{I}}_{-}$, meaning that all outliers in ${D}^{\prime}$ are made by flipping the labels of negative samples in D. This is possible because $\mu m<|{I}_{-}|/2<|{I}_{-}|$ holds. The outlier indicator ${\eta}^{\prime}=({\eta}_{1}^{\prime},\dots ,{\eta}_{m}^{\prime})\in {E}_{\mu}$ is defined by ${\eta}_{i}^{\prime}=0$ for $\mu m$ samples in ${I}_{-}\setminus {\tilde{I}}_{-}$, and ${\eta}_{i}^{\prime}=1$ otherwise. This assignment is possible because $|{I}_{-}\setminus {\tilde{I}}_{-}|=|{I}_{-}|-\mu m>\mu m$. Then, we have:
$$\begin{array}{cc}\hfill |{J}_{{\eta}^{\prime},-}|& =|\{i\in {I}_{-}\setminus {\tilde{I}}_{-}\,:\,{\eta}_{i}^{\prime}=1\}|=|{I}_{-}\setminus {\tilde{I}}_{-}|-\mu m=|{I}_{-}|-2\mu m<\frac{(\nu -\mu )m}{2},\hfill \end{array}$$
where $\nu -\mu >2(r-2\mu )$ is used in the last inequality. In addition, ${y}_{i}^{\prime}=-1$ holds only when $i\in {I}_{-}\setminus {\tilde{I}}_{-}$. Therefore, we have ${\mathcal{U}}_{{\eta}^{\prime}}^{-}[\nu ,\mu ;{D}^{\prime}]=\varnothing $. The infeasibility of the dual problem means that the primal problem is unbounded or infeasible. In this case, the infeasibility of the primal problem is excluded. Hence, a contaminated dataset ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$ and an outlier indicator ${\eta}^{\prime}\in {E}_{\mu}$ exist such that:
$$\begin{array}{c}\hfill \mathrm{opt}(\nu ,\mu ;{D}^{\prime})\ge \underset{f\in {\mathcal{V}}_{{\eta}^{\prime}}[\nu ,\mu ;{D}^{\prime}]}{min}{\parallel f\parallel}_{\mathcal{H}}^{2}=\infty \end{array}$$
holds. ☐
Figure A1.
Index sets ${\tilde{I}}_{\pm}$ and value of ${\eta}_{i}^{\prime}$ defined in the proof of Lemma A2.
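The infeasibility argument of Lemma A2 can be checked by counting. Assuming, as the argument around (A1) suggests, that each dual coefficient is capped at $2/((\nu -\mu )m)$ and the coefficients of the negative class must sum to one, too few surviving negative indices make that constraint unsatisfiable. The values below are illustrative, chosen so that $\mu m$ and $\nu m$ are integers:

```python
# Numerical check of the infeasibility counting in Lemma A2 (illustrative values).
m, mu = 200, 0.05
I_minus = 30                      # negatives are the minority: r = 30/200 = 0.15
r = I_minus / m
nu = 0.25                         # violates (14): nu - mu = 0.2 > 2*(r - 2*mu) = 0.1

J = I_minus - 2 * int(mu * m)     # negatives with eta_i' = 1 left: 30 - 20 = 10
cap = 2 / ((nu - mu) * m)         # assumed per-coefficient cap in the dual
print(J, (nu - mu) * m / 2, J * cap)
```

Since the total available coefficient mass `J * cap` falls short of one, no feasible point of ${\mathcal{U}}_{{\eta}^{\prime}}^{-}[\nu ,\mu ;{D}^{\prime}]$ exists in this configuration, matching the lemma.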
Appendix B. Proof of Theorem 2
Proof.
For a rational number $\mu \in (0,1/4)$, there exists an $m\in \mathbb{N}$ such that $\mu m\in \mathbb{N}$ and $2\mu m+1\le m-(2\mu m+1)$ hold. For such m, let $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$ be training data such that $|{I}_{-}|=2\mu m+1$ and $|{I}_{+}|=m-(2\mu m+1)$, where the index sets ${I}_{\pm}$ are defined as in Appendix A. Since the label ratio of D is $r=min\{|{I}_{-}|,|{I}_{+}|\}/m=2\mu +1/m$, we have $\mu <r/2$. For ${\mathcal{D}}_{\mu m+1}$ defined from D, let ${D}^{\prime}=\{({x}_{i}^{\prime},{y}_{i}^{\prime}):i\in \left[m\right]\}\in {\mathcal{D}}_{\mu m+1}$ be a contaminated dataset of D such that $\mu m+1$ outliers are made by flipping the labels of negative samples in D. Thus, there are $\mu m$ negative samples in ${D}^{\prime}$. Let us define the outlier indicator ${\eta}^{\prime}=({\eta}_{1}^{\prime},\dots ,{\eta}_{m}^{\prime})\in {E}_{\mu}$ such that ${\eta}_{i}^{\prime}=0$ for the $\mu m$ negative samples in ${D}^{\prime}$. Then, any sample in ${D}^{\prime}$ with ${\eta}_{i}^{\prime}=1$ is a positive one. Hence, we have ${\mathcal{U}}_{{\eta}^{\prime}}^{-}[\nu ,\mu ;{D}^{\prime}]=\varnothing $. The infeasibility of the dual problem means that the primal problem is unbounded. Thus, we obtain $\mathrm{opt}(\nu ,\mu ;{D}^{\prime})=\infty $. ☐
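The arithmetic of this construction is easy to verify for a concrete choice of μ and m (illustrative values; any m with $\mu m$ integral works):

```python
# Sanity check of the dataset construction in the proof of Theorem 2.
mu = 0.1                       # rational, in (0, 1/4)
m = 100                        # mu * m = 10 is an integer
I_minus = 2 * int(mu * m) + 1  # 21 negative samples
I_plus = m - I_minus           # 79 positive samples; 21 <= 79 holds
r = min(I_minus, I_plus) / m   # label ratio r = 2*mu + 1/m = 0.21
print(r, mu < r / 2)
```

The check confirms $r=2\mu +1/m$ and $\mu <r/2$, so the construction satisfies the hypotheses of the theorem while $\mu m+1$ flips still wipe out more than half of the negative class.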
Appendix C. Proof of Theorem 3
Let us define ${f}_{D}+{b}_{D}$, with ${f}_{D}\in \mathcal{H}$ and ${b}_{D}\in \mathbb{R}$, as the decision function estimated by robust $(\nu ,\mu )$-SVM from the dataset D.
Proof.
The non-contaminated dataset is denoted by $D=\{({x}_{i},{y}_{i}):i\in \left[m\right]\}$. For the dataset D, let ${I}_{+}$ and ${I}_{-}$ be the index sets defined by ${I}_{\pm}=\{i:{y}_{i}=\pm 1\}$. Inequality (14) holds under the conditions of Theorem 3. Given a contaminated dataset ${D}^{\prime}=\{({x}_{i}^{\prime},{y}_{i}^{\prime}):i\in \left[m\right]\}\in {\mathcal{D}}_{\mu m-\ell}$, let ${r}_{i}^{\prime}\left(b\right)$ be the negative margin of ${f}_{{D}^{\prime}}+b$, i.e., ${r}_{i}^{\prime}\left(b\right)=-{y}_{i}^{\prime}({f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)+b)$ for $({x}_{i}^{\prime},{y}_{i}^{\prime})\in {D}^{\prime}$. For $b\in \mathbb{R}$, the function $\zeta \left(b\right)$ is defined as:
$$\begin{array}{c}\hfill \zeta \left(b\right)=\frac{1}{m}\sum _{i\in {T}_{b}}{r}_{i}^{\prime}\left(b\right),\end{array}$$
where the index set ${T}_{b}$ is defined through the sorted negative margins as follows:
$$\begin{array}{c}\hfill {T}_{b}=\left\{\sigma \left(j\right)\in \left[m\right]\,:\,\mu m+1\le j\le \nu m,\phantom{\rule{4pt}{0ex}}{r}_{\sigma \left(1\right)}^{\prime}\left(b\right)\ge \cdots \ge {r}_{\sigma \left(m\right)}^{\prime}\left(b\right)\right\}.\end{array}$$
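The function $\zeta \left(b\right)$ is straightforward to implement, which also makes its piecewise-linear behavior easy to inspect. The sketch below uses illustrative data and parameter values; the helper name and the synthetic $f({x}_{i})$ values are not from the paper:

```python
import numpy as np

def zeta(b, f_values, y, nu, mu):
    """Sketch of the piecewise-linear function zeta(b) from the proof of
    Theorem 3: sum the negative margins r_i(b) = -y_i (f(x_i) + b) ranked
    from position mu*m + 1 to nu*m in descending order, divided by m.
    f_values holds the fixed values f(x_i) (illustrative helper)."""
    m = len(y)
    r = -y * (f_values + b)          # negative margins
    top = np.sort(r)[::-1]           # descending order
    lo, hi = round(mu * m), round(nu * m)
    return top[lo:hi].sum() / m      # ranks mu*m + 1 .. nu*m

rng = np.random.default_rng(2)
f_values = rng.normal(0.0, 1.0, 100)   # fixed decision values f(x_i)
y = np.array([1] * 50 + [-1] * 50)     # balanced labels
R = np.abs(f_values).max()             # bound on |f(x_i)|
# the proof asserts zeta is increasing for b > R and decreasing for b < -R
print(zeta(R + 1, f_values, y, 0.4, 0.1), zeta(R + 2, f_values, y, 0.4, 0.1))
```

For $b>R$, every negative-labeled margin exceeds every positive-labeled one, so ${T}_{b}$ fills with negative samples and the slope of $\zeta$ is positive, exactly as the proof argues.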
The estimated bias term ${b}_{{D}^{\prime}}$ is a local optimal solution of $\zeta \left(b\right)$ because of (6). The function $\zeta \left(b\right)$ is continuous, and it is linear on any interval on which ${T}_{b}$ is unchanged. Hence, $\zeta \left(b\right)$ is a continuous piecewise linear function. Below, we prove that the local optimal solutions of $\zeta \left(b\right)$ are uniformly bounded regardless of the contaminated dataset ${D}^{\prime}\in {\mathcal{D}}_{\mu m-\ell}$. To prove the uniform boundedness, we control the slope of $\zeta \left(b\right)$.
For the non-contaminated data D, let R be a positive real number such that:
$$\begin{array}{c}\hfill sup\left\{|{f}_{{D}^{\prime \prime}}\left(x\right)|:(x,y)\in D,\,{D}^{\prime \prime}\in {\mathcal{D}}_{\mu m-\ell}\right\}\le R.\end{array}$$
The existence of R is guaranteed. Indeed, one can choose:
$$\begin{array}{c}\hfill R=\underset{{D}^{\prime \prime}\in {\mathcal{D}}_{\mu m-\ell}}{sup}{\parallel {f}_{{D}^{\prime \prime}}\parallel}_{\mathcal{H}}\cdot \underset{(x,y)\in D}{max}\sqrt{k(x,x)}<\infty ,\end{array}$$
because the RKHS norm of ${f}_{{D}^{\prime \prime}}$ is uniformly bounded above for ${D}^{\prime \prime}\in {\mathcal{D}}_{\mu m-\ell}$ and D is a finite set. For the contaminated dataset ${D}^{\prime}=\{({x}_{i}^{\prime},{y}_{i}^{\prime})\,:\,i\in \left[m\right]\}\in {\mathcal{D}}_{\mu m-\ell}$, let us define the index sets ${I}_{\pm}^{\prime},{I}_{\mathrm{in},\pm}^{\prime}$ and ${I}_{\mathrm{out},\pm}^{\prime}$ for each label by:
$$\begin{array}{cc}\hfill {I}_{\pm}^{\prime}& =\{i\in \left[m\right]\,:\,{y}_{i}^{\prime}=\pm 1\},\hfill \\ \hfill {I}_{\mathrm{in},\pm}^{\prime}& =\{i\in {I}_{\pm}^{\prime}\,:\,|{f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)|\le R\},\hfill \\ \hfill {I}_{\mathrm{out},\pm}^{\prime}& =\{i\in {I}_{\pm}^{\prime}\,:\,|{f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)|>R\}.\hfill \end{array}$$
For any non-contaminated sample $({x}_{i},{y}_{i})\in D$, we have $|{f}_{{D}^{\prime}}\left({x}_{i}\right)|\le R$. Hence, $({x}_{i}^{\prime},{y}_{i}^{\prime})\in {D}^{\prime}$ for $i\in {I}_{\mathrm{out},\pm}^{\prime}$ must be an outlier that is not included in D. This fact leads to:
$$\begin{array}{cc}& |{I}_{\mathrm{out},+}^{\prime}|+|{I}_{\mathrm{out},-}^{\prime}|\,\le \,\mu m-\ell ,\hfill \\ & |{I}_{\mathrm{in},\pm}^{\prime}|\,\ge \,|{I}_{\pm}|-(\mu m-\ell )\ge (r-\mu )m+\ell .\hfill \end{array}$$
On the basis of the argument above, we can prove two propositions:
 The function $\zeta \left(b\right)$ is increasing for $b>R$.
 The function $\zeta \left(b\right)$ is decreasing for $b<-R$.
In addition, for any ${D}^{\prime}\in {\mathcal{D}}_{\mu m-\ell}$, the absolute value of the slope of $\zeta \left(b\right)$ is greater than or equal to $1/m$ for $|b|>R$.
Let us prove the first statement. If $b>R$ holds, we have:
$$\begin{array}{c}\hfill -R+b<min\{{r}_{i}^{\prime}\left(b\right):i\in {I}_{\mathrm{in},-}^{\prime}\}\end{array}$$
from the definition of the index set ${I}_{\mathrm{in},-}^{\prime}$. Let us consider two cases:
 (i)
 for all $i\in {T}_{b}$, $-R+b<{r}_{i}^{\prime}\left(b\right)$ holds, and
 (ii)
 there exists an index $i\in {T}_{b}$ such that ${r}_{i}^{\prime}\left(b\right)\le -R+b$.
For a fixed b such that $b>R$, let us assume (i) above. Then, for any index i in ${I}_{+}^{\prime}\cap {T}_{b}$, we have $R<|{f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)|$, meaning that $i\in {I}_{\mathrm{out},+}^{\prime}$. Hence, the size of the set ${I}_{+}^{\prime}\cap {T}_{b}$ is less than or equal to $\mu m-\ell $. Therefore, the size of the set ${I}_{-}^{\prime}\cap {T}_{b}$ is greater than or equal to $(\nu -\mu )m-(\mu m-\ell )=(\nu -2\mu )m+\ell $. The first inequality of (15) leads to $(\nu -2\mu )m+\ell >\mu m-\ell $. Therefore, in the set ${T}_{b}$, the number of negative samples exceeds the number of positive samples.
For a fixed b such that $b>R$, let us assume (ii) above. Due to Inequality (A3), for any index $i\in {I}_{\mathrm{in},-}^{\prime}$, the negative margin ${r}_{i}^{\prime}\left(b\right)$ is within the top $\nu m$ of those ranked in descending order. Hence, the size of the set ${I}_{-}^{\prime}\cap {T}_{b}$ is greater than or equal to $|{I}_{\mathrm{in},-}^{\prime}|-\mu m\ge (r-2\mu )m$. Therefore, the size of the set ${I}_{+}^{\prime}\cap {T}_{b}$ is less than or equal to $(\nu -\mu )m-(r-2\mu )m=(\nu -r+\mu )m$. The second inequality of (15) leads to $(\nu -r+\mu )m<(r-2\mu )m$. Hence, also in case (ii), the negative labels dominate the positive labels in the set ${T}_{b}$.
For negative (resp. positive) samples, the negative margin is expressed as ${r}_{i}^{\prime}\left(b\right)={u}_{i}+b$ (resp. ${r}_{i}^{\prime}\left(b\right)={u}_{i}-b$) with a constant ${u}_{i}\in \mathbb{R}$. Thus, the continuous piecewise linear function $\zeta \left(b\right)$ is expressed as:
$$\begin{array}{c}\hfill \zeta \left(b\right)=\frac{{c}_{b}+b\cdot {a}_{b}}{m},\end{array}$$
where ${a}_{b},{c}_{b}\in \mathbb{R}$ are constants as long as ${T}_{b}$ is unchanged. As proven above, ${a}_{b}$ is a positive integer, since negative samples outnumber positive samples in ${T}_{b}$ when $b>R$. As a result, local optimal solutions of the bias term satisfy:
$$\begin{array}{c}\hfill sup\{{b}_{{D}^{\prime}}:{D}^{\prime}\in {\mathcal{D}}_{\mu m-\ell}\}\le R.\end{array}$$
In the same manner, we can prove the second statement by using the fact that $b<-R$ is a sufficient condition for:
$$\begin{array}{c}\hfill -R-b<min\{{r}_{i}^{\prime}\left(b\right):i\in {I}_{\mathrm{in},+}^{\prime}\}.\end{array}$$
Then, we have:
$$\begin{array}{c}\hfill inf\{{b}_{{D}^{\prime}}:{D}^{\prime}\in {\mathcal{D}}_{\mu m-\ell}\}\ge -R.\end{array}$$
In summary, we obtain:
$$\begin{array}{c}\hfill sup\left\{|{b}_{{D}^{\prime}}|:{D}^{\prime}\in {\mathcal{D}}_{\mu m-\ell}\right\}\le R<\infty .\end{array}$$
☐
Appendix D. Proof of Theorem 4
Proof.
We will use the same notation as in the proof of Theorem 3 in Appendix C. Note that Inequality (14) holds under the assumption of Theorem 4. The reproducing property of the RKHS inner product yields:
$$\begin{array}{c}\hfill |{f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)|\,\le \,{\parallel {f}_{{D}^{\prime}}\parallel}_{\mathcal{H}}\sqrt{k({x}_{i}^{\prime},{x}_{i}^{\prime})}\,\le \,\underset{{D}^{\prime \prime}\in {\mathcal{D}}_{\mu m}}{sup}\,{\parallel {f}_{{D}^{\prime \prime}}\parallel}_{\mathcal{H}}\cdot \underset{x\in \mathcal{X}}{sup}\sqrt{k(x,x)}<\infty \end{array}$$
for any ${D}^{\prime}=\{({x}_{i}^{\prime},{y}_{i}^{\prime})\,:\,i\in \left[m\right]\}\in {\mathcal{D}}_{\mu m}$, due to the boundedness of the kernel function and Inequality (14). Hence, for a sufficiently large $R\in \mathbb{R}$, the sets ${I}_{\mathrm{out},+}^{\prime}$ and ${I}_{\mathrm{out},-}^{\prime}$ are empty for any ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$.
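The reproducing-property bound $|f(x)|\le {\parallel f\parallel}_{\mathcal{H}}\sqrt{k(x,x)}$ is an instance of the Cauchy–Schwarz inequality and can be checked numerically for any kernel expansion. The expansion points, coefficients and Gaussian kernel below are illustrative:

```python
import numpy as np

def gauss_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel Gram matrix between the rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
Xs = rng.normal(0.0, 1.0, (20, 2))      # expansion points of f (illustrative)
c = rng.normal(0.0, 1.0, 20)            # expansion coefficients
x = rng.normal(0.0, 1.0, (1, 2))        # evaluation point

K = gauss_kernel(Xs, Xs)
f_norm = np.sqrt(c @ K @ c)             # ||f||_H from the Gram matrix
f_x = c @ gauss_kernel(Xs, x)[:, 0]     # f(x) = sum_i c_i k(x, x_i)
k_xx = 1.0                              # k(x, x) = 1 for the Gaussian kernel
print(abs(f_x), f_norm * np.sqrt(k_xx))
```

With a bounded kernel, $\underset{x}{sup}\sqrt{k(x,x)}$ is finite (here equal to one), so the same computation shows why $|{f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)|$ is uniformly bounded once the RKHS norms are.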
Under Inequality (A3), suppose that $-R+b<{r}_{i}^{\prime}\left(b\right)$ holds for all $i\in {T}_{b}$. Then, for $i\in {I}_{+}^{\prime}\cap {T}_{b}$, we have $R<|{f}_{{D}^{\prime}}\left({x}_{i}^{\prime}\right)|$, and thus $i\in {I}_{\mathrm{out},+}^{\prime}$ holds. Since ${I}_{\mathrm{out},+}^{\prime}$ is the empty set, ${I}_{+}^{\prime}\cap {T}_{b}$ is also empty. Therefore, ${T}_{b}$ contains only negative samples. Let us consider the other case; i.e., there exists an index $i\in {T}_{b}$ such that ${r}_{i}^{\prime}\left(b\right)\le -R+b$. Assuming that $\nu -\mu <2(r-2\mu )$, we can prove that the negative labels dominate the positive labels in ${T}_{b}$ in the same manner as in the proof of Theorem 3. Hence, for any ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$, the function $\zeta \left(b\right)$ is strictly increasing for $b>R$. In the same way, we can prove that $\zeta \left(b\right)$ is strictly decreasing for $b<-R$. Moreover, for any ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$ and for $|b|>R$, the absolute value of the slope of $\zeta \left(b\right)$ is bounded below by $1/m$ according to the argument in the proof of Theorem 3. As a result, we obtain $sup\{|{b}_{{D}^{\prime}}|:{D}^{\prime}\in {\mathcal{D}}_{\mu m}\}\le R$. ☐
References
 Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
 Schölkopf, B.; Smola, A.J. Learning with Kernels; MIT Press: Cambridge, MA, USA, 2002. [Google Scholar]
 Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Kluwer Academic: Boston, MA, USA, 2004. [Google Scholar]
 Pérez-Cruz, F.; Weston, J.; Hermann, D.J.L.; Schölkopf, B. Extension of the ν-SVM Range for Classification. In Advances in Learning Theory: Methods, Models and Applications 190; IOS Press: Amsterdam, The Netherlands, 2003; pp. 179–196. [Google Scholar]
 Schölkopf, B.; Smola, A.; Williamson, R.; Bartlett, P. New support vector algorithms. Neural Comput. 2000, 12, 1207–1245. [Google Scholar] [CrossRef] [PubMed]
 Suykens, J.A.K.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300. [Google Scholar] [CrossRef]
 Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156. [Google Scholar] [CrossRef]
 Steinwart, I. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2001, 2, 67–93. [Google Scholar]
 Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–134. [Google Scholar] [CrossRef]
 Shen, X.; Tseng, G.C.; Zhang, X.; Wong, W.H. On ψ-learning. J. Am. Stat. Assoc. 2003, 98, 724–734. [Google Scholar] [CrossRef]
 Yu, Y.; Yang, M.; Xu, L.; White, M.; Schuurmans, D. Relaxed Clipping: A Global Training Method for Robust Regression and Classification. In Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2010; pp. 2532–2540. [Google Scholar]
 Collobert, R.; Sinz, F.; Weston, J.; Bottou, L. Trading Convexity for Scalability. In Proceedings of the ICML '06, 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; ACM Press: New York, NY, USA, 2006; pp. 201–208. [Google Scholar]
 Wu, Y.; Liu, Y. Robust truncated hinge loss support vector machines. J. Am. Stat. Assoc. 2007, 102, 974–983. [Google Scholar] [CrossRef]
 Yu, Y.; Aslan, Ö.; Schuurmans, D. A Polynomial-Time Form of Robust Regression. In Advances in Neural Information Processing Systems 25; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 2483–2491. [Google Scholar]
 Feng, Y.; Yang, Y.; Huang, X.; Mehrkanoon, S.; Suykens, J.A. Robust Support Vector Machines for Classification with Nonconvex and Smooth Losses. Neural Comput. 2016, 28, 1217–1247. [Google Scholar] [CrossRef] [PubMed]
 Tsyurmasto, P.; Uryasev, S.; Gotoh, J. Support Vector Classification with Positive Homogeneous Risk Functionals; Technical Report, Research Report 2013-4; Department of Industrial and Systems Engineering, University of Florida: Gainesville, FL, USA, 2013. [Google Scholar]
 Xu, L.; Crammer, K.; Schuurmans, D. Robust Support Vector Machine Training Via Convex Outlier Ablation. In Proceedings of the AAAI, Boston, MA, USA, 16–20 July 2006; pp. 536–542.
 Fujiwara, S.; Takeda, A.; Kanamori, T. DC Algorithm for Extended Robust Support Vector Machine; Technical Report METR 2014–38; The University of Tokyo: Tokyo, Japan, 2014. [Google Scholar]
 Takeda, A.; Fujiwara, S.; Kanamori, T. Extended robust support vector machine based on financial risk minimization. Neural Comput. 2014, 26, 2541–2569. [Google Scholar] [CrossRef] [PubMed]
 Maronna, R.; Martin, R.D.; Yohai, V. Robust Statistics: Theory and Methods; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
 Schapire, R.E.; Freund, Y.; Bartlett, P.L.; Lee, W.S. Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Stat. 1998, 26, 1651–1686. [Google Scholar] [CrossRef]
 Kimeldorf, G.S.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–95. [Google Scholar] [CrossRef]
 Wahba, G. Advances in Kernel Methods; Chapter Support Vector Machines, Reproducing Kernel Hilbert Spaces, and Randomized GACV; MIT Press: Cambridge, MA, USA, 1999; pp. 69–88. [Google Scholar]
 Takeda, A.; Sugiyama, M. ν-Support Vector Machine as Conditional Value-at-Risk Minimization. In Proceedings of the ICML, ACM International Conference Proceeding Series, Yokohama, Japan, 3–5 December 2008; Cohen, W.W., McCallum, A., Roweis, S.T., Eds.; ACM: New York, NY, USA, 2008; Volume 307, pp. 1056–1063. [Google Scholar]
 Rockafellar, R.T.; Uryasev, S. Conditional value-at-risk for general loss distributions. J. Bank. Financ. 2002, 26, 1443–1472. [Google Scholar] [CrossRef]
 Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
 Le Thi, H.A.; Dinh, T.P. Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Math. Vietnam. 1997, 22, 289–355. [Google Scholar]
 Yuille, A.L.; Rangarajan, A. The concave-convex procedure. Neural Comput. 2003, 15, 915–936. [Google Scholar] [CrossRef] [PubMed]
 Crisp, D.J.; Burges, C.J.C. A Geometric Interpretation of νSVM Classifiers. In Advances in Neural Information Processing Systems 12; Solla, S.A., Leen, T.K., Müller, K.R., Eds.; MIT Press: Cambridge, MA, USA, 2000; pp. 244–250. [Google Scholar]
 Kanamori, T.; Takeda, A.; Suzuki, T. Conjugate relation between loss functions and uncertainty sets in classification problems. J. Mach. Learn. Res. 2013, 14, 1461–1504. [Google Scholar]
 Takeda, A.; Mitsugi, H.; Kanamori, T. A Unified Robust Classification Model. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML '12, Edinburgh, Scotland, 26 June–1 July 2012; Langford, J., Pineau, J., Eds.; Omnipress: New York, NY, USA, 2012; pp. 129–136. [Google Scholar]
 Bertsekas, D.; Nedic, A.; Ozdaglar, A. Convex Analysis and Optimization; Athena Scientific: Belmont, MA, USA, 2003. [Google Scholar]
 Donoho, D.; Huber, P. The Notion of Breakdown Point. In A Festschrift for Erich L. Lehmann; CRC Press: Boca Raton, FL, USA, 1983; pp. 157–184. [Google Scholar]
 Hampel, F.R.; Rousseeuw, P.J.; Ronchetti, E.M.; Stahel, W.A. Robust Statistics. The Approach Based on Influence Functions; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 1986. [Google Scholar]
 Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; Wiley: New York, NY, USA, 2009. [Google Scholar]
 Christmann, A.; Steinwart, I. On robustness properties of convex risk minimization methods for pattern recognition. J. Mach. Learn. Res. 2004, 5, 1007–1034. [Google Scholar]
 Le Thi, H.A.; Dinh, T.P. The DC (Difference of Convex Functions) Programming and DCA Revisited with DC Models of Real World Nonconvex Optimization Problems. Ann. Oper. Res. 2005, 133, 23–46. [Google Scholar]
 R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2014. [Google Scholar]
 Steinwart, I. On the optimal parameter choice for ν-support vector machines. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 1274–1284. [Google Scholar] [CrossRef]
 Wu, Y.; Liu, Y. Adaptively weighted large margin classifiers. J. Comput. Graph. Stat. 2013, 22, 416–432. [Google Scholar] [CrossRef] [PubMed]
Figure 1.
Distribution of negative margins ${r}_{i}=-{y}_{i}(f\left({x}_{i}\right)+b),i\in \left[m\right]$ for a fixed decision function $f\left(x\right)+b$.
Figure 2.
(a) breakdown point of $({f}_{D},{b}_{D})$ given by robust $(\nu ,\mu )$-SVM with a bounded kernel; (b) breakdown point of $({f}_{D},{b}_{D})$ given by robust $(\nu ,\mu )$-SVM with an unbounded kernel.
Figure 3.
Plot of a contaminated dataset of size $m=200$. The outlier ratio is $0.05$, and the asterisks (∗) denote the outliers. In the panels of the upper (resp. lower) row, outliers are added by randomly flipping labels (resp. flipping positive labels). The dashed line is the true decision boundary, and the solid line is the decision boundary estimated using ν-SVM with $\nu =0.3$ in (a,d); robust $(\nu ,\mu )$-SVM with $(\nu ,\mu )=(0.3,0.05)$ in (b,e); and $(\nu ,\mu )=(0.3,0.1)$ in (c,f). The triangles denote the samples to which ${\eta}_{i}=0$ is assigned.
Figure 4.
(a) original data D; (b) contaminated data ${D}^{\prime}\in {\mathcal{D}}_{\mu m}$. In this example, the sample size is $m=200$, and the outlier ratio is $\mu =0.1$.
Figure 5.
Plots of maximum norms and worst-case test errors. The top (bottom) panels show the results for the Gaussian (linear) kernel. Red points denote the top 50 percent of values; the asterisks (∗) are points that violate the inequality $\nu -\mu \le 2(r-2\mu )$. (a) Gaussian kernel; (b) linear kernel.
Table 1.
Number of times that the numerical solution of the difference of convex functions algorithm (DCA) satisfies $\mathrm{opt}\left(\mathrm{DCA}\right)/\mathrm{opt}\left(\mathrm{MIP}\right)\ge 0.97$ out of 100 trials. The number of initial points used in the multistart method is denoted as #initial points. The “Dim.” and “Cov.” columns denote the dimension d and the covariance matrix of the input vectors in each label. The column labeled “Err.” shows the Bayes error of each problem setting.
Setting  Err. (%)  $(\mathit{\nu},\mathit{\mu})$ in Robust $(\mathit{\nu},\mathit{\mu})$-SVM Using Linear Kernel  

$(0.4,0.1)$  $(0.5,0.1)$  $(0.6,0.1)$  
Dim.  Cov.  #Initial Points  #Initial Points  #Initial Points  
1  5  10  1  5  10  1  5  10  
2  I  7.9  87  96  97  90  99  99  93  99  99 
5  I  1.3  98  99  100  100  100  100  99  100  100 
10  I  0.1  100  100  100  100  100  100  100  100  100 
2  $5I$  26.4  78  84  88  76  85  90  75  85  86 
5  $10I$  24.0  46  84  90  53  83  90  66  90  90 
10  $50I$  32.7  16  59  73  31  72  77  46  85  92 
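The success counts reported in Table 1 amount to comparing the DCA objective against the global optimum returned by the MIP solver over repeated trials. A minimal sketch, assuming the two lists of objective values have already been computed and that a larger ratio means closer to optimal (the helper name is ours):

```python
def count_near_optimal(dca_values, mip_values, threshold=0.97):
    """Count trials in which the DCA objective reaches at least
    `threshold` times the global optimum found by the MIP solver.

    Both sequences hold one objective value per trial, aligned by index.
    """
    return sum(1 for d, g in zip(dca_values, mip_values)
               if d / g >= threshold)
```

Each cell of Table 1 is then `count_near_optimal(...)` over 100 trials for one problem setting and one number of initial points.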
Table 2.
Computation time (Time) and ratio of support vectors (SV Ratio) of robust $(\nu ,\mu )$-SVM and robust $C$-SVM, with standard deviations in parentheses.
All results use a linear kernel. Datasets: Sonar ($m=104$), BreastCancer ($m=350$), Spam ($m=1000$).

| Method | Sonar Time (s) | Sonar SV Ratio | BreastCancer Time (s) | BreastCancer SV Ratio | Spam Time (s) | Spam SV Ratio |
|---|---|---|---|---|---|---|
| Robust $(\nu,\mu)$-SVM, $(\nu,\mu)=(0.2,0.10)$ | 1.10 (0.22) | 0.79 (0.14) | 1.02 (0.17) | 0.21 (0.11) | 13.38 (3.90) | 0.27 (0.22) |
| $(\nu,\mu)=(0.2,0.05)$ | 0.87 (0.15) | 0.75 (0.20) | 0.73 (0.13) | 0.18 (0.06) | 11.29 (2.41) | 0.64 (0.27) |
| $(\nu,\mu)=(0.3,0.10)$ | 1.17 (0.19) | 0.57 (0.12) | 0.80 (0.13) | 0.22 (0.07) | 9.65 (2.13) | 0.24 (0.04) |
| $(\nu,\mu)=(0.3,0.05)$ | 0.81 (0.09) | 0.58 (0.16) | 0.63 (0.07) | 0.28 (0.05) | 8.64 (2.12) | 0.36 (0.21) |
| $(\nu,\mu)=(0.4,0.10)$ | 1.11 (0.18) | 0.49 (0.10) | 0.83 (0.14) | 0.30 (0.03) | 8.65 (1.25) | 0.30 (0.02) |
| $(\nu,\mu)=(0.4,0.05)$ | 0.90 (0.15) | 0.62 (0.16) | 0.76 (0.12) | 0.36 (0.02) | 8.72 (1.77) | 0.38 (0.04) |
| Robust $C$-SVM, $C=10^{-7}$ | 0.12 (0.02) | 0.00 (0.00) | 0.15 (0.02) | 0.00 (0.00) | 1.62 (0.08) | 0.00 (0.00) |
| $C=1$ | 0.61 (0.07) | 0.45 (0.08) | 0.60 (0.16) | 0.04 (0.01) | 7.38 (2.36) | 0.08 (0.01) |
| $C=10^{7}$ | 1.02 (0.11) | 0.54 (0.13) | 0.68 (0.18) | 0.03 (0.01) | 10.16 (3.31) | 0.11 (0.16) |
| $C=10^{12}$ | 1.07 (0.13) | 0.47 (0.09) | 0.63 (0.17) | 0.05 (0.06) | 20.98 (5.95) | 0.30 (0.32) |
Table 3.
Test error and standard deviation of robust $(\nu ,\mu )$-SVM, robust $C$-SVM and $\nu$-SVM. The dimension of the input vector, the number of training samples, the number of test samples and the label ratio of all samples with no outliers are shown for each dataset. Linear and Gaussian kernels were used to build the classifier in each method. The outlier ratio in the training data ranged from 0% to 15%, and the test error was evaluated on the non-contaminated test data. The asterisks (*) mark the best result for a fixed kernel function in each dataset, and the double asterisks (**) mean that the corresponding method is significant at the 5% level compared with the second-best method under a one-sided t-test. The learning parameters were determined by five-fold cross-validation on the contaminated training data.
| Data | Outlier | Linear: robust $(\nu,\mu)$-SVM | Linear: robust $C$-SVM | Linear: $\nu$-SVM | Gaussian: robust $(\nu,\mu)$-SVM | Gaussian: robust $C$-SVM | Gaussian: $\nu$-SVM |
|---|---|---|---|---|---|---|---|
| Sonar: $\dim x=60$, #train $=104$, #test $=104$, $r=0.466$ | 0% | *0.258 (0.032) | 0.270 (0.038) | *0.256 (0.051) | *0.179 (0.038) | **0.188 (0.043) | *0.181 (0.039) |
| | 5% | *0.256 (0.039) | 0.273 (0.047) | *0.258 (0.046) | *0.225 (0.042) | **0.229 (0.051) | *0.224 (0.061) |
| | 10% | *0.297 (0.060) | 0.306 (0.067) | *0.314 (0.060) | *0.249 (0.059) | **0.230 (0.046) | *0.259 (0.062) |
| | 15% | *0.329 (0.061) | 0.339 (0.064) | *0.345 (0.062) | *0.280 (0.053) | **0.280 (0.050) | *0.294 (0.064) |
| BreastCancer: $\dim x=10$, #train $=350$, #test $=349$, $r=0.345$ | 0% | 0.033 (0.010) | *0.035 (0.008) | *0.033 (0.006) | **0.032 (0.008) | *0.035 (0.012) | 0.033 (0.010) |
| | 5% | 0.034 (0.009) | *0.034 (0.010) | *0.043 (0.015) | **0.032 (0.005) | *0.033 (0.007) | 0.033 (0.006) |
| | 10% | 0.055 (0.015) | *0.051 (0.026) | *0.076 (0.036) | **0.035 (0.008) | *0.043 (0.025) | 0.038 (0.008) |
| | 15% | 0.136 (0.058) | *0.120 (0.050) | *0.148 (0.058) | **0.160 (0.083) | *0.145 (0.070) | 0.150 (0.110) |
| PimaIndiansDiabetes: $\dim x=8$, #train $=384$, #test $=384$, $r=0.349$ | 0% | **0.237 (0.018) | *0.232 (0.014) | 0.246 (0.018) | *0.238 (0.021) | *0.240 (0.019) | 0.243 (0.022) |
| | 5% | **0.239 (0.019) | *0.237 (0.016) | 0.269 (0.036) | *0.264 (0.025) | *0.267 (0.024) | 0.273 (0.024) |
| | 10% | **0.280 (0.046) | *0.299 (0.042) | 0.330 (0.030) | *0.302 (0.039) | *0.293 (0.036) | 0.315 (0.038) |
| | 15% | **0.338 (0.042) | *0.349 (0.030) | 0.351 (0.026) | *0.344 (0.028) | *0.344 (0.031) | 0.353 (0.016) |
| Spam: $\dim x=57$, #train $=1000$, #test $=3601$, $r=0.394$ | 0% | **0.083 (0.005) | 0.088 (0.006) | *0.083 (0.005) | **0.081 (0.005) | 0.086 (0.006) | *0.081 (0.006) |
| | 5% | **0.094 (0.008) | 0.104 (0.013) | *0.109 (0.010) | **0.095 (0.008) | 0.097 (0.009) | *0.095 (0.008) |
| | 10% | **0.129 (0.022) | 0.152 (0.020) | *0.166 (0.067) | **0.129 (0.015) | 0.133 (0.017) | 0.141 (0.030) |
| | 15% | **0.201 (0.029) | 0.240 (0.030) | *0.256 (0.091) | **0.206 (0.018) | 0.223 (0.030) | 0.240 (0.055) |
| Satellite: $\dim x=36$, #train $=2000$, #test $=4435$, $r=0.234$ | 0% | **0.097 (0.004) | *0.096 (0.003) | **0.094 (0.003) | *0.069 (0.031) | 0.067 (0.004) | **0.063 (0.004) |
| | 5% | **0.101 (0.003) | *0.100 (0.005) | **0.100 (0.004) | *0.072 (0.015) | 0.078 (0.007) | **0.078 (0.043) |
| | 10% | **0.148 (0.020) | *0.161 (0.026) | **0.161 (0.019) | *0.117 (0.034) | 0.126 (0.040) | **0.137 (0.027) |
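The evaluation protocol behind Table 3 (select learning parameters by five-fold cross-validation on the contaminated training data, then report the error on clean test data) can be sketched generically. The helper names and the pluggable fit/predict interface below are our own, not the paper's implementation:

```python
import numpy as np

def five_fold_cv_error(fit, predict, X, y, n_folds=5, seed=0):
    """Mean validation error of a classifier over K random folds.

    `fit(X_train, y_train)` returns a model; `predict(model, X_val)`
    returns predicted labels for the validation fold.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for k in range(n_folds):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[trn], y[trn])
        errs.append(np.mean(predict(model, X[val]) != y[val]))
    return float(np.mean(errs))

def select_by_cv(param_grid, make_fit, predict, X_train, y_train):
    """Pick the learning parameter with the smallest CV error on the
    (possibly contaminated) training data."""
    return min(param_grid,
               key=lambda p: five_fold_cv_error(make_fit(p), predict,
                                                X_train, y_train))
```

The selected parameter is then used to refit on the full contaminated training set, and the reported test error is measured on the non-contaminated test samples.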
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).