Article

DCA for Sparse Quadratic Kernel-Free Least Squares Semi-Supervised Support Vector Machine

Jun Sun and Wentao Qu
1 School of Mathematics and Statistics, Linyi University, Linyi 276000, China
2 Department of Applied Mathematics, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(15), 2714; https://doi.org/10.3390/math10152714
Submission received: 12 April 2022 / Revised: 2 July 2022 / Accepted: 18 July 2022 / Published: 1 August 2022
(This article belongs to the Special Issue Computational Methods in Nonlinear Analysis and Their Applications)

Abstract

With the development of science and technology, more and more data are being produced. For many of these datasets, only some of the data have labels. In order to make full use of the information in these data, it is necessary to classify them. In this paper, we propose a strong sparse quadratic kernel-free least squares semi-supervised support vector machine (SSQLSS$^3$VM), in which we add an $\ell_0$-norm regularization term to make it sparse. An NP-hard problem arises, since the proposed model contains the $\ell_0$ norm and another nonconvex term. One important method for solving such nonconvex problems is DC (difference of convex functions) programming. Therefore, we first approximate the $\ell_0$ norm by a polyhedral DC function. Moreover, due to the existence of the nonsmooth terms, we use the sGS-ADMM to solve the subproblem. Finally, empirical numerical experiments show the efficiency of the proposed algorithm.

1. Introduction

In today’s information society, more and more data are emerging, and in order to make reasonable and full use of the information contained in them, it is common to classify or categorize the data, which usually involves a high cost in money or time. Machine learning is based on sample data. Supervised learning is a learning method that predicts a function or classifier from a labeled training dataset, and one of the most famous methods is the support vector machine (SVM), which has been widely used for data classification [1,2,3,4]. Nevertheless, in the real world, a large amount of data is unlabeled or only partially labeled, and labels are difficult to obtain. Usually, a great deal of work is needed to tag the data, which means that investing such resources at a high monetary or time cost is often unacceptable. For example, in computer-aided diagnosis (CAD) of breast cancer, a large number of collected records usually have to be marked by specially trained professionals. However, it is very difficult to collect labeled pathological records of patients: generally, it takes at least five years to mark a patient record as alive or not alive, and a major problem is that experts need to spend a lot of time labeling. Supervised learning cannot be used in these contexts, and it is difficult to obtain a learner with good prediction results and strong generalization ability by using only a small amount of labeled data. In order to overcome this difficulty, researchers have proposed semi-supervised learning methods, which have attracted more and more attention [5,6]. When only a small portion of the data has labels, exploiting the potential information contained in the unlabeled data can help improve the learning performance [7,8].
Therefore, some researchers have introduced the idea of semi-supervised learning into the SVM, leading to the semi-supervised support vector machine (S$^3$VM), which, similar to the SVM, maximizes the margin between labeled and unlabeled data points. The S$^3$VM was originally proposed by Vapnik and Sterin [9] in 1977; it requires the separating hyperplane to pass through a low-density region of the data. In 1999, Bennett and Demiriz [10] proposed the first optimization formulation of the S$^3$VM and proved that it can be re-expressed as a mixed-integer programming problem, which they then solved exactly with integer programming methods. Traditionally, since the categories of the unlabeled data need to be predicted, the S$^3$VM is formulated as a nonconvex nonsmooth optimization problem, which raises some analytical and computational difficulties. In order to solve the S$^3$VM more effectively, scholars have proposed many semi-supervised improvements, and many efficient algorithms have been developed. For more details of S$^3$VM methods, the reader is referred to [11,12,13,14,15,16,17,18] and the references therein.
In addition, for linearly inseparable datasets, most traditional methods use kernel functions to map each data point from the original space to a higher-dimensional space and then find a hyperplane in the high-dimensional space that separates all mapped data points. However, selecting an appropriate kernel function for a given dataset then becomes a new problem. To avoid this situation, Dagher [19] first proposed a kernel-free quadratic support vector machine (QSVM), which tries to find a quadratic decision function that nonlinearly separates the data with the largest margin. Based on the above model, Yan et al. [20] proposed a kernel-free quadratic S$^3$VM, called SSQSSVM. They reformulated SSQSSVM as a mixed-integer programming problem, which they proved to be equivalent to a nonconvex optimization problem with absolute-value constraints; using convex relaxation techniques, it can then be transformed into a semidefinite programming problem, which can be solved with the CVX package. In 2018, Zhan et al. [21] proposed a sparse quadratic kernel-free least squares semi-supervised support vector machine (SQLSS$^3$VM) with an $\ell_1$-norm regularization term for the binary classification problem and used the proximal alternating direction method of multipliers (P-ADMM) to solve it. Tian et al. [22] applied this model to online credit scoring and added fuzzy weights to further improve the accuracy and robustness of the classification. Gao et al. [23] proposed a new quadratic kernel-free least squares twin support vector machine (QLSTSVM) for binary classification problems; they adopted the alternating direction method of multipliers to solve it and achieved good numerical results.
We know that, to accurately characterize sparsity, it is natural to impose the $\ell_0$ norm, which is the most suitable concept for modeling sparsity. There are two main ways to deal with the $\ell_0$ norm: one is to approximate it with a convex function [24] or a nonconvex function [25], and the other is to deal with it directly [26]. The $\ell_1$ norm is the best convex approximation of the $\ell_0$ norm, but it encourages sparsity only in some cases under restrictive assumptions [27]. It has also been shown to be, in certain cases, inconsistent and biased [28]. With the in-depth study of the properties of the $\ell_0$ norm, previous applied works have shown that the $\ell_0$ norm yields better sparsity and robustness than the $\ell_1$-norm approach [29,30,31,32].
In this paper, we propose a new strong sparse quadratic kernel-free least squares semi-supervised support vector machine model with an $\ell_0$-norm regularization term, which has stronger interpretability: it finds a separating quadratic surface far away from both the labeled and unlabeled points using the least number of features. The main contributions of the paper are summarized as follows:
(I)
We propose a new model, called the strong sparse quadratic kernel-free least squares semi-supervised support vector machine (SSQLSS$^3$VM). In order to give it better sparsity and robustness, we add an $\ell_0$-norm regularization term directly to the objective rather than a convex approximation of it.
(II)
We treat the problem by nonconvex methods based on DC (difference of convex functions) programming and the DCA (DC algorithm), one of the most important approaches for nonconvex problems [33,34,35]. Accordingly, we use a nonconvex function that can be written as the difference of two convex functions to approximate the $\ell_0$ norm, and the remaining nonconvex term in the objective can also be expressed as a difference of two convex functions.
(III)
We conduct experiments with the proposed model and algorithm on several benchmark datasets to investigate their efficiency in classification.
The rest of this paper is organized as follows: DC programming and the SSQLSS$^3$VM are briefly presented in Section 2, while Section 3 is devoted to the development of the DCA with the sGS-ADMM for solving the SSQLSS$^3$VM. In Section 4, the computational experiments are reported. Finally, we conclude the paper and discuss future work in Section 5.

2. Preliminaries

In this section, we will recall some elementary concepts of the quadratic kernel-free semi-supervised support vector machine and the outline of DC programming and DCA.

2.1. Sparse Quadratic Kernel-Free Least Squares Semi-Supervised Support Vector Machine

Given a training set consisting of $s$ labeled points $\{(x_i, y_i) \in \mathbb{R}^m \times \{+1, -1\},\ i = 1, \dots, s\}$ and $p$ unlabeled points $x_i \in \mathbb{R}^m,\ i = s+1, \dots, n$ (so that $n = s + p$), our goal is to find a quadratic surface $f(x) = \frac{1}{2}x^T Q x + b^T x + c = 0$ that can directly separate the data into two classes with the largest margin, where $Q \in \mathbb{R}^{m \times m}$ is a symmetric matrix.
According to [20], the S$^3$VM problem can be written as follows:
$$
\begin{aligned}
\min_{Q,b,c,\xi_i,\eta_i,\gamma_i}\quad & \frac{1}{2n}\sum_{i=1}^{n}\|Qx_i+b\|^2 + C_1\sum_{i=1}^{s}\xi_i + C_2\sum_{i=s+1}^{n}\min(\eta_i,\gamma_i)\\
\text{s.t.}\quad & y_i\Big(\tfrac{1}{2}x_i^TQx_i + b^Tx_i + c\Big)\ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i=1,2,\dots,s,\\
& \tfrac{1}{2}x_i^TQx_i + b^Tx_i + c \ge 1-\eta_i,\ \ \eta_i\ge 0,\ \ i=s+1,s+2,\dots,n,\\
& -\Big(\tfrac{1}{2}x_i^TQx_i + b^Tx_i + c\Big) \ge 1-\gamma_i,\ \ \gamma_i\ge 0,\ \ i=s+1,s+2,\dots,n.
\end{aligned}
$$
Take the upper triangular elements of the matrix $Q$ and stack them in a vector $a$, that is,
$$
a = [a_{11}, a_{12}, \dots, a_{1m}, a_{22}, a_{23}, \dots, a_{2m}, \dots, a_{mm}]^T \in \mathbb{R}^{\frac{m^2+m}{2}}.
$$
If the data point is $x_i = (x_{i1}, \dots, x_{im})^T \in \mathbb{R}^m$, then $h_{x_i}$ is the vector
$$
h_{x_i} = \Big[\tfrac{1}{2}x_{i1}^2,\ x_{i1}x_{i2},\ \dots,\ x_{i1}x_{im},\ \tfrac{1}{2}x_{i2}^2,\ \dots,\ x_{i2}x_{im},\ \dots,\ \tfrac{1}{2}x_{im}^2\Big]^T \in \mathbb{R}^{\frac{m^2+m}{2}}.
$$
Then, we can construct an $m \times \frac{m^2+m}{2}$ matrix $H_{x_i}$ such that $H_{x_i}a = Qx_i$: in the $j$th row of $H_{x_i}$, the entry in the column corresponding to a component of $a$ of the form $a_{jk}$ or $a_{kj}$ is set to $x_{ik}$, and all other entries of that row are set to zero.
Denoting
$$
g_{x_i} = \begin{pmatrix} h_{x_i}\\ x_i\\ 1 \end{pmatrix},\qquad
w = \begin{pmatrix} a\\ b\\ c \end{pmatrix},\qquad
G = \frac{1}{n}\begin{pmatrix}
\sum_{i=1}^{n} H_{x_i}^T H_{x_i} & \sum_{i=1}^{n} H_{x_i}^T & 0\\
\sum_{i=1}^{n} H_{x_i} & nE & 0\\
0 & 0 & 0
\end{pmatrix},
$$
where $E$ denotes the $m \times m$ identity matrix, we have
$$
y_i\Big(\tfrac{1}{2}x_i^TQx_i + b^Tx_i + c\Big) = y_i\big(h_{x_i}^Ta + b^Tx_i + c\big) = y_i\big(g_{x_i}^Tw\big),
$$
$$
\tfrac{1}{2}x_i^TQx_i + b^Tx_i + c = h_{x_i}^Ta + b^Tx_i + c = g_{x_i}^Tw,
$$
and
$$
\frac{1}{2n}\sum_{i=1}^{n}\|Qx_i+b\|_2^2 = \frac{1}{2n}\sum_{i=1}^{n}\|H_{x_i}a+b\|_2^2 = \frac{1}{2}w^TGw.
$$
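To make this vectorization concrete, the following Python sketch (our own illustration, not part of the paper's MATLAB implementation; all helper names are ours) builds $a$, $h_{x_i}$, and $H_{x_i}$ for a random symmetric $Q$ and data point $x$, and numerically checks the identities $h_x^Ta = \tfrac{1}{2}x^TQx$ and $H_xa = Qx$ used above.

```python
import numpy as np

def upper_tri_vec(Q):
    """Stack the upper-triangular entries of symmetric Q row by row: a = [a11, a12, ..., amm]."""
    m = Q.shape[0]
    return np.concatenate([Q[j, j:] for j in range(m)])

def h_vec(x):
    """h_x = [x1^2/2, x1*x2, ..., x1*xm, x2^2/2, ..., xm^2/2], so that h_x^T a = x^T Q x / 2."""
    m = x.size
    rows = []
    for j in range(m):
        row = x[j] * x[j:]          # new array: [x_j^2, x_j*x_{j+1}, ..., x_j*x_m]
        row[0] = 0.5 * x[j] ** 2    # diagonal entry gets the factor 1/2
        rows.append(row)
    return np.concatenate(rows)

def H_mat(x):
    """m-by-(m^2+m)/2 matrix with H_x a = Q x: row j puts x_k in the column of a_{jk} (= a_{kj})."""
    m = x.size
    H = np.zeros((m, m * (m + 1) // 2))
    col = 0
    for j in range(m):
        for k in range(j, m):
            H[j, col] += x[k]        # coefficient of a_{jk} in (Qx)_j
            if k != j:
                H[k, col] += x[j]    # the same entry a_{jk} also feeds (Qx)_k
            col += 1
    return H

rng = np.random.default_rng(0)
m = 4
x = rng.standard_normal(m)
Q = rng.standard_normal((m, m)); Q = (Q + Q.T) / 2   # random symmetric matrix
a = upper_tri_vec(Q)

print(np.isclose(h_vec(x) @ a, 0.5 * x @ Q @ x))     # h_x^T a = x^T Q x / 2
print(np.allclose(H_mat(x) @ a, Q @ x))              # H_x a = Q x
```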
Therefore, the QS$^3$VM can be rewritten as
$$
\begin{aligned}
\min_{w,\xi_i,\eta_i,\gamma_i}\quad & \frac{1}{2}w^TGw + C_1\sum_{i=1}^{s}\xi_i + C_2\sum_{i=s+1}^{n}\min(\eta_i,\gamma_i)\\
\text{s.t.}\quad & y_i\big(g_{x_i}^Tw\big)\ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i=1,2,\dots,s,\\
& g_{x_i}^Tw\ge 1-\eta_i,\ \ \eta_i\ge 0,\ \ i=s+1,s+2,\dots,n,\\
& -g_{x_i}^Tw\ge 1-\gamma_i,\ \ \gamma_i\ge 0,\ \ i=s+1,s+2,\dots,n.
\end{aligned}
$$
To obtain sparse solutions, we add an $\ell_0$-norm regularization term to the objective function. Hence, the strong sparse quadratic kernel-free least squares semi-supervised support vector machine (SSQLSS$^3$VM) can be written as
$$
\begin{aligned}
\min_{w,\xi_i,\eta_i,\gamma_i}\quad & \lambda\|w\|_0 + \frac{1}{2}w^TGw + C_1\sum_{i=1}^{s}\xi_i + C_2\sum_{i=s+1}^{n}\min(\eta_i,\gamma_i)\\
\text{s.t.}\quad & y_i\big(g_{x_i}^Tw\big)\ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i=1,2,\dots,s,\\
& g_{x_i}^Tw\ge 1-\eta_i,\ \ \eta_i\ge 0,\ \ i=s+1,s+2,\dots,n,\\
& -g_{x_i}^Tw\ge 1-\gamma_i,\ \ \gamma_i\ge 0,\ \ i=s+1,s+2,\dots,n,
\end{aligned}
$$
where $\|w\|_0$ is defined as the number of nonzero components of $w$.

2.2. Outline of DCA

DC programming and the DCA address the minimization of a function $f$ that is the difference of two convex functions on the space $X = \mathbb{R}^n$ and its dual space $Y$. In general, a DC program is an optimization problem of the form
$$
\alpha = \inf\{f(x) := g(x) - h(x) : x \in \mathbb{R}^n\}, \qquad (P_{dc})
$$
where $g, h$ are lower semi-continuous proper convex functions on $\mathbb{R}^n$.
For a convex function $g$, its conjugate function is defined as
$$
g^*(y) = \sup\{\langle x, y\rangle - g(x) : x \in X\}.
$$
For $\epsilon > 0$ and $x_0 \in \mathrm{dom}\, g$, the symbol $\partial_\epsilon g(x_0)$ denotes the $\epsilon$-subdifferential of $g$ at $x_0$, that is,
$$
\partial_\epsilon g(x_0) = \{y \in Y : g(x) \ge g(x_0) + \langle x - x_0, y\rangle - \epsilon,\ \forall x \in X\},
$$
while $\partial g(x_0)$ stands for the usual (or exact) subdifferential of $g$ at $x_0$.
According to the subdifferential calculus of lower semi-continuous proper convex functions [36], we have
$$
y_0 \in \partial f(x_0) \iff x_0 \in \partial f^*(y_0) \iff \langle x_0, y_0\rangle = f(x_0) + f^*(y_0).
$$
On the basis of the definition of conjugate functions, we have
$$
\alpha = \inf\{f(x) = g(x) - h(x) : x \in X\}
= \inf\big\{g(x) - \sup\{\langle x, y\rangle - h^*(y) : y \in Y\} : x \in X\big\}
= \inf\{\beta(y) : y \in Y\},
$$
with
$$
\beta(y) = \inf\{g(x) - (\langle x, y\rangle - h^*(y)) : x \in X\}.
$$
Then, the following program is called the dual program of $(P_{dc})$:
$$
\alpha = \inf\{h^*(y) - g^*(y) : y \in Y\}. \qquad (D_{dc})
$$
We observe perfect symmetry between the primal and dual programs $(P_{dc})$ and $(D_{dc})$: the dual program of $(D_{dc})$ is exactly $(P_{dc})$.
Definition 1
([37]). A point $x^*$ is said to be a critical point of $g - h$ if
$$
\partial g(x^*) \cap \partial h(x^*) \neq \emptyset.
$$
Theorem 1
([38]). Let $P$ and $D$ denote the solution sets of problems $(P_{dc})$ and $(D_{dc})$, respectively. Then,
$(i)$ $x \in P$ if and only if $\partial_\epsilon h(x) \subset \partial_\epsilon g(x)$ for every $\epsilon > 0$.
$(ii)$ Dually, $y \in D$ if and only if $\partial_\epsilon g^*(y) \subset \partial_\epsilon h^*(y)$ for every $\epsilon > 0$.
$(iii)$ $\bigcup\{\partial h(x) : x \in P\} \subset D \subset \mathrm{dom}\, h^*$.
$(iv)$ $\bigcup\{\partial g^*(y) : y \in D\} \subset P \subset \mathrm{dom}\, g$.
Theorem 2
([38]). Let
$$
P_l = \{x^* \in X : \partial h(x^*) \subset \partial g(x^*)\},
$$
$$
D_l = \{y^* \in Y : \partial g^*(y^*) \subset \partial h^*(y^*)\}.
$$
Then,
$(i)$ If $x^*$ is a local minimizer of $g - h$, then $x^* \in P_l$.
$(ii)$ Let $x^*$ be a critical point of $g - h$ and $y^* \in \partial g(x^*) \cap \partial h(x^*)$. Let $U$ be a neighborhood of $x^*$ such that $U \cap \mathrm{dom}\, g \subset \mathrm{dom}\, h$. If for any $x \in U \cap \mathrm{dom}\, g$ there is $y \in \partial h(x)$ such that $h^*(y) - g^*(y) \ge h^*(y^*) - g^*(y^*)$, then $x^*$ is a local minimizer of $g - h$. More precisely,
$$
g(x) - h(x) \ge g(x^*) - h(x^*), \quad \forall x \in U \cap \mathrm{dom}\, g.
$$
The necessary local optimality condition for the (primal) DC program $(P_{dc})$ is
$$
\partial g(x^*) \cap \partial h(x^*) \neq \emptyset.
$$
According to [38], for each fixed $x^* \in X$, we solve the following optimization problem:
$$
\inf\{h^*(y) - g^*(y) : y \in \partial h(x^*)\}. \qquad (S(x^*))
$$
This is equivalent to
$$
\inf\{\langle x^*, y\rangle - g^*(y) : y \in \partial h(x^*)\}.
$$
In the same way, for each $y^* \in Y$, we define the problem
$$
\inf\{g(x) - h(x) : x \in \partial g^*(y^*)\}. \qquad (T(y^*))
$$
Similarly, it can be rewritten as
$$
\inf\{\langle x, y^*\rangle - h(x) : x \in \partial g^*(y^*)\}.
$$
Let $\mathcal{S}(x^*)$ and $\mathcal{T}(y^*)$ denote the solution sets of problems $(S(x^*))$ and $(T(y^*))$, respectively. Based on the above, we can state the DCA.
Given an initial point $x^0 \in \mathrm{dom}\, g$, we construct two sequences $\{x^k\}$ and $\{y^k\}$ defined by
$$
y^k \in \mathcal{S}(x^k); \qquad x^{k+1} \in \mathcal{T}(y^k).
$$
Namely, at the $k$-th iteration, we calculate
$$
y^k \in \partial h(x^k) = \arg\min\big\{h^*(y) - [g^*(y^{k-1}) + \langle x^k, y - y^{k-1}\rangle] : y \in Y\big\} \quad (\text{since } x^k \in \partial g^*(y^{k-1})),
$$
$$
x^{k+1} \in \partial g^*(y^k) = \arg\min\big\{g(x) - [h(x^k) + \langle y^k, x - x^k\rangle] : x \in X\big\} \quad (\text{since } y^k \in \partial h(x^k)).
$$
According to [36], the sequences $\{x^k\}$ and $\{y^k\}$ in the DCA are well defined if and only if
$$
\mathrm{dom}\,\partial g \subset \mathrm{dom}\,\partial h \quad \text{and} \quad \mathrm{dom}\,\partial h^* \subset \mathrm{dom}\,\partial g^*.
$$
The convergence properties of the DCA and its theoretical basis can be found in [37,39].
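To illustrate the simplified DCA scheme above on a toy problem (our own example, unrelated to the paper's model), the sketch below minimizes $f(x) = \frac{1}{2}\|x\|^2 - \|x\|_1$ with $g(x) = \frac{1}{2}\|x\|^2$ and $h(x) = \|x\|_1$: each iteration picks $y^k \in \partial h(x^k)$ and then minimizes the convex function $g(x) - \langle x, y^k\rangle$, which here has the closed-form solution $x^{k+1} = y^k$.

```python
import numpy as np

def dca_toy(x0, max_iter=100, tol=1e-8):
    """Simplified DCA for f(x) = 0.5*||x||^2 - ||x||_1  (g = 0.5*||.||^2, h = ||.||_1)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        y = np.sign(x)        # y^k is a subgradient of h at x^k (0 is a valid choice where x_i = 0)
        x_new = y             # argmin_x g(x) - <x, y> = y, since grad g(x) = x
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

print(dca_toy([0.3, -2.0, 0.0]))   # converges to a critical point of g - h, here [1., -1., 0.]
```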

3. SSQLSS$^3$VM by DCA

The SSQLSS$^3$VM can be rewritten as follows:
$$
\begin{aligned}
\min_{w,\xi,\eta,\gamma}\quad & \lambda\|w\|_0 + \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + C_2\langle e, \min\{\eta,\gamma\}\rangle\\
\text{s.t.}\quad & D(Aw) + \xi \ge e,\quad Bw + \eta \ge e,\quad -Bw + \gamma \ge e,\quad \xi, \eta, \gamma \ge 0,
\end{aligned}
\qquad (9)
$$
where $A$ is the matrix whose rows are $g_{x_i}^T$ for the labeled data, $D$ is the diagonal matrix whose diagonal elements are the labels, $B$ is the matrix whose rows are $g_{x_i}^T$ for the unlabeled data, and $\xi, \eta, \gamma$ are the slack variables.
We can see that the constraint set is a polyhedral convex set, denoted by $K$. Therefore, (9) can be rewritten as follows:
$$
\begin{aligned}
\min_{w,\xi,\eta,\gamma}\quad & F(w,\xi,\eta,\gamma) = \lambda\|w\|_0 + \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + C_2\langle e, \min\{\eta,\gamma\}\rangle\\
\text{s.t.}\quad & (w,\xi,\eta,\gamma) \in K.
\end{aligned}
\qquad (10)
$$
A polyhedral DC approximation of the $\ell_0$ norm is given in [40], and its practical effectiveness is verified in [41]. We now apply this approximation to problem (10).
For $x \in \mathbb{R}$, we define the function $\theta$ as follows:
$$
\theta(x) := \min\{1, \lambda_1|x|\} = 1 + \lambda_1|x| - \max\{1, \lambda_1|x|\},
$$
where $\lambda_1 > 0$ is a given parameter. Hence, $\|w\|_0$ can be approximated by $\|w\|_0 \approx \sum_{i=1}^{\frac{m^2+3m+2}{2}} \theta(w_i)$. Absorbing the regularization weight, we redefine $\lambda_1 := \lambda\lambda_1$. Therefore, $\lambda\theta$ is a DC function with the following DC decomposition: $\lambda\theta(x) = g(x) - h(x)$, where
$$
g(x) = \lambda + \lambda_1|x|, \qquad h(x) = \max\{\lambda, \lambda_1|x|\}.
$$
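A quick numerical sanity check of this capped-$\ell_1$ construction (our own illustration; the parameter values below are arbitrary):

```python
import numpy as np

lam, lam1 = 0.5, 4.0                      # example values of lambda and lambda_1 (hypothetical)
x = np.linspace(-2.0, 2.0, 2001)

theta = np.minimum(1.0, lam1 * np.abs(x))                       # capped-l1 surrogate theta(x)
dc    = 1.0 + lam1 * np.abs(x) - np.maximum(1.0, lam1 * np.abs(x))
print(np.allclose(theta, dc))                                   # min{1, t} = 1 + t - max{1, t}

lam1_new = lam * lam1                                           # absorb lambda: lambda_1 := lambda*lambda_1
g = lam + lam1_new * np.abs(x)
h = np.maximum(lam, lam1_new * np.abs(x))
print(np.allclose(lam * theta, g - h))                          # lambda*theta = g - h
```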
According to the above description, problem (10) can be approximated by
$$
\begin{aligned}
\min_{w,\xi,\eta,\gamma}\quad & F(w,\xi,\eta,\gamma) = \sum_{i=1}^{\frac{m^2+3m+2}{2}} \big(g(w_i) - h(w_i)\big) + \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + C_2\langle e, \min\{\eta,\gamma\}\rangle\\
\text{s.t.}\quad & (w,\xi,\eta,\gamma) \in K.
\end{aligned}
$$
We note that $F(w,\xi,\eta,\gamma)$ is a DC function:
$$
F(w,\xi,\eta,\gamma) = G(w,\xi,\eta,\gamma) - H(w,\xi,\eta,\gamma),
$$
where
$$
G(w,\xi,\eta,\gamma) = \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + \sum_{i=1}^{\frac{m^2+3m+2}{2}} g(w_i),
$$
and
$$
H(w,\xi,\eta,\gamma) = -C_2\langle e, \min\{\eta,\gamma\}\rangle + \sum_{i=1}^{\frac{m^2+3m+2}{2}} h(w_i).
\qquad (13)
$$
Obviously, $G$ and $H$ are convex functions. Therefore, the approximation of problem (10) has the following form:
$$
\min\{G(w,\xi,\eta,\gamma) - H(w,\xi,\eta,\gamma) : (w,\xi,\eta,\gamma) \in K\}.
\qquad (14)
$$
We can apply the DCA to problem (14) and obtain Algorithm 1 as follows.
Algorithm 1: DCA.
Step 0. Given an initial point $(w^0, \xi^0, \eta^0, \gamma^0)$, set $k \leftarrow 0$;
Step 1. Compute $(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k) \in \partial H(w^k, \xi^k, \eta^k, \gamma^k)$;
Step 2. Solve the convex program
$$
(w^{k+1}, \xi^{k+1}, \eta^{k+1}, \gamma^{k+1}) = \arg\min\Big\{\frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + \lambda_1\|w\|_1 - \big\langle(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k), (w, \xi, \eta, \gamma)\big\rangle\Big\}
$$
$$
\text{s.t.}\quad (w, \xi, \eta, \gamma) \in K;
$$
Step 3. If $\|(w^{k+1}, \xi^{k+1}, \eta^{k+1}, \gamma^{k+1}) - (w^k, \xi^k, \eta^k, \gamma^k)\| \le \epsilon$, then stop; otherwise set $k \leftarrow k+1$ and go to Step 1.
Since the computation of a subgradient of $H$ is easy, we can take $(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k) \in \partial H(w^k, \xi^k, \eta^k, \gamma^k)$ as follows (with $\bar{\xi}^k = 0$, since $H$ does not depend on $\xi$).
First, we compute $\bar{\eta}^k$ componentwise by
$$
\bar{\eta}_i^k = \begin{cases} 0 & \text{if } \eta_i^k \ge \gamma_i^k,\\ -C_2 & \text{if } \eta_i^k < \gamma_i^k.\end{cases}
$$
Then, $\bar{\gamma}^k$ can be obtained by
$$
\bar{\gamma}_i^k = \begin{cases} 0 & \text{if } \gamma_i^k \ge \eta_i^k,\\ -C_2 & \text{if } \gamma_i^k < \eta_i^k.\end{cases}
$$
Finally,
$$
\bar{w}_i^k = \begin{cases} 0 & \text{if } -\frac{1}{\lambda_1} \le w_i^k \le \frac{1}{\lambda_1},\\[2pt] \lambda & \text{if } w_i^k > \frac{1}{\lambda_1},\\[2pt] -\lambda & \text{if } w_i^k < -\frac{1}{\lambda_1}.\end{cases}
$$
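In an implementation, the subgradient step of Algorithm 1 therefore reduces to simple componentwise rules. A minimal Python sketch of the formulas above (our own illustration; variable names are ours, and the $\xi$-component is omitted because it is identically zero):

```python
import numpy as np

def subgrad_H(w, eta, gamma, C2, lam, lam1):
    """Componentwise choice of a subgradient of H at (w, xi, eta, gamma).

    Returns (w_bar, eta_bar, gamma_bar); the xi-part is identically zero
    because H does not depend on xi.
    """
    # subgradient of -C2 * <e, min(eta, gamma)>
    eta_bar = np.where(eta >= gamma, 0.0, -C2)
    gamma_bar = np.where(gamma >= eta, 0.0, -C2)
    # subgradient of the polyhedral part in w: zero on the flat piece around the origin
    w_bar = np.where(w > 1.0 / lam1, lam,
                     np.where(w < -1.0 / lam1, -lam, 0.0))
    return w_bar, eta_bar, gamma_bar

# purely illustrative call with random iterates
rng = np.random.default_rng(1)
w_bar, eta_bar, gamma_bar = subgrad_H(rng.standard_normal(5),
                                      rng.random(3), rng.random(3),
                                      C2=1.0, lam=0.5, lam1=4.0)
```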
Furthermore, we solve the subproblem in Step 2, that is,
$$
\begin{aligned}
\min\quad & \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + \lambda_1\|w\|_1 - \big\langle(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k), (w, \xi, \eta, \gamma)\big\rangle\\
\text{s.t.}\quad & D(Aw) + \xi \ge e,\quad Bw + \eta \ge e,\quad -Bw + \gamma \ge e,\quad \xi, \eta, \gamma \ge 0.
\end{aligned}
\qquad (18)
$$
We introduce relaxation variables $r_1, r_2, r_3, z$ into (18) as follows:
$$
\begin{aligned}
\min\quad & \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + \lambda_1\|z\|_1 - \big\langle(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k), (w, \xi, \eta, \gamma)\big\rangle\\
\text{s.t.}\quad & D(Aw) + \xi - r_1 = e,\quad Bw + \eta - r_2 = e,\quad -Bw + \gamma - r_3 = e,\quad w - z = 0,\\
& \xi, \eta, \gamma \ge 0,\quad r_1, r_2, r_3 \ge 0.
\end{aligned}
\qquad (19)
$$
The above optimization problem (19) can be written equivalently as the following convex program:
$$
\begin{aligned}
\min\quad & \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + \lambda_1\|z\|_1 - \big\langle(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k), (w, \xi, \eta, \gamma)\big\rangle\\
& + \delta_+(\xi) + \delta_+(\eta) + \delta_+(\gamma) + \delta_+(r_1) + \delta_+(r_2) + \delta_+(r_3)\\
\text{s.t.}\quad & D(Aw) + \xi - r_1 = e,\quad Bw + \eta - r_2 = e,\quad -Bw + \gamma - r_3 = e,\quad w - z = 0,
\end{aligned}
\qquad (20)
$$
where
$$
\delta_+(u) = \begin{cases} 0 & \text{if } u \in \mathbb{R}^n_+,\\ +\infty & \text{otherwise.}\end{cases}
$$
In (20), there are three blocks of variables, and only the first block $w$ appears in a smooth term. The second block $\{\xi, \eta, \gamma\}$ and the third block $\{r_1, r_2, r_3, z\}$ involve nonsmooth terms, so we apply the sGS-ADMM to the above problem.
Let $\sigma > 0$ be given. The augmented Lagrangian function for (20) is defined by
$$
\begin{aligned}
L_\sigma(w, \xi, \eta, \gamma, r_1, r_2, r_3, z; s_1, s_2, s_3, s_4)
= {} & \frac{1}{2}w^TGw + C_1\langle e, \xi\rangle + \lambda_1\|z\|_1 - \big\langle(\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k), (w, \xi, \eta, \gamma)\big\rangle\\
& + \delta_+(\xi) + \delta_+(\eta) + \delta_+(\gamma) + \delta_+(r_1) + \delta_+(r_2) + \delta_+(r_3)\\
& + \langle D(Aw) + \xi - r_1 - e, s_1\rangle + \langle Bw + \eta - r_2 - e, s_2\rangle\\
& + \langle -Bw + \gamma - r_3 - e, s_3\rangle + \langle w - z, s_4\rangle\\
& + \frac{\sigma}{2}\|D(Aw) + \xi - r_1 - e\|^2 + \frac{\sigma}{2}\|Bw + \eta - r_2 - e\|^2\\
& + \frac{\sigma}{2}\|{-Bw} + \gamma - r_3 - e\|^2 + \frac{\sigma}{2}\|w - z\|^2.
\end{aligned}
$$
We provide the framework of the sGS-ADMM for solving (20) in Algorithm 2 below; the computation of each step is detailed afterwards.
Algorithm 2: sGS-ADMM.
Let $\sigma > 0$ and $\tau \in (0, (1+\sqrt{5})/2)$ be given parameters. Choose $(w^0, \xi^0, \eta^0, \gamma^0, r_1^0, r_2^0, r_3^0, z^0)$ and $(s_1^0, s_2^0, s_3^0, s_4^0)$, and set $l \leftarrow 0$. Perform the $(l+1)$-th iteration as follows:
Step 1. Compute $w^{l+\frac12} = \arg\min L_\sigma(w, \xi^l, \eta^l, \gamma^l, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$;
Step 2. Compute $\xi^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi, \eta^l, \gamma^l, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$,
$\eta^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta, \gamma^l, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$,
$\gamma^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$;
Step 3. Compute $w^{l+1} = \arg\min L_\sigma(w, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2^{l+1}, r_3^{l+1}, z^{l+1}; s_1^l, s_2^l, s_3^l, s_4^l)$;
Step 4. Compute $r_1^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$,
$r_2^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$,
$r_3^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2^{l+1}, r_3, z^l; s_1^l, s_2^l, s_3^l, s_4^l)$,
$z^{l+1} = \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2^{l+1}, r_3^{l+1}, z; s_1^l, s_2^l, s_3^l, s_4^l)$;
Step 5. Compute $s_1^{l+1} = s_1^l + \tau\sigma\big(D(Aw^{l+1}) + \xi^{l+1} - r_1^{l+1} - e\big)$,
$s_2^{l+1} = s_2^l + \tau\sigma\big(Bw^{l+1} + \eta^{l+1} - r_2^{l+1} - e\big)$,
$s_3^{l+1} = s_3^l + \tau\sigma\big({-Bw^{l+1}} + \gamma^{l+1} - r_3^{l+1} - e\big)$,
$s_4^{l+1} = s_4^l + \tau\sigma\big(w^{l+1} - z^{l+1}\big)$.
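Putting the two levels together, the overall method runs Algorithm 2 inside every iteration of Algorithm 1. A structural skeleton of this nesting (our own sketch; the two problem-specific routines are passed in as callables):

```python
import numpy as np

def dca_with_sgs_admm(x0, subgrad_H, solve_subproblem, max_outer=50, tol=1e-6):
    """Outer DCA loop (Algorithm 1); solve_subproblem runs the inner sGS-ADMM (Algorithm 2).

    x0 is the concatenated iterate (w, xi, eta, gamma) stored as one NumPy vector;
    subgrad_H(x) returns a subgradient of H at x in the same layout.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_outer):
        y_bar = subgrad_H(x)                    # Step 1 of Algorithm 1
        x_new = solve_subproblem(y_bar)         # Step 2: convex program solved by sGS-ADMM
        if np.linalg.norm(x_new - x) <= tol:    # Step 3: stopping test
            return x_new
        x = x_new
    return x
```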
It is obvious that every subproblem is convex and easy to compute. We now compute the subproblems of the sGS-ADMM one by one.
First, we compute the first block, which involves $w$:
$$
\begin{aligned}
w^{l+\frac12} = {} & \arg\min L_\sigma(w, \xi^l, \eta^l, \gamma^l, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{\frac{1}{2}w^TGw - \big\langle w,\ \bar{w}^k - A^*D^*s_1^l - B^*s_2^l + B^*s_3^l - s_4^l\big\rangle\\
& + \frac{\sigma}{2}\|D(Aw) + \xi^l - r_1^l - e\|^2 + \frac{\sigma}{2}\|Bw + \eta^l - r_2^l - e\|^2\\
& + \frac{\sigma}{2}\|{-Bw} + \gamma^l - r_3^l - e\|^2 + \frac{\sigma}{2}\|w - z^l\|^2\Big\}.
\end{aligned}
\qquad (23)
$$
The problem (23) is a convex and smooth optimization problem. Based on the optimality condition of convex programming, we obtain the following result:
$$
\begin{aligned}
w^{l+\frac12} = {} & \big[G + \sigma(A^*D^*DA + 2B^*B + I)\big]^{-1}\big[\bar{w}^k - A^*D^*s_1^l - B^*s_2^l + B^*s_3^l - s_4^l\\
& - \sigma A^*D^*(\xi^l - r_1^l - e) - \sigma B^*(\eta^l - r_2^l - e) + \sigma B^*(\gamma^l - r_3^l - e) + \sigma z^l\big].
\end{aligned}
$$
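Note that the coefficient matrix $G + \sigma(A^*D^*DA + 2B^*B + I)$ in this update is the same at every sGS-ADMM iteration, so an implementation can factor it once and reuse the factorization for both $w^{l+\frac12}$ and $w^{l+1}$. A possible NumPy/SciPy sketch (our own, not the paper's code; all names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_w_solver(G, A, D, B, sigma):
    """Pre-factor M = G + sigma*(A^T D^T D A + 2 B^T B + I) and return a reusable solver.

    G is positive semidefinite and the sigma-term is positive definite
    (cf. Lemma 1, where P^T P is positive definite), so M admits a Cholesky factorization.
    """
    DA = D @ A
    M = G + sigma * (DA.T @ DA + 2.0 * B.T @ B + np.eye(G.shape[0]))
    factor = cho_factor(M)
    return lambda rhs: cho_solve(factor, rhs)

# usage inside the loop: w_half = solve(rhs_l), where rhs_l collects the multiplier
# and residual terms of the update formula above
```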
Then, we compute the subproblems in Step 2, using the result $w^{l+\frac12}$ to solve the second block, whose three parts in the variables $\xi$, $\eta$, and $\gamma$ have the same structure:
$$
\begin{aligned}
\xi^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi, \eta^l, \gamma^l, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{C_1\langle e, \xi\rangle + \delta_+(\xi) + \big\langle D(Aw^{l+\frac12}) + \xi - r_1^l - e,\ s_1^l\big\rangle + \frac{\sigma}{2}\big\|D(Aw^{l+\frac12}) + \xi - r_1^l - e\big\|^2\Big\}\\
= {} & \arg\min\Big\{\delta_+(\xi) + \frac{\sigma}{2}\Big\|\xi + DAw^{l+\frac12} - r_1^l - e + \frac{C_1e + s_1^l}{\sigma}\Big\|^2\Big\}\\
= {} & \Pi_+\Big(r_1^l + e - DAw^{l+\frac12} - \frac{C_1e + s_1^l}{\sigma}\Big).
\end{aligned}
$$
$$
\begin{aligned}
\eta^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta, \gamma^l, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{-\langle\bar{\eta}^k, \eta\rangle + \delta_+(\eta) + \big\langle Bw^{l+\frac12} + \eta - r_2^l - e,\ s_2^l\big\rangle + \frac{\sigma}{2}\big\|Bw^{l+\frac12} + \eta - r_2^l - e\big\|^2\Big\}\\
= {} & \arg\min\Big\{\delta_+(\eta) + \frac{\sigma}{2}\Big\|\eta + Bw^{l+\frac12} - r_2^l - e + \frac{s_2^l - \bar{\eta}^k}{\sigma}\Big\|^2\Big\}\\
= {} & \Pi_+\Big(r_2^l + e - Bw^{l+\frac12} - \frac{s_2^l - \bar{\eta}^k}{\sigma}\Big).
\end{aligned}
$$
$$
\begin{aligned}
\gamma^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma, r_1^l, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{-\langle\bar{\gamma}^k, \gamma\rangle + \delta_+(\gamma) + \big\langle {-Bw^{l+\frac12}} + \gamma - r_3^l - e,\ s_3^l\big\rangle + \frac{\sigma}{2}\big\|{-Bw^{l+\frac12}} + \gamma - r_3^l - e\big\|^2\Big\}\\
= {} & \arg\min\Big\{\delta_+(\gamma) + \frac{\sigma}{2}\Big\|\gamma - Bw^{l+\frac12} - r_3^l - e + \frac{s_3^l - \bar{\gamma}^k}{\sigma}\Big\|^2\Big\}\\
= {} & \Pi_+\Big(Bw^{l+\frac12} + r_3^l + e - \frac{s_3^l - \bar{\gamma}^k}{\sigma}\Big).
\end{aligned}
$$
Next, we update w again.
$$
\begin{aligned}
w^{l+1} = {} & \arg\min L_\sigma(w, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2^{l+1}, r_3^{l+1}, z^{l+1}; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{\frac{1}{2}w^TGw - \big\langle w,\ \bar{w}^k - A^*D^*s_1^l - B^*s_2^l + B^*s_3^l - s_4^l\big\rangle\\
& + \frac{\sigma}{2}\|D(Aw) + \xi^{l+1} - r_1^{l+1} - e\|^2 + \frac{\sigma}{2}\|Bw + \eta^{l+1} - r_2^{l+1} - e\|^2\\
& + \frac{\sigma}{2}\|{-Bw} + \gamma^{l+1} - r_3^{l+1} - e\|^2 + \frac{\sigma}{2}\|w - z^{l+1}\|^2\Big\}.
\end{aligned}
$$
Thus,
$$
\begin{aligned}
w^{l+1} = {} & \big[G + \sigma(A^*D^*DA + 2B^*B + I)\big]^{-1}\big[\bar{w}^k - A^*D^*s_1^l - B^*s_2^l + B^*s_3^l - s_4^l\\
& - \sigma A^*D^*(\xi^{l+1} - r_1^{l+1} - e) - \sigma B^*(\eta^{l+1} - r_2^{l+1} - e) + \sigma B^*(\gamma^{l+1} - r_3^{l+1} - e) + \sigma z^{l+1}\big].
\end{aligned}
$$
Finally, we compute the update of the relaxation variables r 1 , r 2 , r 3 , z .
$$
\begin{aligned}
r_1^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1, r_2^l, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{\delta_+(r_1) + \big\langle DAw^{l+\frac12} + \xi^{l+1} - r_1 - e,\ s_1^l\big\rangle + \frac{\sigma}{2}\big\|DAw^{l+\frac12} + \xi^{l+1} - r_1 - e\big\|^2\Big\}\\
= {} & \arg\min\Big\{\delta_+(r_1) + \frac{\sigma}{2}\Big\|DAw^{l+\frac12} + \xi^{l+1} - r_1 - e + \frac{s_1^l}{\sigma}\Big\|^2\Big\}\\
= {} & \Pi_+\Big(DAw^{l+\frac12} + \xi^{l+1} + \frac{s_1^l}{\sigma} - e\Big).
\end{aligned}
$$
$$
\begin{aligned}
r_2^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2, r_3^l, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{\delta_+(r_2) + \big\langle Bw^{l+\frac12} + \eta^{l+1} - r_2 - e,\ s_2^l\big\rangle + \frac{\sigma}{2}\big\|Bw^{l+\frac12} + \eta^{l+1} - r_2 - e\big\|^2\Big\}\\
= {} & \arg\min\Big\{\delta_+(r_2) + \frac{\sigma}{2}\Big\|Bw^{l+\frac12} + \eta^{l+1} - r_2 - e + \frac{s_2^l}{\sigma}\Big\|^2\Big\}\\
= {} & \Pi_+\Big(Bw^{l+\frac12} + \eta^{l+1} + \frac{s_2^l}{\sigma} - e\Big).
\end{aligned}
$$
$$
\begin{aligned}
r_3^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2^{l+1}, r_3, z^l; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{\delta_+(r_3) + \big\langle {-Bw^{l+\frac12}} + \gamma^{l+1} - r_3 - e,\ s_3^l\big\rangle + \frac{\sigma}{2}\big\|{-Bw^{l+\frac12}} + \gamma^{l+1} - r_3 - e\big\|^2\Big\}\\
= {} & \arg\min\Big\{\delta_+(r_3) + \frac{\sigma}{2}\Big\|{-Bw^{l+\frac12}} + \gamma^{l+1} - r_3 - e + \frac{s_3^l}{\sigma}\Big\|^2\Big\}\\
= {} & \Pi_+\Big({-Bw^{l+\frac12}} + \gamma^{l+1} + \frac{s_3^l}{\sigma} - e\Big).
\end{aligned}
$$
$$
\begin{aligned}
z^{l+1} = {} & \arg\min L_\sigma(w^{l+\frac12}, \xi^{l+1}, \eta^{l+1}, \gamma^{l+1}, r_1^{l+1}, r_2^{l+1}, r_3^{l+1}, z; s_1^l, s_2^l, s_3^l, s_4^l)\\
= {} & \arg\min\Big\{\lambda_1\|z\|_1 + \big\langle w^{l+\frac12} - z,\ s_4^l\big\rangle + \frac{\sigma}{2}\big\|w^{l+\frac12} - z\big\|^2\Big\}\\
= {} & \arg\min\Big\{\lambda_1\|z\|_1 + \frac{\sigma}{2}\Big\|w^{l+\frac12} - z + \frac{s_4^l}{\sigma}\Big\|^2\Big\}\\
= {} & \mathrm{Prox}_{\frac{\lambda_1}{\sigma}\|\cdot\|_1}\Big(w^{l+\frac12} + \frac{s_4^l}{\sigma}\Big).
\end{aligned}
$$
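All of the closed-form updates above rely on only two elementary operations: the projection $\Pi_+$ onto the nonnegative orthant and the proximal operator of the $\ell_1$ norm (componentwise soft thresholding). A minimal sketch of both (our own illustration; names are ours):

```python
import numpy as np

def proj_nonneg(v):
    """Pi_+(v): Euclidean projection onto the nonnegative orthant."""
    return np.maximum(v, 0.0)

def prox_l1(v, t):
    """prox_{t*||.||_1}(v): componentwise soft thresholding with threshold t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# e.g. the z-update reads: z_new = prox_l1(w_half + s4 / sigma, lam1 / sigma)
```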
Lemma 1.
Let
$$
P = \begin{pmatrix} DA\\ B\\ -B\\ I \end{pmatrix},\qquad
Q = \begin{pmatrix} I & 0 & 0\\ 0 & I & 0\\ 0 & 0 & I\\ 0 & 0 & 0 \end{pmatrix},\qquad
R = \begin{pmatrix} -I & 0 & 0 & 0\\ 0 & -I & 0 & 0\\ 0 & 0 & -I & 0\\ 0 & 0 & 0 & -I \end{pmatrix}
$$
denote the coefficient matrices of $w$, $(\xi, \eta, \gamma)$, and $(r_1, r_2, r_3, z)$ in the equality constraints of (20). Then, $P^TP$, $Q^TQ$, and $R^TR$ are positive definite.
Theorem 3
([42]). Suppose that the sequence $\{(w^l, \xi^l, \eta^l, \gamma^l)\}$ is generated by the sGS-ADMM. Then, it converges to a solution of (18).
The proof of Theorem 3 is similar to that in [42], and we omit it here. Before we give the convergence of the DCA, we first provide a useful lemma.
Lemma 2
([38]). Let $r$ be a proper lower semi-continuous convex function and let $\{x^k\}$ be a sequence such that
$(i)$ $x^k \to x^*$;
$(ii)$ there exists a bounded sequence $\{y^k\}$ with $y^k \in \partial r(x^k)$;
$(iii)$ $\partial r(x^*)$ is nonempty.
Then, $\lim_{k\to\infty} r(x^k) = r(x^*)$.
On the basis of the above lemma, we can obtain the following convergence theorem.
Theorem 4
(Convergence of the DCA). Suppose the sequence $\{(w^k, \xi^k, \eta^k, \gamma^k)\}$ is generated by the DCA. Then,
$(i)$ the sequence $\{F(w^k, \xi^k, \eta^k, \gamma^k)\}$ is monotonically decreasing;
$(ii)$ if $(w^*, \xi^*, \eta^*, \gamma^*)$ is an accumulation point of the sequence $\{(w^k, \xi^k, \eta^k, \gamma^k)\}$, then $(w^*, \xi^*, \eta^*, \gamma^*)$ is a critical point of $F$;
$(iii)$ if
$$
w_i^* \notin \Big\{-\frac{1}{\lambda_1},\ \frac{1}{\lambda_1}\Big\},\ \ i = 1, \dots, \frac{m^2+3m+2}{2},
\qquad \eta_i^* \neq \gamma_i^*,\ \ i = s+1, \dots, n,
$$
then $(w^*, \xi^*, \eta^*, \gamma^*)$ is a local minimizer of $F$.
Proof. 
Let $a^k = (w^k, \xi^k, \eta^k, \gamma^k)$ and $b^k = (\bar{w}^k, \bar{\xi}^k, \bar{\eta}^k, \bar{\gamma}^k)$.
$(i)$ Since $b^k \in \partial H(a^k)$, it follows that $H(a^{k+1}) \ge H(a^k) + \langle a^{k+1} - a^k, b^k\rangle$. We have
$$
(G - H)(a^{k+1}) \le G(a^{k+1}) - \langle a^{k+1} - a^k, b^k\rangle - H(a^k). \qquad (35)
$$
Likewise, $a^{k+1} \in \partial G^*(b^k)$ implies $G(a^k) \ge G(a^{k+1}) + \langle a^k - a^{k+1}, b^k\rangle$.
Therefore,
$$
G(a^{k+1}) - \langle a^{k+1} - a^k, b^k\rangle - H(a^k) \le (G - H)(a^k). \qquad (36)
$$
Finally, noting that $G(a^{k+1}) - \langle a^{k+1} - a^k, b^k\rangle - H(a^k) = (H^* - G^*)(b^k)$ (because $b^k \in \partial H(a^k)$ and $a^{k+1} \in \partial G^*(b^k)$), and combining (35) and (36), we get
$$
(G - H)(a^{k+1}) \le (H^* - G^*)(b^k) \le (G - H)(a^k).
$$
$(ii)$ We know that $H$ is a polyhedral convex function. Indeed,
$$
-C_2\langle e, \min\{\eta, \gamma\}\rangle = \max\big\{\langle -C_2(u, e-u), (\eta, \gamma)\rangle : u \in \{0,1\}^{n-s}\big\}, \qquad (38)
$$
$$
\sum_{i=1}^{\frac{m^2+3m+2}{2}} \max\{1, \lambda|w_i|\} = \max\Big\{\langle \mu, \lambda w\rangle + \langle e - |\mu|, e\rangle : \mu \in \{-1, 0, 1\}^{\frac{m^2+3m+2}{2}}\Big\}. \qquad (39)
$$
Combining (38) and (39), we have
$$
H = \max\Big\{\big\langle(\lambda\mu,\ 0,\ -C_2u,\ -C_2(e-u)),\ (w, \xi, \eta, \gamma)\big\rangle + \langle e - |\mu|, e\rangle : u \in \{0,1\}^{n-s},\ \mu \in \{-1, 0, 1\}^{\frac{m^2+3m+2}{2}}\Big\}.
$$
For simplicity, we write
$$
H = \max\{\langle\alpha_i, a\rangle - b_i : i \in I\},
$$
where $a = (w, \xi, \eta, \gamma)$ and $I = \{1, \dots, 2^{n-s} \times 3^{\frac{m^2+3m+2}{2}}\}$.
It is clear that the index set $I$ is finite. In this case, the sequence $\{b^k\}$ is discrete; that is, it takes only finitely many different values, at most $2^{n-s} \times 3^{\frac{m^2+3m+2}{2}}$.
Consider the problems
$$
\min\ G(a) - [\langle\alpha_i, a\rangle - b_i]. \qquad (P_i)
$$
Solving the subproblem in Step 2 at each iteration is equivalent to solving one of the problems $(P_i)$. Thus, the algorithm terminates after at most $2^{n-s} \times 3^{\frac{m^2+3m+2}{2}}$ steps.
Suppose $a^*$ is an accumulation point of the sequence $\{a^k\}$. For the sake of simplicity, we assume (extracting a subsequence if necessary) that
$$
\lim_{k\to\infty} a^k = a^*.
$$
According to the proof of Theorem 3, every sequence $\{a^k\}$ generated by the sGS-ADMM is bounded, so $\{a^k\}$ is bounded; consequently, the sequence $\{b^k\}$ generated from it is also bounded.
We can suppose (extracting a subsequence if necessary) that the sequence $\{b^k\}$ converges to a point $b^* \in \partial H(a^*)$, and according to $(i)$, it follows that
$$
\lim_{k\to\infty}\big\{G(a^k) + G^*(b^k) - \langle a^k, b^k\rangle\big\} = 0.
$$
Thus,
$$
\lim_{k\to\infty}\big\{G(a^k) + G^*(b^k)\big\} = \lim_{k\to\infty}\langle a^k, b^k\rangle = \langle a^*, b^*\rangle.
$$
Set $\theta(a, b) = G(a) + G^*(b)$. It is clear that $\theta$ is a proper lower semi-continuous convex function. By Lemma 2, we have
$$
\theta(a^*, b^*) \le \liminf_{k\to\infty}\theta(a^k, b^k) = \lim_{k\to\infty}\theta(a^k, b^k) = \langle a^*, b^*\rangle,
$$
that is, $G(a^*) + G^*(b^*) \le \langle a^*, b^*\rangle$; combined with the Fenchel–Young inequality, this gives $G(a^*) + G^*(b^*) = \langle a^*, b^*\rangle$. In other words, $b^* \in \partial G(a^*)$.
Thus, $b^* \in \partial G(a^*) \cap \partial H(a^*)$, and $a^*$ is a critical point of $F$.
$(iii)$ The formula (13), the second component of $F$, is a polyhedral convex function. According to the condition (41), $H$ is differentiable at $a^*$. We define $J(a^*) = \{i \in I : F(a^*) = G(a^*) - [\langle\alpha_i, a^*\rangle - b_i]\}$. There exists at least one $j \in J(a^*)$ such that the function $G(\cdot) - [\langle\alpha_j, \cdot\rangle - b_j]$ is convex and coincides with $F$ near $a^*$. Then, there is a neighborhood $U$ of $a^*$ on which $0 \in \partial F(a^*)$ implies $F(a) \ge F(a^*)$. Therefore, $a^*$ is a local minimizer of $F$. □

4. Numerical Experiments

In this section, we evaluate the proposed method. All codes are written in MATLAB, and all computations are performed on a Lenovo IdeaPad with Windows 10, an Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz, and 4 GB of memory.
In particular, we first apply the method to a simple simulated example. In this dataset, there are 100 data points in each class, 10 of which are labeled. Circles represent the positive class and triangles the negative class; solid markers represent labeled data and hollow markers represent unlabeled data. The distribution of the dataset is shown in Figure 1.
We draw the separating surface obtained by solving the SSQLSS$^3$VM for the simulated dataset in Figure 2. From Figure 2, we can conclude that our method performs well: only three negative-class points lie on the wrong side of the curved surface.
We next perform several numerical experiments on real datasets. We use 10 public datasets (Iris, Skin, Seeds, Pima, Glass, Heart, Hepatitis, Iono, Sonar, and BCI) obtained from the UCI Machine Learning Repository and the benchmark datasets. As shown in the details of the datasets in Table 1, the Iris and Seeds datasets contain multiple classes, and we select only two classes for the numerical experiments. For each dataset, we perform 100 trials, each time randomly selecting the labeled data points, while the rest are treated as unlabeled.
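Given the learned $w = (a, b, c)$, a point $x$ is classified by the sign of the quadratic decision function $f(x) = \frac{1}{2}x^TQx + b^Tx + c = g_x^Tw$, and the misclassification rate is the fraction of labels that disagree with this sign. A small sketch of this evaluation step (our own illustration; here $Q$ is assumed to have been recovered from the upper-triangular part $a$ of $w$):

```python
import numpy as np

def decision_values(X, Q, b, c):
    """f(x) = 0.5 * x^T Q x + b^T x + c, evaluated for each row x of X."""
    return 0.5 * np.einsum('ij,jk,ik->i', X, Q, X) + X @ b + c

def misclassification_rate(X, y, Q, b, c):
    """Fraction of points whose predicted sign disagrees with the true label y in {-1, +1}."""
    pred = np.sign(decision_values(X, Q, b, c))
    return np.mean(pred != y)
```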
To illustrate the effectiveness of our proposed model SSQLSS$^3$VM, as well as of the method, several state-of-the-art alternatives are selected for comparison, such as SSQSSVM, CTSVM, CutS$^3$VM, SVMlin, and LapSVM. First, the misclassification rates of the methods are reported in Table 2, with the best results shown in bold. We observe that the SSQLSS$^3$VM achieves the smallest misclassification rates on most of the datasets. In addition, due to the large size of the Skin dataset, SSQSSVM, CTSVM, CutS$^3$VM, and SVMlin run out of memory, which is a common phenomenon when solving relaxations of the S$^3$VM model; these results are reported as "-". Our method still obtains a low misclassification rate on this dataset, which indicates that it can handle large-scale datasets and again shows its efficiency. The misclassification rates on the Seeds, Glass, and Sonar datasets are only slightly higher than those of the P-ADMM method, while better results are achieved on the remaining datasets. For example, on the Heart dataset we achieve 24.40% versus 28.52%, 30.00%, 31.11%, 37.73%, and 33.33%, improvements of 4.12%, 5.60%, 6.71%, 13.33%, and 8.93%, respectively, which shows that our proposed method is effective.
The average CPU times over the 100 trials are reported in Table 3, with the best results shown in bold. We can observe that SVMlin and P-ADMM are much faster than our method. Since our method nests two algorithms, with the sGS-ADMM as the inner solver and the DCA as the outer loop, it takes longer than methods such as P-ADMM and SVMlin.
Overall, although our method is not the fastest, it has the best misclassification rate on most of the datasets, which shows the effectiveness of our model and algorithm.

5. Conclusions and Discussion

In this paper, we deal with a strong sparse quadratic kernel-free least squares semi-supervised support vector machine model obtained by adding an $\ell_0$-norm regularization term to the objective function. We use DC (difference of convex functions) programming and the DCA (DC algorithm) to solve it. Firstly, we approximate the $\ell_0$ norm by a polyhedral DC function. Secondly, when solving the subproblem, we use the sGS-ADMM due to the existence of the nonsmooth terms. Empirical numerical experiments show the efficiency of the proposed algorithm.
Since the $\ell_0$ norm is highly nonconvex and computationally NP-hard to handle, we use the difference-of-convex approach to overcome this difficulty. A next step is to treat the $\ell_0$-norm regularizer or the $\ell_0$-norm constraint directly when solving the SSQLSS$^3$VM. In addition, the optimality conditions of the model are not established here. These issues will be addressed in our future work.

Author Contributions

Data collection and analysis, J.S.; validation, J.S. and W.Q.; writing—original draft, J.S. and W.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from http://archive.ics.uci.edu/ml/index.php, accessed on 20 January 2022.

Acknowledgments

The authors would like to thank the Associate Editor and the anonymous referee for their helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yang, X.; Tan, L.; He, L. A robust least squares support vector machine for regression and classification with noise. Neurocomputing 2014, 140, 41–52. [Google Scholar] [CrossRef]
  2. Nan, S.; Sun, L.; Chen, B.; Lin, Z.; Toh, K.A. Density-dependent quantized least squares support vector machine for large data sets. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 94–106. [Google Scholar] [CrossRef]
  3. Melki, G.; Kecman, V.; Ventura, S.; Cano, A. OLLAWV: Online learning algorithm using worst-violators. Appl. Soft. Comput. 2018, 66, 384–393. [Google Scholar] [CrossRef]
  4. Sun, J.; Fujita, H.; Zheng, Y.; Ai, W. Multi-class financial distress prediction based on support vector machines integrated with the decomposition and fusion methods. Inform. Sci. 2021, 559, 153–170. [Google Scholar] [CrossRef]
  5. Forestier, G.; Wemmert, C. Semi-supervised learning using multiple clusterings with limited labeled data. Inform. Sci. 2016, 361, 48–65. [Google Scholar] [CrossRef] [Green Version]
  6. Tu, E.; Zhang, Y.; Zhu, L.; Yang, J.; Kasabov, N. A graph-based semi-supervised k nearest-neighbor method for nonlinear manifold distributed data classification. Inform. Sci. 2016, 367, 673–688. [Google Scholar] [CrossRef] [Green Version]
  7. Zhou, Z.H.; Li, M. Semisupervised regression with cotraining-style algorithms. IEEE Trans. Knowl. Data Eng. 2007, 19, 1479–1493. [Google Scholar] [CrossRef] [Green Version]
  8. Xu, S.; An, X.; Qiao, X.; Zhu, L.; Li, L. Semi-supervised least-squares support vector regression machines. J. Inf. Comput. Sci. 2011, 8, 885–892. [Google Scholar]
  9. Vapnik, V.; Sterin, A. On structural risk minimization or overall risk in a problem of pattern recognition. Autom. Rem. Contr. 1977, 10, 1495–1503. [Google Scholar]
  10. Bennett, K.; Demiriz, A. Semi-supervised support vector machines. Adv. Neural Inf. Process. Syst. 1999, 11, 368–374. [Google Scholar]
  11. Joachims, T. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, Bled, Slovenia, 27–30 June 1999; pp. 200–209. [Google Scholar]
  12. Chapelle, O.; Sindhwani, V.; Keerthi, S.S. Branch and bound for semi-supervised support vector machines. Adv. Neural Inf. Process. Syst. 2007, 19, 217–224. [Google Scholar]
  13. Hoi, S.C.; Jin, R.; Zhu, J.; Lyu, M.R. Semi-supervised SVM batch mode active learning for image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–7. [Google Scholar]
  14. Zhu, X.; Goldberg, A.B. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. 2009, 3, 1–130. [Google Scholar]
  15. Chapelle, O.; Zien, A. Semi-supervised classification by low density separation. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, Bridgetown, Barbados, 6–8 January 2005; pp. 57–64. [Google Scholar]
  16. Li, Y.F.; Kwok, J.T.; Zhou, Z.H. Semi-supervised learning using label Mean. In Proceedings of the 26th International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1–8. [Google Scholar]
  17. Liu, Y.; Xu, Z.; Li, C. Online semi-supervised support vector machine. Inform. Sci. 2018, 439, 125–141. [Google Scholar] [CrossRef]
  18. Cui, L.; Xia, Y. Semi-supervised sparse least squares support vector machine based on Mahalanobis distance. Appl. Intell. 2022. [Google Scholar] [CrossRef]
  19. Dagher, I. Quadratic kernel-free non-linear support vector machine. J. Glob. Optim. 2007, 41, 15–30. [Google Scholar]
  20. Yan, X.; Bai, Y.; Fang, S.C.; Luo, J. A kernel-free quadratic surface support vector machine for semi-supervised learning. J. Oper. Res. Soc. 2016, 67, 1001–1011. [Google Scholar] [CrossRef]
  21. Zhan, Y.; Bai, Y.; Zhang, W.; Ying, S. A P-ADMM for sparse quadratic kernel-free least squares semi-supervised support vector machine. Neurocomputing 2018, 306, 37–50. [Google Scholar] [CrossRef]
  22. Tian, Y.; Bian, B.; Tang, X.; Zhou, J. A new non-kernel quadratic surface approach for imbalanced data classification in online credit scoring. Inform. Sci. 2021, 563, 150–165. [Google Scholar] [CrossRef]
  23. Gao, Q.Q.; Bai, Y.Q.; Zhan, Y.R. Quadratic kernel-free least square twin support vector machine for binary classification problems. J. Oper. Res. Soc. China 2019, 7, 539–559. [Google Scholar] [CrossRef]
  24. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 1996, 46, 431–439. [Google Scholar] [CrossRef]
  25. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  26. Pan, L.; Xiu, N.; Fan, J. Optimality conditions for sparse nonlinear programming. Sci. China Math. 2017, 5, 5–22. [Google Scholar] [CrossRef]
  27. Gribonval, R.; Nielsen, M. Sparse representation in union of bases. IEEE Trans. Inf. Theory 2003, 49, 3320–3325. [Google Scholar] [CrossRef] [Green Version]
  28. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Ass. 2006, 101, 1418–1429. [Google Scholar] [CrossRef] [Green Version]
  29. Le Thi, H.A.; Le, H.M.; Nguyen, N.V.; Pham Dinh, T. A DC programming approach for feature selection in support vector machines learning. Adv. Data Anal. Classif. 2008, 2, 259–278. [Google Scholar] [CrossRef]
  30. Le Thi, H.A.; Le, H.M.; Pham Dinh, T. Feature selection in machine learning: An exact penalty approach using a difference of convex function algorithm. Mach. Learn. 2015, 101, 163–186. [Google Scholar] [CrossRef]
  31. Neumann, J.; Schnorr, C.; Steidl, G. Combined SVM-based feature selection and classification. Mach. Learn. 2005, 61, 129–150. [Google Scholar] [CrossRef] [Green Version]
  32. Collobert, R.; Sinz, F.; Weston, J.; Bottou, L. Large scale transductive SVMs. J. Mach. Learn. 2006, 7, 1687–1712. [Google Scholar]
  33. Le Thi, H.A.; Le, H.M.; Pham Dinh, T. Optimization based DC programming and DCA for hierarchical clustering. Eur. J. Oper. Res. 2007, 183, 1067–1085. [Google Scholar]
  34. Liu, Y.; Shen, X.; Doss, H. Multicategory learning and support vector machine: Computational tools. J. Comput. Graph. Stat. 2005, 14, 219–236. [Google Scholar] [CrossRef]
  35. Ronan, C.; Fabian, S.; Jason, W.; Le, B. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 201–208. [Google Scholar]
  36. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 2015. [Google Scholar]
  37. Tao, P.D.; Souad, E.B. Duality in dc (difference of convex functions) optimization. Subgradient methods. Trends Math. Optim. 1988, 84, 277–293. [Google Scholar]
  38. Tao, P.D.; An, L.H. Convex analysis approach to d.c. programming: Theory, algorithm and applications. Acta Math. Vietnam. 1997, 22, 289–355. [Google Scholar]
  39. Le Thi, H.A.; Pham Dinh, T. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 2005, 133, 23–46. [Google Scholar]
  40. Peleg, D.; Meir, R. A bilinear formulation for vector sparsity optimization. Signal Process. 2008, 88, 375–389. [Google Scholar] [CrossRef]
  41. Ong, C.S.; Le Thi, H.A. Learning sparse classifiers with difference of convex functions algorithms. Optim. Method. Softw. 2013, 28, 830–854. [Google Scholar] [CrossRef]
  42. Chen, L.; Sun, D.; Toh, K.C. An efficient inexact symmetric Gauss-Seidel based majorized ADMM for high-dimensional convex composite conic programming. Math. Program. 2017, 161, 1–34. [Google Scholar] [CrossRef] [Green Version]
Figure 1. The distribution of the simulated data.
Figure 2. Decision function for the simulated data.
Table 1. Details of the datasets.

Data Set      Features   Total Number   Labeled Number
Iris          4          100            10
Skin          4          245,057        24,505
Seeds         7          140            14
Pima          8          768            40
Glass         10         146            15
Heart         12         270            27
Hepatitis     19         80             8
Iono          34         351            20
Sonar         60         208            20
BCI           117        400            40
Table 2. Misclassification rates (%) for the datasets.

Data Set      P-ADMM   SSQSSVM   CTSVM   CutS3VM   SVMlin   DCA
2circles      2.50     4.50      5.00    20.00     36.50    2.30
Hyperbola     1.50     10.00     11.67   13.43     45.50    1.20
Iris          0.00     4.53      5.00    8.50      0.00     0.00
Skin          8.60     -         -       -         -        5.30
Seeds         5.00     7.86      6.43    8.33      7.14     5.20
Pima          29.81    -         31.64   31.24     60.87    27.70
Glass         0.00     2.05      2.05    10.81     3.64     0.60
Heart         28.52    30.00     31.11   37.73     33.33    24.40
Hepatitis     23.75    33.75     31.25   43.21     50.00    21.40
Iono          11.40    -         9.97    21.65     31.13    11.20
Sonar         23.56    -         25.96   36.54     46.7     23.78
BCI           42.00    -         35.25   53.10     64.17    41.60

Remark: - indicates that the computer ran out of memory.
Table 3. CPU time (s) for the datasets.

Data Set      P-ADMM    SSQSSVM   CTSVM   CutS3VM   SVMlin   DCA
2circles      2.41      137.46    5.27    7.05      0.15     3.47
Hyperbola     2.36      137.60    5.16    6.45      0.15     4.34
Iris          1.38      19.40     2.97    2.28      0.10     3.24
Skin          2907.80   -         -       -         -        3457.42
Seeds         2.31      92.59     4.13    3.00      0.12     3.45
Pima          11.89     -         51.55   26.90     2.08     15.67
Glass         3.02      209.49    2.29    0.97      0.11     5.34
Heart         5.19      1902.90   8.22    2.57      0.29     8.96
Hepatitis     3.59      548.57    1.01    1.14      0.11     6.73
Iono          36.33     -         11.12   6.35      0.50     50.24
Sonar         12.30     -         5.48    1.15      0.54     16.58
BCI           405.89    -         12.91   9.50      4.25     526.73

Remark: - indicates that the computer ran out of memory.