Article

A New Machine Learning Algorithm Based on Optimization Method for Regression and Classification Problems

Data Science Research Center, Department of Mathematics, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand
*
Author to whom correspondence should be addressed.
Mathematics 2020, 8(6), 1007; https://doi.org/10.3390/math8061007
Submission received: 18 May 2020 / Revised: 11 June 2020 / Accepted: 14 June 2020 / Published: 19 June 2020

Abstract

A convex minimization problem in the form of the sum of two proper lower-semicontinuous convex functions has received much attention from the optimization community due to its broad applications in many disciplines, such as machine learning, regression and classification problems, image and signal processing, compressed sensing and optimal control. Many methods have been proposed to solve such problems, but most of them rely on the assumption that the derivative of one of the two functions is Lipschitz continuous. In this work, we introduce a new accelerated algorithm for solving the mentioned convex minimization problem by combining a linesearch technique with a viscosity inertial forward–backward algorithm (VIFBA). A strong convergence result for the proposed method is obtained under some control conditions. As applications, we apply our proposed method to solve regression and classification problems by using an extreme learning machine model. Moreover, we show that our proposed algorithm is more efficient and has better convergence behavior than some algorithms mentioned in the literature.

1. Introduction

In this work, we are dealing with a convex minimization problem, which can be formulated as
$\min_{x \in H} \{ f(x) + g(x) \}, \qquad (1)$
where $f, g : H \to \mathbb{R} \cup \{+\infty\}$ are proper lower-semicontinuous convex functions and $H$ is a Hilbert space. Many real world problems, such as signal processing, image reconstruction and compressed sensing, can be described using this model [1,2,3,4]. Moreover, data classification can also be formulated as (1); for more information about the importance and development of data classification and its methods see [5,6,7,8]. Therefore, a convex minimization problem has a wide range of applications, some of which will be studied in this research.
If $f$ is differentiable, then it is well known that an element $x \in H$ is a solution of (1) if and only if
$x = \mathrm{prox}_{\alpha g}(I - \alpha \nabla f)(x), \qquad (2)$
where $\alpha > 0$, $\mathrm{prox}_{\alpha g}(x) = J_{\alpha \partial g}(x) = (I + \alpha \partial g)^{-1}(x)$, $I$ is the identity mapping, and $\partial g$ is the subdifferential of $g$. In addition, if $\nabla f$ is $L$-Lipschitz continuous, then the classical forward–backward algorithm [9] can be used to solve (1). It is defined as follows:
$x_{n+1} = \mathrm{prox}_{\alpha_n g}(I - \alpha_n \nabla f)(x_n), \qquad (3)$
where $\alpha_n$ is a suitable stepsize. This method has been used extensively due to its simplicity; as a result, it has been improved in many works, as seen in [2,10,11,12]. One well-known method that has improved the convergence rate of (3) significantly is the fast iterative shrinkage-thresholding algorithm, or FISTA. It was proposed by Beck and Teboulle [13], as seen in Algorithm 1.
Algorithm 1. FISTA.
 1: Input $x_1 = y_0 \in H$, $t_1 = 1$, $L > 0$, $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:   $y_n = \mathrm{prox}_{\frac{1}{L}g}\bigl(x_n - \frac{1}{L}\nabla f(x_n)\bigr)$,
 4:   $t_{n+1} = \frac{1 + \sqrt{1 + 4t_n^2}}{2}$,
 5:   $\theta_n = \frac{t_n - 1}{t_{n+1}}$,
 6:   $x_{n+1} = y_n + \theta_n(y_n - y_{n-1})$.
 7: end for
 8: return $x_{n+1}$
They proved that FISTA has a better convergence rate than (3); however, a convergence theorem for this method was not given. Recently, Liang and Schonlieb [14] modified FISTA by setting $t_{n+1} = \frac{p + \sqrt{q + r t_n^2}}{2}$, where $p, q > 0$ and $0 < r \le 4$, and proved its weak convergence theorem.
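To make the update rule concrete, the following is a minimal NumPy sketch of Algorithm 1; the gradient of $f$, the proximal operator of $g$, and the Lipschitz constant $L$ are supplied by the caller (these helpers are placeholders for illustration, not part of the paper):

```python
import numpy as np

def fista(grad_f, prox_g, L, x1, iters):
    """Minimal sketch of FISTA (Algorithm 1).

    grad_f : callable, x -> gradient of f at x
    prox_g : callable, (v, alpha) -> prox_{alpha*g}(v)
    L      : Lipschitz constant of grad_f
    x1     : starting point (NumPy array), with y_0 = x_1
    """
    x, y_prev, t = x1.copy(), x1.copy(), 1.0
    for _ in range(iters):
        y = prox_g(x - grad_f(x) / L, 1.0 / L)            # forward-backward step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        theta = (t - 1.0) / t_next                        # momentum weight
        x = y + theta * (y - y_prev)                      # extrapolation step
        y_prev, t = y, t_next
    return x
```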
In the case that $H$ is an infinite-dimensional Hilbert space, weak convergence results may not be enough; consequently, modifications of some algorithms are needed to obtain strong convergence results. There are several ways to modify the methods for this purpose; for more information see [15,16,17,18]. One method that caught our attention was the viscosity-based inertial forward–backward algorithm (VIFBA) proposed by Verma et al. [19], as seen in Algorithm 2.
Algorithm 2. VIFBA.
 1: Input $x_0, x_1 \in H$, $L > 0$, $\beta_n \ge 0$, $\alpha_n \in (0, 1)$, $\gamma_n \in (0, \frac{2}{L})$, $c$-contractive mapping $F$,
      $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:   $y_n = x_n + \beta_n(x_n - x_{n-1})$,
 4:   $z_n = \alpha_n F(y_n) + (1 - \alpha_n)y_n$,
 5:   $x_{n+1} = \mathrm{prox}_{\gamma_n g}(z_n - \gamma_n\nabla f(z_n))$.
 6: end for
 7: return $x_{n+1}$
They proved a strong convergence result for this algorithm if the following conditions are satisfied for all $n \in \mathbb{N}$:
A1.
$\lim_{n \to \infty} \alpha_n = 0$ and $\sum_{n=1}^{\infty} \alpha_n = \infty$,
A2.
$\gamma_n \in (0, \frac{2}{L})$ and $\sum_{n=1}^{\infty} |\gamma_n - \gamma_{n+1}| < \infty$,
A3.
$\sum_{n=1}^{\infty} \|x_{n+1} - x_n\|^2 < \infty$ and $\lim_{n \to \infty} \frac{\beta_n \|x_n - x_{n-1}\|}{\alpha_n} = 0$.
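For illustration, a minimal NumPy sketch of VIFBA (Algorithm 2) is given below; the contraction $F$, the oracles for $\nabla f$ and $\mathrm{prox}_{\gamma g}$, and the parameter sequences are caller-supplied placeholders assumed to satisfy A1–A3:

```python
import numpy as np

def vifba(grad_f, prox_g, F, x0, x1, alpha, beta, gamma, iters):
    """Minimal sketch of VIFBA (Algorithm 2).

    F                  : c-contractive mapping (callable)
    alpha, beta, gamma : callables n -> alpha_n in (0,1), beta_n >= 0, gamma_n in (0, 2/L)
    """
    x_prev, x = x0.copy(), x1.copy()
    for n in range(1, iters + 1):
        y = x + beta(n) * (x - x_prev)                    # inertial step
        z = alpha(n) * F(y) + (1.0 - alpha(n)) * y        # viscosity step
        x_prev, x = x, prox_g(z - gamma(n) * grad_f(z), gamma(n))  # forward-backward step
    return x
```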
Note that all the methods mentioned above require $\nabla f$ to be $L$-Lipschitz continuous, an assumption that is often difficult to verify in general. Therefore, some improvements are still desirable.
Very recently, Cruz and Nghia [20] proposed a linesearch technique that can be used to eliminate the $L$-Lipschitz continuity assumption on $\nabla f$ and replace it with weaker assumptions. In their work, the following conditions are needed instead:
B1.
$f, g$ are proper lower semicontinuous convex functions with $\operatorname{dom} g = \operatorname{dom} f$,
B2.
$f$ is differentiable on an open set containing $\operatorname{dom} g$, and $\nabla f$ is uniformly continuous on any bounded subset of $\operatorname{dom} g$ and maps any bounded subset of $\operatorname{dom} g$ to a bounded set in $H$.
The linesearch step is defined in Algorithm 3 as follows.
Algorithm 3. Linesearch 1.
 1: Input $x \in \operatorname{dom} g$, $\sigma > 0$, $\theta \in (0, 1)$, and $\delta \in (0, \frac{1}{2})$.
 2: Set $\alpha = \sigma$.
 3: while $\alpha\,\|\nabla f(\mathrm{prox}_{\alpha g}(x - \alpha\nabla f(x))) - \nabla f(x)\| > \delta\,\|\mathrm{prox}_{\alpha g}(x - \alpha\nabla f(x)) - x\|$ do
 4:   $\alpha = \theta\alpha$.
 5: end while
 6: return $\alpha$
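A minimal Python sketch of Linesearch 1, with $\nabla f$ and $\mathrm{prox}_{\alpha g}$ again supplied by the caller (placeholders for illustration):

```python
import numpy as np

def linesearch1(grad_f, prox_g, x, sigma, theta, delta):
    """Minimal sketch of Linesearch 1 (Algorithm 3).

    Backtracks alpha from sigma until
        alpha * ||grad_f(J(x, alpha)) - grad_f(x)|| <= delta * ||J(x, alpha) - x||,
    where J(x, alpha) = prox_{alpha*g}(x - alpha*grad_f(x)).
    """
    alpha, gx = sigma, grad_f(x)
    while True:
        J = prox_g(x - alpha * gx, alpha)
        if alpha * np.linalg.norm(grad_f(J) - gx) <= delta * np.linalg.norm(J - x):
            return alpha
        alpha *= theta    # shrink the stepsize and try again
```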
They also proved that Linesearch 1 stops after finitely many steps, and proposed Algorithm 4 as follows.
Algorithm 4.
 1: Input $x_1 \in \operatorname{dom} g$, $\sigma > 0$, $\theta \in (0, 1)$, $\delta \in (0, \frac{1}{2})$, $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:   $\gamma_n = \text{Linesearch 1}(x_n, \sigma, \theta, \delta)$,
 4:   $x_{n+1} = \mathrm{prox}_{\gamma_n g}(I - \gamma_n \nabla f)(x_n)$.
 5: end for
 6: return $x_{n+1}$
They also proved its weak convergence theorem. Again, weak convergence may not be enough in the context of an infinite-dimensional space.
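A minimal sketch of Algorithm 4 is given below; it takes a `linesearch` callable, for which the `linesearch1` function sketched after Algorithm 3 can be passed in (an assumption of this illustration):

```python
def fb_linesearch(grad_f, prox_g, x1, sigma, theta, delta, iters, linesearch):
    """Minimal sketch of Algorithm 4: a forward-backward step whose stepsize
    gamma_n is chosen by a linesearch at every iteration, so no Lipschitz
    constant of grad_f is needed."""
    x = x1.copy()
    for _ in range(iters):
        gamma = linesearch(grad_f, prox_g, x, sigma, theta, delta)
        x = prox_g(x - gamma * grad_f(x), gamma)
    return x
```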
As we know, most of the work related to convex minimization problems assumes the $L$-Lipschitz continuity of $\nabla f$. This restriction can be relaxed using a linesearch technique. So, we are motivated to establish a novel accelerated algorithm for solving the convex minimization problem (1), which employs the linesearch technique introduced by Cruz and Nghia [20] together with VIFBA [19]. The novelty of our proposed method is a suitable combination of the two methods to obtain a fast and efficient method for solving (1). We improve Algorithm 4 by adding an inertial step, which enhances the performance of the algorithm. We also prove its strong convergence theorem under weaker assumptions on the control conditions than those of VIFBA. More precisely, we can eliminate assumption A2 and replace A3 with a weaker assumption. As applications, we apply our main result to solve a data classification problem and a regression of a sine function. Then we compare the performance of our algorithm with FISTA, VIFBA, and Algorithm 4.
This work is organized as follows: In Section 2, we recall some useful concepts related to the topic. In Section 3, we provide a new algorithm and prove its strong convergence to a solution of (1). In Section 4, we conduct some numerical experiments with a data classification problem and a regression of a sine function and compare the performance of each algorithm (FISTA, VIFBA, Algorithms 4 and 5). Finally, the conclusion of this work is in Section 5.

2. Preliminaries

In this section, we review some important tools which will be used in the later sections. Throughout this paper, we write $x_n \to x$ and $x_n \rightharpoonup x$ for strong and weak convergence of $\{x_n\}$ to $x$, respectively.
A mapping $T : H \to H$ is said to be $L$-Lipschitz continuous if there exists $L > 0$ such that
$\|Tx - Ty\| \le L\|x - y\|, \quad \text{for all } x, y \in H.$
For $x \in H$, the subdifferential of $h$ at $x$ is defined as follows:
$\partial h(x) := \{ u \in H : \langle u, y - x \rangle + h(x) \le h(y), \ \forall y \in H \}.$
It is known from [21] that the subdifferential $\partial h$ is maximal monotone. Moreover, the graph of $\partial h$, $\operatorname{Gph}(\partial h) := \{ (u, v) \in H \times H : v \in \partial h(u) \}$, is demiclosed, i.e., for any sequence $(u_n, v_n) \in \operatorname{Gph}(\partial h)$ such that $\{u_n\}$ converges weakly to $u$ and $\{v_n\}$ converges strongly to $v$, we have $(u, v) \in \operatorname{Gph}(\partial h)$.
The proximal operator, $\mathrm{prox}_g : H \to \operatorname{dom} g$ with $\mathrm{prox}_g(x) = (I + \partial g)^{-1}(x)$, is single-valued with full domain. Moreover, the following is satisfied:
$\frac{x - \mathrm{prox}_{\alpha g} x}{\alpha} \in \partial g(\mathrm{prox}_{\alpha g} x), \quad \text{for all } x \in H \text{ and } \alpha > 0. \qquad (4)$
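As a concrete one-dimensional instance (a worked example of ours, not stated in the text, but standard and relevant to the LASSO experiments in Section 4): for $g(t) = \lambda |t|$ with $\lambda > 0$, the inclusion (4) yields the soft-thresholding formula
$\mathrm{prox}_{\alpha g}(x) = \operatorname{sign}(x)\max\{|x| - \alpha\lambda,\, 0\}.$
Indeed, if $|x| > \alpha\lambda$ then $\mathrm{prox}_{\alpha g}(x) = x - \alpha\lambda\operatorname{sign}(x)$ and $\frac{x - \mathrm{prox}_{\alpha g}x}{\alpha} = \lambda\operatorname{sign}(x) = g'(\mathrm{prox}_{\alpha g}x)$, while if $|x| \le \alpha\lambda$ then $\mathrm{prox}_{\alpha g}(x) = 0$ and $\frac{x - 0}{\alpha} \in [-\lambda, \lambda] = \partial g(0)$.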
The following lemmas are crucial for the main results.
Lemma 1
([22]). Let $f, g : H \to \mathbb{R}$ be two proper lower semicontinuous convex functions with $\operatorname{dom} g \subseteq \operatorname{dom} f$ and $J(x, \alpha) = \mathrm{prox}_{\alpha g}(x - \alpha\nabla f(x))$. Then for any $x \in \operatorname{dom} g$ and $\alpha_2 \ge \alpha_1 > 0$, we have
$\frac{\alpha_2}{\alpha_1}\|x - J(x, \alpha_1)\| \ge \|x - J(x, \alpha_2)\| \ge \|x - J(x, \alpha_1)\|.$
Lemma 2
([23]). Let $H$ be a real Hilbert space. Then the following hold for all $x, y \in H$ and $\alpha \in [0, 1]$:
(i) 
$\|x \pm y\|^2 = \|x\|^2 \pm 2\langle x, y\rangle + \|y\|^2$,
(ii) 
$\|\alpha x + (1 - \alpha)y\|^2 = \alpha\|x\|^2 + (1 - \alpha)\|y\|^2 - \alpha(1 - \alpha)\|x - y\|^2$,
(iii) 
$\|x + y\|^2 \le \|x\|^2 + 2\langle y, x + y\rangle$.
Lemma 3
([24]). Let $\{a_n\}$ be a sequence of real numbers such that there exists a subsequence $\{a_{m_j}\}$ of $\{a_n\}$ with $a_{m_j} < a_{m_j + 1}$ for all $j \in \mathbb{N}$. Then there exists a nondecreasing sequence $\{n_k\}$ in $\mathbb{N}$ such that $\lim_{k \to \infty} n_k = \infty$ and, for all sufficiently large $k \in \mathbb{N}$, the following hold:
$a_{n_k} \le a_{n_k + 1} \quad \text{and} \quad a_k \le a_{n_k + 1}.$
Lemma 4
([25]). Let $\{a_n\}$ be a sequence of nonnegative real numbers, $\{\alpha_n\}$ a sequence in $(0, 1)$ with $\sum_{n=1}^{\infty}\alpha_n = \infty$, $\{b_n\}$ a sequence of nonnegative real numbers with $\sum_{n=1}^{\infty} b_n < \infty$, and $\{\zeta_n\}$ a sequence of real numbers with $\limsup_{n\to\infty}\zeta_n \le 0$. Suppose that
$a_{n+1} \le (1 - \alpha_n)a_n + \alpha_n\zeta_n + b_n$
for all $n \in \mathbb{N}$. Then $\lim_{n\to\infty} a_n = 0$.

3. Main Results

In this section, we assume the existence of a solution of (1) and denote by $S_*$ the set of all such solutions. It is known that $S_*$ is closed and convex. We propose a new algorithm by combining a linesearch technique (Linesearch 1) with VIFBA, as seen in Algorithm 5. A diagram of this algorithm can be seen in Figure 1.
Algorithm 5.
 1: Input $x_0, x_1 \in H$, $\sigma > 0$, $\theta \in (0, 1)$, $\delta \in (0, \frac{1}{2})$, $\beta_n \ge 0$, $\alpha_n \in (0, 1)$, $c$-contractive mapping
      $F$, $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:   $y_n = x_n + \beta_n(x_n - x_{n-1})$,
 4:   $z_n = \alpha_n F(y_n) + (1 - \alpha_n)y_n$,
 5:   $\gamma_n = \text{Linesearch 1}(z_n, \sigma, \theta, \delta)$,
 6:   $x_{n+1} = \mathrm{prox}_{\gamma_n g}(z_n - \gamma_n\nabla f(z_n))$.
 7: end for
 8: return $x_{n+1}$
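For readers who prefer code, the following NumPy sketch transcribes the steps of Algorithm 5. The oracles grad_f and prox_g, the contraction F, the sequences alpha_n and beta_n, and the linesearch callable (for which the linesearch1 sketch in Section 1 can be used) are caller-supplied placeholders:

```python
import numpy as np

def algorithm5(grad_f, prox_g, F, x0, x1, alpha, beta,
               sigma, theta, delta, iters, linesearch):
    """Sketch of Algorithm 5: inertial and viscosity steps followed by a
    forward-backward step whose stepsize comes from Linesearch 1.

    alpha, beta : callables n -> alpha_n in (0,1), beta_n >= 0
    linesearch  : callable with the signature of the linesearch1 sketch
    """
    x_prev, x = x0.copy(), x1.copy()
    for n in range(1, iters + 1):
        y = x + beta(n) * (x - x_prev)                       # inertial step
        z = alpha(n) * F(y) + (1.0 - alpha(n)) * y           # viscosity step
        gamma = linesearch(grad_f, prox_g, z, sigma, theta, delta)
        x_prev, x = x, prox_g(z - gamma * grad_f(z), gamma)  # forward-backward step
    return x
```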
We prove a strong convergence result of Algorithm 5 in Theorem 1 as follows.
Theorem 1.
Let $H$ be a Hilbert space, $g : H \to \mathbb{R} \cup \{+\infty\}$ a proper lower-semicontinuous convex function, and $f : H \to \mathbb{R} \cup \{+\infty\}$ a proper convex differentiable function with $\nabla f$ uniformly continuous on any bounded subset of $H$. Suppose the following hold:
C1. 
$\lim_{n\to\infty} \alpha_n = 0$, $\sum_{n=1}^{\infty} \alpha_n = \infty$,
C2. 
$\lim_{n\to\infty} \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| = 0$.
Then the sequence $\{x_n\}$ generated by Algorithm 5 converges strongly to $x^* = P_{S_*}F(x^*)$.
Proof. 
Since $S_*$ is closed and convex, the mapping $P_{S_*}F$ has a fixed point. Let $x^* = P_{S_*}F(x^*)$. By the definition of $x_n$, $y_n$ and $z_n$, we obtain the following for all $n \in \mathbb{N}$:
$\|x_n - y_n\| = \beta_n\|x_n - x_{n-1}\| = \alpha_n\,\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| \to 0 \ \text{as } n \to \infty, \qquad (5)$
$\|z_n - y_n\| = \alpha_n\|F(y_n) - y_n\|, \ \text{and} \qquad (6)$
$\|y_n - x^*\| \le \|x_n - x^*\| + \alpha_n\,\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| \le \|x_n - x^*\| + \alpha_n M_1 \qquad (7)$
for some $M_1 \ge \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|$. The following also holds:
$\|y_n - x^*\|^2 \le \|x_n - x^*\|^2 + 2\beta_n\|x_n - x^*\|\,\|x_n - x_{n-1}\| + \beta_n^2\|x_n - x_{n-1}\|^2 = \|x_n - x^*\|^2 + \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\bigl(2\alpha_n\|x_n - x^*\| + \alpha_n\beta_n\|x_n - x_{n-1}\|\bigr), \quad \forall n \in \mathbb{N}. \qquad (8)$
Next, we prove the following:
$\|z_n - x^*\|^2 - \|x_{n+1} - x^*\|^2 \ge (1 - 2\delta)\|x_{n+1} - z_n\|^2. \qquad (9)$
Indeed, from (4) and the definition of $\partial g$, we obtain
$\frac{z_n - x_{n+1}}{\gamma_n} - \nabla f(z_n) \in \partial g(x_{n+1}),$ therefore
$g(x^*) - g(x_{n+1}) \ge \Bigl\langle \frac{z_n - x_{n+1}}{\gamma_n} - \nabla f(z_n),\, x^* - x_{n+1} \Bigr\rangle,$ also
$f(x^*) - f(z_n) \ge \langle \nabla f(z_n), x^* - z_n \rangle,$ and
$f(z_n) - f(x_{n+1}) \ge \langle \nabla f(x_{n+1}), z_n - x_{n+1} \rangle.$
Hence, by using the above inequalities and the definition of $\gamma_n$, we have
$f(x^*) - f(z_n) + g(x^*) - g(x_{n+1}) \ge \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1}\rangle + \langle \nabla f(z_n), x_{n+1} - z_n\rangle$
$= \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1}\rangle + \langle \nabla f(z_n) - \nabla f(x_{n+1}), x_{n+1} - z_n\rangle + \langle \nabla f(x_{n+1}), x_{n+1} - z_n\rangle$
$\ge \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1}\rangle - \|\nabla f(z_n) - \nabla f(x_{n+1})\|\,\|x_{n+1} - z_n\| + \langle \nabla f(x_{n+1}), x_{n+1} - z_n\rangle$
$\ge \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1}\rangle - \frac{\delta}{\gamma_n}\|x_{n+1} - z_n\|^2 + f(x_{n+1}) - f(z_n).$
Consequently,
$\frac{1}{\gamma_n}\langle z_n - x_{n+1}, x_{n+1} - x^*\rangle \ge (f + g)(x_{n+1}) - (f + g)(x^*) - \frac{\delta}{\gamma_n}\|x_{n+1} - z_n\|^2.$
Since $\langle z_n - x_{n+1}, x_{n+1} - x^*\rangle = \frac{1}{2}\bigl(\|z_n - x^*\|^2 - \|z_n - x_{n+1}\|^2 - \|x_{n+1} - x^*\|^2\bigr)$,
$\frac{1}{2\gamma_n}\bigl(\|z_n - x^*\|^2 - \|z_n - x_{n+1}\|^2 - \|x_{n+1} - x^*\|^2\bigr) \ge (f + g)(x_{n+1}) - (f + g)(x^*) - \frac{\delta}{\gamma_n}\|x_{n+1} - z_n\|^2.$
Furthermore, it follows from $x^* \in S_*$ that
$\|z_n - x^*\|^2 - \|x_{n+1} - x^*\|^2 \ge 2\gamma_n\bigl[(f + g)(x_{n+1}) - (f + g)(x^*)\bigr] + (1 - 2\delta)\|x_{n+1} - z_n\|^2 \ge (1 - 2\delta)\|x_{n+1} - z_n\|^2.$
Next, we show that $\{x_n\}$ is bounded. Indeed, from (7) and (9), we obtain
$\|x_{n+1} - x^*\| \le \|z_n - x^*\| = \|\alpha_n F(y_n) + (1 - \alpha_n)y_n - x^*\|$
$\le \alpha_n\|F(y_n) - F(x^*)\| + \alpha_n\|F(x^*) - x^*\| + (1 - \alpha_n)\|y_n - x^*\|$
$\le (1 - (1 - c)\alpha_n)\|y_n - x^*\| + \alpha_n\|F(x^*) - x^*\|$
$\le (1 - (1 - c)\alpha_n)\|x_n - x^*\| + (1 - (1 - c)\alpha_n)\alpha_n M_1 + \alpha_n\|F(x^*) - x^*\|$
$\le (1 - (1 - c)\alpha_n)\|x_n - x^*\| + \alpha_n(1 - c)\,\frac{M_1 + \|F(x^*) - x^*\|}{1 - c}.$
Inductively, we have $\|x_{n+1} - x^*\| \le \max\bigl\{\|x_0 - x^*\|, \frac{M_1 + \|F(x^*) - x^*\|}{1 - c}\bigr\}$, and hence $\{x_n\}$ is bounded. Furthermore, by using (5) and (6), $\{y_n\}$ and $\{z_n\}$ are also bounded. To show the convergence of $\{x_n\}$, we divide the proof into two cases.
Case 1. There exists $N_0 \in \mathbb{N}$ such that $\|x_{n+1} - x^*\| \le \|x_n - x^*\|$ for all $n \ge N_0$. So $\lim_{n\to\infty}\|x_n - x^*\| = a$ for some $a \in \mathbb{R}$. From (5) and (6), and the fact that
$\|z_n - x^*\| \le \|z_n - y_n\| + \|y_n - x_n\| + \|x_n - x^*\|,$
we have $\lim_{n\to\infty}\|z_n - x^*\| = a$. Using (9), we have $\lim_{n\to\infty}\|x_{n+1} - z_n\| = 0$. Since $\{z_n\}$ is bounded, there exists a subsequence $\{z_{n_k}\}$ of $\{z_n\}$ such that $z_{n_k} \rightharpoonup w$ for some $w \in H$, and the following holds:
$\limsup_{n\to\infty}\langle F(x^*) - x^*, z_n - x^*\rangle = \lim_{k\to\infty}\langle F(x^*) - x^*, z_{n_k} - x^*\rangle = \langle F(x^*) - x^*, w - x^*\rangle.$
We claim that $w \in S_*$. In order to prove this, we need to consider two cases for $\{z_{n_k}\}$. In the first case, $\gamma_{n_k} < \sigma$ for only finitely many $k$. Then, without loss of generality, we can assume that $\gamma_{n_k} = \sigma$ for all $k \in \mathbb{N}$. From the definition of $\gamma_{n_k}$, we have
$\|\nabla f(x_{n_k+1}) - \nabla f(z_{n_k})\| \le \frac{\delta}{\sigma}\|x_{n_k+1} - z_{n_k}\|.$
The uniform continuity of $\nabla f$ implies that $\lim_{k\to\infty}\|\nabla f(x_{n_k+1}) - \nabla f(z_{n_k})\| = 0$. We know that
$\frac{z_{n_k} - x_{n_k+1}}{\gamma_{n_k}} - \nabla f(z_{n_k}) + \nabla f(x_{n_k+1}) \in \partial g(x_{n_k+1}) + \nabla f(x_{n_k+1}) = \partial(f + g)(x_{n_k+1}).$
Since $\operatorname{Gph}(\partial(f + g))$ is demiclosed, we obtain $0 \in \partial(f + g)(w)$, and hence $w \in S_*$.
In the second case, there exists a subsequence $\{z_{n_{k_j}}\}$ of $\{z_{n_k}\}$ such that $\gamma_{n_{k_j}} \le \sigma\theta$ for all $j \in \mathbb{N}$. Let $\hat{\gamma}_{n_{k_j}} = \frac{\gamma_{n_{k_j}}}{\theta}$ and $\hat{x}_{n_{k_j}} = \mathrm{prox}_{\hat{\gamma}_{n_{k_j}} g}\bigl(z_{n_{k_j}} - \hat{\gamma}_{n_{k_j}}\nabla f(z_{n_{k_j}})\bigr)$. From the definition of $\gamma_{n_{k_j}}$, we have
$\|\nabla f(\hat{x}_{n_{k_j}}) - \nabla f(z_{n_{k_j}})\| > \frac{\delta}{\hat{\gamma}_{n_{k_j}}}\|\hat{x}_{n_{k_j}} - z_{n_{k_j}}\|. \qquad (10)$
Moreover, from Lemma 1 we have
$\frac{1}{\theta}\|z_{n_{k_j}} - x_{n_{k_j}+1}\| \ge \|z_{n_{k_j}} - \hat{x}_{n_{k_j}}\|, \ \text{therefore} \ \|z_{n_{k_j}} - \hat{x}_{n_{k_j}}\| \to 0 \ \text{as } j \to \infty,$
which implies that $\hat{x}_{n_{k_j}} \rightharpoonup w$. Since $\nabla f$ is uniformly continuous, we also have $\|\nabla f(\hat{x}_{n_{k_j}}) - \nabla f(z_{n_{k_j}})\| \to 0$ as $j \to \infty$. Combining this with (10), we obtain $\frac{\|\hat{x}_{n_{k_j}} - z_{n_{k_j}}\|}{\hat{\gamma}_{n_{k_j}}} \to 0$ as $j \to \infty$. Again, we know that
$\frac{z_{n_{k_j}} - \hat{x}_{n_{k_j}}}{\hat{\gamma}_{n_{k_j}}} - \nabla f(z_{n_{k_j}}) + \nabla f(\hat{x}_{n_{k_j}}) \in \partial g(\hat{x}_{n_{k_j}}) + \nabla f(\hat{x}_{n_{k_j}}) = \partial(f + g)(\hat{x}_{n_{k_j}}).$
The demiclosedness of $\operatorname{Gph}(\partial(f + g))$ implies that $0 \in \partial(f + g)(w)$, and hence $w \in S_*$. Therefore
$\limsup_{n\to\infty}\langle F(x^*) - x^*, z_n - x^*\rangle = \langle F(x^*) - x^*, w - x^*\rangle \le 0.$
Using (8), (9), and Lemma 2, we have
$\|z_n - x^*\|^2 = \|\alpha_n F(y_n) + (1 - \alpha_n)y_n - x^*\|^2$
$= \|\alpha_n(F(y_n) - F(x^*)) + \alpha_n(F(x^*) - x^*) + (1 - \alpha_n)(y_n - x^*)\|^2$
$\le \|\alpha_n(F(y_n) - F(x^*)) + (1 - \alpha_n)(y_n - x^*)\|^2 + 2\alpha_n\langle F(x^*) - x^*, z_n - x^*\rangle$
$\le (1 - (1 - c)\alpha_n)\|y_n - x^*\|^2 + 2\alpha_n\langle F(x^*) - x^*, z_n - x^*\rangle$
$\le (1 - (1 - c)\alpha_n)\|x_n - x^*\|^2 + \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\bigl(2\alpha_n\|x_n - x^*\| + \alpha_n\beta_n\|x_n - x_{n-1}\|\bigr) + 2\alpha_n\langle F(x^*) - x^*, z_n - x^*\rangle$
$\le (1 - (1 - c)\alpha_n)\|x_n - x^*\|^2 + (1 - c)\alpha_n\Bigl(\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_n - x^*\rangle\Bigr),$
for some $M_2 \ge 2\|x_n - x^*\| + \beta_n\|x_n - x_{n-1}\|$.
Hence,
$\|x_{n+1} - x^*\|^2 \le (1 - (1 - c)\alpha_n)\|x_n - x^*\|^2 + (1 - c)\alpha_n\Bigl(\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_n - x^*\rangle\Bigr). \qquad (11)$
We set $a_n = \|x_n - x^*\|^2$ and $\zeta_n = \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_n - x^*\rangle$ in Lemma 4. Since $\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} \to 0$ and $\limsup_{n\to\infty}\langle F(x^*) - x^*, z_n - x^*\rangle \le 0$, we have $\limsup_{n\to\infty}\zeta_n \le 0$. Consequently, Lemma 4 is applicable, and hence $\|x_n - x^*\|^2 \to 0$; that is, $\{x_n\}$ converges strongly to $x^*$.
Case 2. There exists a subsequence $\{x_{m_j}\}$ of $\{x_n\}$ such that $\|x_{m_j} - x^*\| < \|x_{m_j+1} - x^*\|$ for all $j \in \mathbb{N}$. By Lemma 3, there exists a nondecreasing sequence $\{n_k\}$ in $\mathbb{N}$ such that $\lim_{k\to\infty} n_k = \infty$ and the following hold for all sufficiently large $k \in \mathbb{N}$:
$\|x_{n_k} - x^*\| \le \|x_{n_k+1} - x^*\| \quad \text{and} \quad \|x_k - x^*\| \le \|x_{n_k+1} - x^*\|.$
From the definition of $z_{n_k}$ and (8) we have, for all $k \in \mathbb{N}$,
$\|z_{n_k} - x^*\|^2 \le \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \|y_{n_k} - x^*\|^2$
$\le \|x_{n_k} - x^*\|^2 + \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\bigl(2\alpha_{n_k}\|x_{n_k} - x^*\| + \alpha_{n_k}\beta_{n_k}\|x_{n_k} - x_{n_k-1}\|\bigr)$
$\le \|x_{n_k} - x^*\|^2 + \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \alpha_{n_k}\Bigl(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\,M_3\Bigr)$
$\le \|x_{n_k+1} - x^*\|^2 + \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \alpha_{n_k}\Bigl(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\,M_3\Bigr),$
for some $M_3 \ge 2\|x_{n_k} - x^*\| + \beta_{n_k}\|x_{n_k} - x_{n_k-1}\|$. Combining this with (9), we obtain
$\alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \alpha_{n_k}\Bigl(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\,M_3\Bigr) \ge (1 - 2\delta)\|x_{n_k+1} - z_{n_k}\|^2, \quad \forall k \in \mathbb{N}.$
So, since $\alpha_{n_k} \to 0$ and the sequences above are bounded, $\|x_{n_k+1} - z_{n_k}\| \to 0$ as $k \to \infty$. Since $\{z_{n_k}\}$ is bounded, there exists a subsequence $\{z_{n_{k_j}}\}$ such that $z_{n_{k_j}} \rightharpoonup w$ for some $w \in H$, and
$\limsup_{k\to\infty}\langle F(x^*) - x^*, z_{n_k} - x^*\rangle = \lim_{j\to\infty}\langle F(x^*) - x^*, z_{n_{k_j}} - x^*\rangle = \langle F(x^*) - x^*, w - x^*\rangle.$
Using the same argument as in Case 1, we have $w \in S_*$ and
$\limsup_{k\to\infty}\langle F(x^*) - x^*, z_{n_k} - x^*\rangle = \langle F(x^*) - x^*, w - x^*\rangle \le 0.$
Moreover, it follows from (11) that
$\|x_{n_k+1} - x^*\|^2 \le (1 - (1 - c)\alpha_{n_k})\|x_{n_k} - x^*\|^2 + (1 - c)\alpha_{n_k}\Bigl(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_{n_k} - x^*\rangle\Bigr)$
$\le (1 - (1 - c)\alpha_{n_k})\|x_{n_k+1} - x^*\|^2 + (1 - c)\alpha_{n_k}\Bigl(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_{n_k} - x^*\rangle\Bigr).$
Consequently, $\|x_{n_k+1} - x^*\|^2 \le \frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_{n_k} - x^*\rangle$. Hence,
$0 \le \limsup_{k\to\infty}\|x_k - x^*\|^2 \le \limsup_{k\to\infty}\|x_{n_k+1} - x^*\|^2 \le 0.$
Thus, we can conclude that $\{x_n\}$ converges strongly to $x^*$, and the proof is complete. □
Remark 1.
We observe that our main result, Theorem 1, is proved without condition A2 and with the weaker condition C2 in place of A3, while VIFBA requires all of these conditions.

4. Applications to Data Classification and Regression Problems

As mentioned in the literature, many real world problems can be formulated in the form of a convex minimization problem. So, in this section, we illustrate the reformulation process of some problems in machine learning, namely classification and regression problems, into a convex minimization problem, and apply our proposed algorithm to solve such problems. We also show that our proposed method is more efficient than some methods mentioned in the literature.
First, we give a brief overview of the extreme learning machine for data classification and regression problems; then we apply our main result to solve these two problems by conducting some numerical experiments. We also compare the performance of FISTA, VIFBA, and Algorithms 4 and 5.
Extreme learning machine (ELM). Let $S := \{(x_k, t_k) : x_k \in \mathbb{R}^n, t_k \in \mathbb{R}^m, k = 1, 2, \ldots, N\}$ be a training set of $N$ distinct samples, where $x_k$ is an input and $t_k$ is a target. For a single hidden layer of ELM, the output of the $i$-th hidden node is
$h_i(x) = G(a_i, b_i, x),$
where $G$ is an activation function and $a_i$ and $b_i$ are parameters of the $i$-th hidden node. The output function of ELM for single-hidden-layer feedforward networks (SLFNs) with $M$ hidden nodes is
$O_k = \sum_{i=1}^{M} \beta_i h_i(x_k),$
where $\beta_i$ is the output weight of the $i$-th hidden node. The hidden layer output matrix $\mathbf{H}$ is defined as follows:
$\mathbf{H} = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_M, b_M, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_M, b_M, x_N) \end{bmatrix}.$
The main goal of ELM is to find $\beta = [\beta_1^T, \ldots, \beta_M^T]^T$ such that $\mathbf{H}\beta = \mathbf{T}$, where $\mathbf{T} = [t_1^T, \ldots, t_N^T]^T$ is the training target. In some cases, finding $\beta = \mathbf{H}^{\dagger}\mathbf{T}$, where $\mathbf{H}^{\dagger}$ is the Moore–Penrose generalized inverse of $\mathbf{H}$, may be a difficult task when $\mathbf{H}^{\dagger}$ does not exist. Thus, finding such a solution $\beta$ by means of convex minimization can overcome this difficulty.
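For concreteness, a minimal NumPy sketch of how the hidden layer output matrix H can be formed is given below; the random initialization of the hidden-node parameters and the additive form $G(a_i, b_i, x) = \mathrm{sigmoid}(a_i \cdot x + b_i)$ are assumptions of this illustration (the paper only specifies the sigmoid activation and $M = 100$ hidden nodes):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def elm_hidden_matrix(X, M, seed=0):
    """Build the N x M hidden layer output matrix H with
    H[k, i] = G(a_i, b_i, x_k) = sigmoid(a_i . x_k + b_i),
    where the hidden-node parameters a_i, b_i are randomly generated."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    A = rng.standard_normal((n, M))      # input weights a_i (columns of A)
    b = rng.standard_normal(M)           # biases b_i
    return sigmoid(X @ A + b), (A, b)

# usage sketch: 10 random training points in [-4, 4] and their sine targets
X = np.random.uniform(-4, 4, size=(10, 1))
H, hidden_params = elm_hidden_matrix(X, M=100)
T = np.sin(X)
```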
In this section, we conduct some experiments on regression and classification problems; each problem is formulated as the following convex minimization problem:
$\text{Minimize:} \ \|\mathbf{H}\beta - \mathbf{T}\|_2^2 + \lambda\|\beta\|_1,$
where $\lambda$ is a regularization parameter. This problem is called the least absolute shrinkage and selection operator (LASSO) [26]. In this case, $f(x) = \|\mathbf{H}x - \mathbf{T}\|_2^2$ and $g(x) = \lambda\|x\|_1$. We note that, in our experiments, FISTA and VIFBA can be used to solve the problems, since the $L$-Lipschitz constants of the problems exist. However, FISTA and VIFBA fail to solve problems in which $L$-Lipschitz constants do not exist, while Algorithms 4 and 5 succeed.
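In this LASSO setting, the two ingredients required by the algorithm sketches above are the gradient of $f$ and the proximal operator of $g = \lambda\|\cdot\|_1$. A minimal sketch follows; the gradient formula follows from $f(x) = \|\mathbf{H}x - \mathbf{T}\|_2^2$, and the $\ell_1$ prox is the standard soft-thresholding operator (see the worked example in Section 2):

```python
import numpy as np

def make_lasso_oracles(H, T, lam):
    """Return grad_f and prox_g for f(x) = ||Hx - T||_2^2 and g(x) = lam*||x||_1,
    in the form expected by the algorithm sketches given earlier."""
    def grad_f(x):
        return 2.0 * H.T @ (H @ x - T)                 # gradient of the squared residual
    def prox_g(v, alpha):
        # prox of alpha*lam*||.||_1: componentwise soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
    return grad_f, prox_g
```

These two callables can be plugged directly into the FISTA, VIFBA, Algorithm 4 and Algorithm 5 sketches; for FISTA and VIFBA, a valid Lipschitz constant of $\nabla f$ is $L = 2\|\mathbf{H}\|^2$ (in the spectral norm), which matches the choice listed in Table 4.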

4.1. Regression of a Sine Function

Throughout Section 4.1 and Section 4.2, all parameters are chosen to satisfy all the hypotheses of Theorem 1. All experiments were performed on an Intel Core i5-7500 CPU with 16 GB RAM and a GeForce GTX 1060 6GB GPU.
As seen in Table 1, we randomly create 10 distinct points $x_1, x_2, \ldots, x_{10}$ with values in $[-4, 4]$, then we create the training set $S := \{\sin x_n : n = 1, \ldots, 10\}$ and use the graph of the sine function on $[-4, 4]$ as the target. The activation function is the sigmoid, the number of hidden nodes is $M = 100$, and the regularization parameter is $\lambda = 1 \times 10^{-5}$. We use FISTA, VIFBA, and Algorithms 4 and 5 to predict the sine function from the 10 training points.
The first experiment compares the performance of Algorithm 5 with different $c$-contractive mappings $F$, so that we can observe whether $F$ affects the performance of Algorithm 5. We use the mean squared error (MSE) as a measure, defined as follows:
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)^2.$
By setting $\sigma = 0.49$, $\delta = 0.1$, $\theta = 0.1$, the inertial parameter $\beta_n = \frac{1}{\|x_n - x_{n-1}\|^3 + n^3}$, and $\mathrm{MSE} \le 5 \times 10^{-3}$ as the stopping criterion, we obtain the results seen in Table 2.
We observe that Algorithm 5 performs better when c is closer to 1.
In the second experiment, we compare the performance of Algorithm 5 with different inertial parameters $\beta_n$ in Theorem 1, namely
$\beta_n^1 = 0, \quad \beta_n^2 = \frac{1}{n^2\|x_n - x_{n-1}\|}, \quad \beta_n^3 = \frac{1}{\|x_n - x_{n-1}\|^3 + n^3}, \quad \beta_n^4 = \frac{10^{10}}{\|x_n - x_{n-1}\|^3 + n^3 + 10^{10}}.$
It can be shown that $\beta_n^1$, $\beta_n^2$, $\beta_n^3$ and $\beta_n^4$ satisfy C2. By setting $\sigma = 0.49$, $\delta = 0.1$, $\theta = 0.1$, $F(x) = 0.999x$, and $\mathrm{MSE} \le 5 \times 10^{-3}$ as the stopping criterion, we obtain the results seen in Table 3.
We can clearly see that $\beta_n^4$ significantly improves the performance of Algorithm 5. Although $\beta_n^4$ converges to 0 as $n \to \infty$, its behavior differs from that of $\beta_n^2$ and $\beta_n^3$ during the first few iterations; that is, $\beta_n^4$ is extremely close to 1 while $\beta_n^2$ and $\beta_n^3$ are far from 1. Based on this experiment, we choose $\beta_n^4$ as our default inertial parameter for later experiments, as the short sketch below illustrates.
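A small sketch comparing the inertial options over the first few iterations; the value used for $\|x_n - x_{n-1}\|$ is purely hypothetical and only serves to show that $\beta_n^4$ stays near 1 while $\beta_n^2$ and $\beta_n^3$ decay quickly:

```python
def beta2(n, d):
    return 1.0 / (n**2 * d) if d > 0 else 0.0

def beta3(n, d):
    return 1.0 / (d**3 + n**3)

def beta4(n, d):
    return 1e10 / (d**3 + n**3 + 1e10)

# d stands in for ||x_n - x_{n-1}||; d = 10.0 is an illustrative value only
for n in range(1, 6):
    d = 10.0
    print(n, round(beta2(n, d), 6), round(beta3(n, d), 6), round(beta4(n, d), 10))
```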
In the third experiment, we compare the performance of FISTA, VIFBA, and Algorithms 4 and 5. We set the parameters for each algorithm as in Table 4.
By setting $\mathrm{MSE} \le 5 \times 10^{-3}$ as the stopping criterion, we obtain the results seen in Table 5.
We observe that Algorithm 5 takes only 129 iterations, while FISTA, VIFBA and Algorithm 4 require more iterations, and Algorithm 5 uses less training time than Algorithm 4.
Next, we compare each algorithm at the 3000th iteration with different kinds of measures, namely the mean absolute error (MAE) and the root mean squared error (RMSE), defined as follows:
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|\bar{y}_i - y_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)^2}.$
The results can be seen in Table 6.
We observe from Table 6 that Algorithm 5 has the lowest MAE and RMSE, but takes the longest training time. In Figure 2, we observe that Algorithm 5 outperforms the other algorithms in the regression of the graph of a sine function when the number of iterations is small. In Figure 3, it is shown that Algorithm 5, FISTA and VIFBA perform better in the regression of the graph than Algorithm 4 when the number of iterations is higher.

4.2. Data Classification

In this experiment, we classify the type of Iris plant from the Iris dataset created by Fisher [27]. As shown in Table 7, this dataset contains 3 classes of 50 instances each, and each sample contains four attributes.
We would also like to thank https://archive.ics.uci.edu for providing the dataset.
With this dataset, we set the sigmoid as the activation function, the number of hidden nodes $M = 100$, and the regularization parameter $\lambda = 1 \times 10^{-5}$. We use FISTA, VIFBA, and Algorithms 4 and 5 as the training algorithms to estimate the optimal weight $\beta$. The outputs $O$ for the training and testing data are obtained by $O = \mathbf{H}\beta$; see Table 8 for more detail.
In the first experiment, we use the first 35 instances of each class as training data and the last 15 of each class as testing data; see Table 9 for details.
The accuracy of the output data is calculated by:
$\text{accuracy} = \frac{\text{correctly predicted data}}{\text{all data}} \times 100\%.$
To compare the performance of FISTA, VIFBA, Algorithms 4 and 5, we choose parameters for each algorithm the same as in Table 4.
We first compare the accuracy of each method at the 700th iteration, and obtain the following results, as seen in Table 10.
As we see from Table 10, Algorithm 5 obtains the highest accuracy at the 700th iteration. We use acc.train and acc.test for the accuracy on the training data set and testing data set, respectively.
Next, we compare each method with the stopping criteria acc.train > 90 and acc.test > 90; the results can be seen in Table 11.
We observe from Table 11 that Algorithm 5 performs better than Algorithm 4.
In the next experiment, we use 10-fold stratified cross-validation to set up the training and testing data, see Table 12 for detail.
We also use the Average ACC and ERR% to evaluate the performance of each algorithm:
$\text{Average ACC} = \frac{1}{N}\sum_{i=1}^{N}\frac{x_i}{y_i} \times 100\%,$
where $N$ is the number of folds considered during cross-validation ($N = 10$), $x_i$ is the number of correctly predicted data at fold $i$, and $y_i$ is the number of all data at fold $i$.
Let $\mathrm{errL}_{sum}$ = sum of errors over all 10 training sets, $\mathrm{errT}_{sum}$ = sum of errors over all 10 testing sets, $L_{sum}$ = total number of data in the 10 training sets, and $T_{sum}$ = total number of data in the 10 testing sets. Define
$\mathrm{ERR}\% = (\mathrm{errL}\% + \mathrm{errT}\%)/2,$
where $\mathrm{errL}\% = \frac{\mathrm{errL}_{sum}}{L_{sum}} \times 100\%$ and $\mathrm{errT}\% = \frac{\mathrm{errT}_{sum}}{T_{sum}} \times 100\%$.
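A small sketch of how these two summary measures can be computed from per-fold counts (the lists of counts are placeholders for illustration):

```python
def average_acc(correct, total):
    """Average ACC over N folds: mean of per-fold accuracies, in percent."""
    return sum(c / t * 100.0 for c, t in zip(correct, total)) / len(correct)

def err_percent(err_train_sum, train_sum, err_test_sum, test_sum):
    """ERR% = (errL% + errT%) / 2, where errL% and errT% are the pooled error
    rates over the 10 training sets and the 10 testing sets, respectively."""
    errL = err_train_sum / train_sum * 100.0
    errT = err_test_sum / test_sum * 100.0
    return (errL + errT) / 2.0
```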
We choose the same parameters as in Table 4. We compare the accuracy at the 1000th iteration of each fold, and obtain the following results, as seen in Table 13.
We observe from Table 13 that Algorithm 5 has higher average accuracy than Algorithm 4.

5. Conclusions

In this work, algorithms for solving the convex minimization problem (1) were studied. Many effective algorithms for solving this problem have been proposed, and most of them require a Lipschitz continuity assumption on $\nabla f$. By combining the linesearch technique introduced by Cruz and Nghia [20] with the iterative method VIFBA of Verma et al. [19], we establish a new algorithm that does not require a Lipschitz continuity assumption on $\nabla f$. As a result, it can be applied to solve problems in which Lipschitz constants do not exist, while VIFBA and FISTA cannot. Moreover, by using viscosity approximation together with the inertial technique, our proposed algorithm has better convergence behavior than Algorithm 4. A strong convergence result for our proposed method is also proven under control conditions that are weaker than those of VIFBA.
Our algorithm can be used to solve many real world problems, such as image and signal processing and machine learning, especially regression and classification problems. To compare the performance of FISTA, VIFBA, Algorithm 4 and our proposed algorithm (Algorithm 5), we conduct numerical experiments on the latter problems. We observe from these experiments that Algorithms 4 and 5 take longer computational time than FISTA and VIFBA at the same number of iterations, because the linesearch step (Linesearch 1) takes a long time to compute. In the experiments with the stopping criteria (Table 5 and Table 11), Algorithm 5 converges to a solution in a lower number of iterations than Algorithm 4 and hence performs better in terms of speed. We can also observe that Algorithm 5 performs decently in terms of accuracy, especially when compared with Algorithm 4.
For our future research, since FISTA performs better than Algorithm 5 in terms of speed, in order to compete with FISTA, we aim to find a new linesearch technique that takes less computational time than Linesearch 1 and hence decreases the computational time of Algorithm 5.

Author Contributions

Writing—review and editing, W.I.; supervision, S.S.; writing—original draft preparation, P.S.; software, D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Chiang Mai University, Chiang Mai, Thailand.

Acknowledgments

This work was supported by Chiang Mai University, Chiang Mai, Thailand.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Byrne, C. Iterative oblique projection onto convex subsets and the split feasibility problem. Inverse Probl. 2002, 18, 441–453.
2. Combettes, P.L.; Wajs, V. Signal recovery by proximal forward–backward splitting. Multiscale Model. Simul. 2005, 4, 1168–1200.
3. Combettes, P.L.; Pesquet, J.C. Proximal Splitting Methods in Signal Processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering; Bauschke, H., Burachik, R., Combettes, P., Elser, V., Luke, D., Wolkowicz, H., Eds.; Springer: New York, NY, USA, 2011; pp. 185–212.
4. Cholamjiak, P.; Shehu, Y. Inertial forward-backward splitting method in Banach spaces with application to compressed sensing. Appl. Math. 2019, 64, 409–435.
5. Szaleniec, M.; Tadeusiewicz, R.; Witkoa, M. How to select an optimal neural model of chemical reactivity? Neurocomputing 2008, 72, 241–256.
6. Szaleniec, J.; Wiatr, M.; Szaleniec, M.; Składzień, J.; Tomik, J.; Oleś, K.; Tadeusiewicz, R. Artificial neural network modelling of the results of tympanoplasty in chronic suppurative otitis media patients. Comput. Biol. Med. 2013, 43, 16–22.
7. Pławiak, P.; Abdar, M.; Acharya, U.R. Application of new deep genetic cascade ensemble of SVM classifiers to predict the Australian credit scoring. Appl. Soft Comput. 2019, 84, 105740.
8. Pławiak, P.; Abdar, M.; Pławiak, J.; Makarenkov, V.; Acharya, U.R. DGHNL: A new deep genetic hierarchical network of learners for prediction of credit scoring. Inf. Sci. 2020, 516, 401–418.
9. Lions, P.L.; Mercier, B. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 1979, 16, 964–979.
10. Bussaban, L.; Suantai, S.; Kaewkhao, A. A parallel inertial S-iteration forward-backward algorithm for regression and classification problems. Carpathian J. Math. 2020, 36, 21–30.
11. Moudafi, A.; Oliny, M. Convergence of a splitting inertial proximal method for monotone operators. J. Comput. Appl. Math. 2003, 155, 447–454.
12. Verma, M.; Shukla, K.K. A new accelerated proximal gradient technique for regularized multitask learning framework. Pattern Recogn. Lett. 2017, 95, 98–103.
13. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202.
14. Liang, J.; Schonlieb, C.B. Improving fista: Faster, smarter and greedier. arXiv 2018, arXiv:1811.01430.
15. Moudafi, A. Viscosity approximation method for fixed-points problems. J. Math. Anal. Appl. 2000, 241, 46–55.
16. Nakajo, K.; Takahashi, W. Strong convergence theorems for nonexpansive mappings and nonexpansive semigroups. J. Math. Anal. Appl. 2003, 279, 372–379.
17. Takahashi, W.; Zembayashi, K. Strong Convergence Theorem by a New Hybrid Method for Equilibrium Problems and Relatively Nonexpansive Mappings. Fixed Point Theory Appl. 2008, 2008, 528476.
18. Halpern, B. Fixed points of nonexpansive maps. Bull. Am. Math. Soc. 1967, 73, 957–961.
19. Verma, M.; Sahu, D.R.; Shukla, K.K. VAGA: A novel viscosity-based accelerated gradient algorithm. Appl. Intell. 2018, 48, 2613–2627.
20. Bello Cruz, J.Y.; Nghia, T.T. On the convergence of the forward-backward splitting method with linesearches. Optim. Methods Softw. 2016, 31, 1209–1238.
21. Burachik, R.S.; Iusem, A.N. Set-Valued Mappings and Enlargements of Monotone Operators; Springer: Berlin, Germany, 2008.
22. Huang, Y.; Dong, Y. New properties of forward-backward splitting and a practical proximal-descent algorithm. Appl. Math. Comput. 2014, 237, 60–68.
23. Takahashi, W. Introduction to Nonlinear and Convex Analysis; Yokohama Publishers: Yokohama, Japan, 2009.
24. Mainge, P.E. Strong convergence of projected subgradient methods for nonsmooth and nonstrictly convex minimization. Set-Valued Anal. 2008, 16, 899–912.
25. Xu, H.K. Another control condition in an iterative method for nonexpansive mappings. Bull. Austral. Math. Soc. 2002, 65, 109–113.
26. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B Methodol. 1996, 58, 267–288.
27. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
Figure 1. Diagram of Algorithm 5.
Figure 2. A regression of a sine function at the 130th iteration.
Figure 3. A regression of a sine function at the 3000th iteration.
Table 1. Detail about the regression of a sine function experiment.
Training set: Create a training matrix R = [x1 x2 … x10]T, where x1, x2, …, x10 ∈ [−4, 4] are randomly generated. Create the training target matrix S = [sin x1 sin x2 … sin x10]T.
Testing set: Create the testing matrix V = [−4 −3.99 −3.98 … 4]T.
Target set: Create T = [sin(−4) sin(−3.99) … sin(4)]T as the target matrix.
Learning process: Generate the hidden layer output matrix H1 of the training matrix R with 100 hidden nodes using the sigmoid activation function. Pick a regularization parameter λ and formulate the convex minimization problem: Minimize $\|\mathbf{H}_1\beta - S\|_2^2 + \lambda\|\beta\|_1$. Find the optimal weight β* of this problem using Algorithm 5 with a certain number of iterations.
Testing process: Generate the hidden layer output matrix H2 of the testing matrix V with 100 hidden nodes using the sigmoid activation function. Calculate the output O = H2β*. Calculate the MSE, MAE and RMSE of the output O and the target matrix T.
Table 2. Numerical results of c-contractive mapping.
F(x) = cx | Iteration No. | Training Time | MSE
c = 0.5 | 25143 | 4.8376 | 5 × 10⁻³
c = 0.8 | 14761 | 2.7909 | 5 × 10⁻³
c = 0.9 | 11956 | 2.3115 | 5 × 10⁻³
c = 0.999 | 9649 | 1.7992 | 5 × 10⁻³
Table 3. Numerical results of each inertial parameter.
Parameter | Iteration No. | Training Time | MSE
$\beta_n^1$ | 9635 | 1.7477 | 5 × 10⁻³
$\beta_n^2$ | 9671 | 1.8107 | 5 × 10⁻³
$\beta_n^3$ | 9649 | 1.7992 | 5 × 10⁻³
$\beta_n^4$ | 129 | 0.0277 | 4.2 × 10⁻³
Table 4. Chosen parameters of each algorithm.
Parameter | FISTA | VIFBA | Algorithm 4 | Algorithm 5
$t_1 = 1$ | ✓ | – | – | –
$L = 2\|\mathbf{H}_1\|^2$ | ✓ | – | – | –
$\sigma = 0.49$ | – | – | ✓ | ✓
$\delta = 0.1$ | – | – | ✓ | ✓
$\theta = 0.1$ | – | – | ✓ | ✓
$\gamma_n = \frac{1}{2\|\mathbf{H}_1\|^2}$ | – | ✓ | – | –
$\alpha_n = \frac{1}{n}$ | – | ✓ | – | ✓
$\beta_n = \frac{10^{10}}{\|x_n - x_{n-1}\|^3 + n^3 + 10^{10}}$ | – | ✓ | – | ✓
Table 5. Numerical results of a regression of a sine function with the stopping criteria.
Algorithm | Iteration No. | Training Time | MSE
FISTA | 305 | 0.0055 | 4.8 × 10⁻³
VIFBA | 419 | 0.008 | 5 × 10⁻³
Algorithm 4 | 9592 | 1.9681 | 5 × 10⁻³
Algorithm 5 | 129 | 0.0277 | 4.2 × 10⁻³
Table 6. Numerical results of a regression of a sine function at the 3000th iteration.
Algorithm | Iteration No. | Training Time | MAE | RMSE
FISTA | 3000 | 0.1248 | 0.0169 | 0.0326
VIFBA | 3000 | 0.1660 | 0.0236 | 0.0364
Algorithm 4 | 3000 | 0.6474 | 0.2496 | 0.3224
Algorithm 5 | 3000 | 0.6589 | 0.0169 | 0.0310
Table 7. Iris dataset.
Class (50 samples each): Iris setosa, Iris versicolor, Iris virginica.
Attributes (in centimeters): sepal length, sepal width, petal length, petal width.
Table 8. Details about the classification of the Iris dataset experiment.
Training set: Create the N × 1 training matrix S of the numbers 1, 2 and 3 according to the training set, where N is the number of training samples. Create the N × 4 training attribute matrix R according to S and the attribute data.
Testing set: Create the M × 1 testing matrix T of the numbers 1, 2 and 3 according to the testing set, where M is the number of testing samples. Create the M × 4 testing attribute matrix V according to T and the attribute data.
Learning process: Generate the hidden layer output matrix H1 of the training attribute matrix R with 100 hidden nodes using the sigmoid activation function. Choose λ = 1 × 10⁻⁵ and formulate the convex minimization problem: Minimize $\|\mathbf{H}_1\beta - S\|_2^2 + \lambda\|\beta\|_1$. Find the optimal weight β* of this problem using Algorithm 5 as the learning algorithm with a certain number of iterations.
Testing process: Calculate the output O1 = H1β*. Calculate the number of correctly predicted samples between the output O1 and the training matrix S. Generate the hidden layer output matrix H2 of the testing attribute matrix V with 100 hidden nodes using the sigmoid activation function. Calculate the output O2 = H2β*. Calculate the number of correctly predicted samples between the output O2 and the testing matrix T. Calculate acc.train, acc.test, Average ACC and ERR%.
The numbers 1, 2 and 3 represent Iris setosa, Iris versicolor and Iris virginica, respectively.
Table 9. Training and testing sets of the Iris dataset.
Class | Training Data | Testing Data
Iris setosa | 35 | 15
Iris versicolor | 35 | 15
Iris virginica | 35 | 15
Sum | 105 | 45
Table 10. The performance of each algorithm at the 700th iteration.
Algorithm | Iteration No. | Training Time | acc.train | acc.test
FISTA | 700 | 0.0288 | 98.10 | 100
VIFBA | 700 | 0.0416 | 98.10 | 100
Algorithm 4 | 700 | 0.3867 | 95.24 | 97.78
Algorithm 5 | 700 | 0.4172 | 99.05 | 100
Table 11. The performance of each algorithm with the stopping criteria.
Algorithm | Iteration No. | Training Time | acc.train | acc.test
FISTA | 61 | 0.0038 | 91.43 | 91.11
VIFBA | 32 | 0.0021 | 91.43 | 91.11
Algorithm 4 | 368 | 0.0816 | 90.48 | 91.11
Algorithm 5 | 65 | 0.0584 | 93.33 | 93.33
Table 12. Training and testing sets for 10-fold stratified cross-validation (each of groups 1–10).
Class | Training Set | Testing Set
Iris setosa | 45 | 5
Iris versicolor | 45 | 5
Iris virginica | 45 | 5
Sum | 135 | 15
Table 13. The performance of each algorithm at the 1000th iteration with a 10-fold stratified cross-validation (entries are acc.train/acc.test).
Fold | FISTA | VIFBA | Algorithm 4 | Algorithm 5
Fold 1 | 97.78/100 | 97.78/100 | 97.78/100 | 97.78/100
Fold 2 | 99.26/93.33 | 99.26/93.33 | 99.26/93.33 | 98.52/93.33
Fold 3 | 97.78/100 | 97.78/100 | 95.56/100 | 98.52/100
Fold 4 | 99.26/93.33 | 99.26/93.33 | 97.04/93.33 | 97.78/86.67
Fold 5 | 98.52/100 | 98.52/100 | 97.78/100 | 98.52/100
Fold 6 | 98.52/100 | 98.52/100 | 98.52/100 | 98.52/100
Fold 7 | 98.52/86.67 | 98.52/86.67 | 98.52/73.33 | 98.52/93.33
Fold 8 | 98.52/100 | 98.52/100 | 98.52/100 | 98.52/100
Fold 9 | 98.52/100 | 97.78/100 | 97.04/100 | 97.78/100
Fold 10 | 98.52/100 | 98.52/100 | 97.78/100 | 97.78/100
Average Acc | 98.52/97.333 | 98.446/97.333 | 97.778/96 | 98.222/97.333
ERR% | 2.074 | 2.112 | 3.111 | 2.223
Training time (sec.) | 0.0413 | 0.0625 | 0.5673 | 0.7051
