An Accelerated Convex Optimization Algorithm with Line Search and Applications in Machine Learning

Abstract: In this paper, we introduce a new line search technique and employ it to construct a novel accelerated forward–backward algorithm for solving convex minimization problems of the form of the sum of two convex functions, one of which is smooth, in a real Hilbert space. We establish weak convergence of the proposed algorithm to a solution without the Lipschitz assumption on the gradient of the smooth part of the objective function. Furthermore, we analyze its performance by applying the proposed algorithm to classification problems on various data sets and comparing it with other line search algorithms. Based on the experiments, the proposed algorithm performs better than the other line search algorithms.


Introduction
The convex minimization problem in the form of the sum of two convex functions plays a very important role in machine learning. This problem has been analyzed and studied by many authors because of its applications in various fields such as data science, computer science, statistics, engineering, physics, and medical science. Some examples of these applications are signal processing, compressed sensing, medical image reconstruction, digital image processing, and data prediction and classification; see [1][2][3][4][5][6][7][8].
As we know, in machine learning, especially in data prediction and classification problems, the main objective is to minimize loss functions. Many loss functions can be viewed as convex functions; thus, by employing convex minimization, one can find the minimum of such functions, which in turn solves data prediction and classification problems. Many works have implemented this strategy; see [9][10][11] and the references therein for more information. In this work, we apply the extreme learning machine together with the least absolute shrinkage and selection operator to solve classification problems; more detail will be discussed in a later section. First, we introduce the convex minimization problem, which can be formulated in the following form:

min_{x ∈ H} f(x) + g(x),   (1)

where f : H → R ∪ {+∞} is proper, convex, and differentiable on an open set containing dom(g), and g : H → R ∪ {+∞} is a proper, lower semicontinuous convex function defined on a real Hilbert space H.
A solution of (1) is in fact a fixed point of the operator prox_{αg}(I − α∇f), i.e.,

x* = prox_{αg}(x* − α∇f(x*)),   (2)

where α > 0 and prox_{αg}(I − α∇f)(x) = argmin_{y ∈ H} { g(y) + (1/(2α)) ‖(x − α∇f(x)) − y‖² }, which is known as the forward-backward operator. In order to solve (1), the forward-backward algorithm [12] was introduced as follows:

x_{n+1} = prox_{α_n g}(x_n − α_n ∇f(x_n)),   (3)

where α_n is a positive number. If ∇f is L-Lipschitz continuous and α_n ∈ (0, 2/L), then a sequence generated by (3) converges weakly to a solution of (1). There are several techniques that can improve the performance of (3). For instance, we can utilize an inertial step, which was first introduced by Polyak [13], to solve smooth convex minimization problems. Since then, several works have included an inertial step in their algorithms to accelerate the convergence behavior; see [14][15][16][17][18][19] for examples.
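To make the forward-backward iteration (3) concrete, the following Python sketch applies it to the ℓ1-regularized least-squares model, where the proximal operator of g(x) = λ‖x‖_1 is componentwise soft-thresholding. The matrix A, the vector b, and the constant step size alpha are assumptions made for this illustration, not quantities from the paper.

import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1 (componentwise soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def forward_backward(A, b, lam, alpha, n_iter=500):
    """Forward-backward iteration (3) for min 0.5*||Ax - b||^2 + lam*||x||_1.

    Here f(x) = 0.5*||Ax - b||^2 (smooth part) and g(x) = lam*||x||_1,
    with a constant step size alpha in (0, 2/L), where L = ||A||_2^2.
    """
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                            # forward (gradient) step on f
        x = soft_threshold(x - alpha * grad, alpha * lam)   # backward (proximal) step on g
    return x

# Illustrative usage with synthetic data (assumed, not from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
b = rng.standard_normal(50)
L = np.linalg.norm(A, 2) ** 2
x_hat = forward_backward(A, b, lam=0.1, alpha=1.0 / L)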
One of the most famous forward-backward-type algorithms that implements an inertial step is the fast iterative shrinkage-thresholding algorithm (FISTA) [20]. It is defined as the following Algorithm 1.
Algorithm 1 (FISTA). Input: y_1 = x_0 ∈ R^n and t_1 = 1. For n ∈ N, compute
x_n = prox_{(1/L) g}(y_n − (1/L)∇f(y_n)),
t_{n+1} = (1 + √(1 + 4t_n²))/2,
θ_n = (t_n − 1)/t_{n+1},
y_{n+1} = x_n + θ_n(x_n − x_{n−1}),
where L is a Lipschitz constant of ∇f.
The term x_n + θ_n(x_n − x_{n−1}) is known as an inertial term with an inertial parameter θ_n. It has been shown that FISTA performs better than (3). Later, other forward-backward-type algorithms were introduced and studied by many authors; see, for instance, [2,8,18,21,22]. However, most of these works assume the Lipschitz continuity of ∇f, which is, in general, difficult to verify or compute. Therefore, in this paper, we focus on another approach in which ∇f is not necessarily Lipschitz continuous.
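As a hedged numerical sketch, the FISTA iteration with this inertial term can be written as follows for the ℓ1-regularized least-squares model; soft_threshold is the helper from the earlier sketch, and the data A, b, and the parameter lam are again assumptions made only for illustration.

import numpy as np

def fista(A, b, lam, n_iter=500):
    """FISTA (Algorithm 1) for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of the gradient of f
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - grad / L, lam / L)   # forward-backward step at y_n
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        theta = (t - 1.0) / t_next                  # inertial parameter theta_n
        y = x + theta * (x - x_prev)                # inertial (momentum) step
        x_prev, t = x, t_next
    return x_prev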
They asserted that Line Search 1 stops after finitely many steps and proposed the following Algorithm 3.

Algorithm 3. Algorithm with Line Search 1.
They also showed that the sequence {x_n} defined by Algorithm 3 converges weakly to a solution of (1) under Assumptions A1 and A2, where:
A1. f, g are proper lower semicontinuous convex functions with dom(g) ⊆ dom(f);
A2. f is differentiable on an open set containing dom(g), and ∇f is uniformly continuous on any bounded subset of dom(g) and maps any bounded subset of dom(g) to a bounded set in H.
It is noted that the L-Lipschitz continuity of ∇f is not assumed. Moreover, if ∇f is L-Lipschitz continuous, then A2 is satisfied.
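As a short sketch of why the last remark holds (our own reasoning, not reproduced from the paper): if ∇f is L-Lipschitz continuous, then for any bounded set B ⊆ dom(g), any x, y ∈ B, and any fixed x_0 ∈ B,

\[
\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|
\qquad\text{and}\qquad
\|\nabla f(x)\| \le \|\nabla f(x_0)\| + L\,\mathrm{diam}(B),
\]

so ∇f is uniformly continuous on B (take δ = ε/L) and maps B into a bounded set, which is exactly A2.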
They also asserted that Line Search 2 stops after finitely many steps and proposed the following Algorithm 5.
Algorithm 5. Algorithm with Line Search 2.
A weak convergence result for this algorithm was obtained under Assumptions A1 and A2. Although Algorithms 3 and 5 obtain weak convergence results without the Lipschitz assumption on ∇f, neither algorithm utilizes an inertial step. Therefore, improvements of their convergence behavior using this technique are interesting to investigate.
Motivated by the works mentioned earlier, we aim to introduce a new line search technique and prove that it is well defined. Then, we employ it to construct a novel forward-backward algorithm that utilizes an inertial step to improve its performance over the other line search algorithms. We prove a weak convergence theorem for the proposed algorithm without the Lipschitz assumption on ∇f and apply it to solve classification problems on various data sets. We also compare its performance with Algorithms 3 and 5 to show that the proposed algorithm performs better.
This work is organized as follows: In Section 2, we recall some important definitions and lemmas used in later sections. In Section 3, we introduce a new line search technique and algorithm for solving (1). Then, we analyze the convergence and complexity of the proposed algorithm under Assumptions A1 and A2. In Section 4, we apply the proposed algorithm to solve data classification problems and compare its performance with other algorithms. Finally, the conclusion of this work is presented in Section 5.

Preliminaries
In this section, some important definitions and lemmas, which will be used in later sections, are presented.
Let {x_n} be a sequence in H and x ∈ H. We denote by x_n → x and x_n ⇀ x the strong and weak convergence of {x_n} to x, respectively. Let f : H → R ∪ {+∞} be a proper lower semicontinuous convex function. We denote dom(f) := {x ∈ H : f(x) < +∞}. The proximal operator prox_{αf} : H → dom(f) is defined by

prox_{αf}(x) := argmin_{y ∈ H} { f(y) + (1/(2α)) ‖x − y‖² } = (I + α∂f)^{−1}(x),

where I is the identity mapping and α is a positive number. It is well known that this operator is single-valued, nonexpansive, and satisfies

x − prox_{αf}(x) ∈ α ∂f(prox_{αf}(x)), for all x ∈ H and α > 0;   (4)

see [23] for more details. Next, we present some important lemmas for this work.
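Before turning to the lemmas, the following is a hedged numerical illustration of the proximal operator and property (4) for g(x) = λ‖x‖_1 (the test point, α, and λ are assumptions made for the example): in this case prox_{αg} is componentwise soft-thresholding, and the subgradient inclusion (4) can be checked directly.

import numpy as np

def prox_l1(x, alpha, lam):
    # prox_{alpha*g}(x) with g = lam*||.||_1: componentwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - alpha * lam, 0.0)

# Numerical check of property (4): (x - prox_{alpha*g}(x)) / alpha must be a
# subgradient of g at p = prox_{alpha*g}(x), i.e. each component lies in
# lam*[-1, 1] and equals lam*sign(p_i) whenever p_i != 0.
x = np.array([1.5, -0.2, 0.7, -3.0])
alpha, lam = 0.5, 1.0
p = prox_l1(x, alpha, lam)
v = (x - p) / alpha
ok = np.all(np.abs(v) <= lam + 1e-12) and np.all(np.isclose(v[p != 0], lam * np.sign(p[p != 0])))
print(p, v, ok)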

Main Results
In this section, we define a new line search technique and a new accelerated algorithm with the new line search for solving (1). We denote by S* the set of all solutions of (1) and suppose that f, g : H → R ∪ {+∞} are two convex functions satisfying Assumptions A1 and A2 and that dom(g) is closed. Furthermore, we also suppose that S* ≠ ∅.
We first introduce a new line search technique as the following Algorithm 6.
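Since the formal statement of Line Search 3 (Algorithm 6) is given in the algorithm environment, the following Python sketch only illustrates the general flavor of backtracking line searches of this type: starting from a trial step size σ, the step is repeatedly shrunk by a factor θ ∈ (0, 1) until a descent-type criterion controlled by δ holds. The specific acceptance test shown (a bound on the gradient variation relative to the step, in the spirit of Line Searches 1 and 2) is an assumption made for illustration, not the paper's exact rule.

import numpy as np

def backtracking_step(x, grad_f, prox_g, sigma=1.0, theta=0.5, delta=0.1, max_iter=50):
    """Generic backtracking step-size search for a forward-backward step.

    Shrinks gamma = sigma * theta**k until the (assumed) criterion
        gamma * ||grad_f(p) - grad_f(x)|| <= delta * ||p - x||
    holds, where p = prox_g(x - gamma * grad_f(x), gamma).
    """
    gamma = sigma
    gx = grad_f(x)
    for _ in range(max_iter):
        p = prox_g(x - gamma * gx, gamma)
        if np.linalg.norm(p - x) == 0.0:            # x is already a fixed point
            return gamma, p
        if gamma * np.linalg.norm(grad_f(p) - gx) <= delta * np.linalg.norm(p - x):
            return gamma, p                          # acceptance criterion satisfied
        gamma *= theta                               # shrink the trial step size
    return gamma, p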
We first show that Line Search 3 terminates after finitely many steps.

Lemma 7. Line Search 3 stops after finitely many steps.
Proof. If x ∈ S*, then x = L(x, σ) = S(x, σ), so Line Search 3 stops with zero steps. If x ∉ S*, suppose by contradiction that, for all n ∈ N, (5) or (6) holds. Then, from these assumptions, we can find a subsequence {σθ^{n_k}} of {σθ^n} such that (5) or (6) holds. First, we show that

To complete the proof, we consider the only two possible cases to find a contradiction.
Case 1: Suppose that there exists a subsequence {σθ^{n_k}} of {σθ^n} such that (5) holds for all k ∈ N. Then, it follows that ‖S(x, σθ^{n_k}) − L(x, σθ^{n_k})‖ → 0 and ‖L(x, σθ^{n_k}) − x‖ → 0 as k → +∞. Since ∇f is uniformly continuous, we obtain:

Thus,

Case 2: Suppose that there is a subsequence {σθ^{n_k}} of {σθ^n} satisfying (6) for all k ∈ N. Then, ‖L(x, σθ^{n_k}) − x‖ → 0 as k → +∞. Again, from the uniform continuity of ∇f, we have

as k → +∞. From (6), we conclude that

as k → +∞. By the same argument as in Case 1, we can show that 0 ∈ ∂(f + g)(x), and hence, x ∈ S*, a contradiction. Therefore, we conclude that Line Search 3 stops after finitely many steps, and the proof is complete.
We propose a new inertial algorithm with Line Search 3 as the following Algorithm 7.
Algorithm 7. Inertial algorithm with Line Search 3.
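A minimal sketch of one iteration of an inertial forward-backward scheme with a backtracking line search is given below; it only combines the inertial extrapolation with the backtracking_step helper sketched earlier and is not the exact update of Algorithm 7 (in particular, the projection onto dom(g) and the precise update rules are omitted, and names such as beta are assumptions for illustration).

import numpy as np

def inertial_fb_step(x_curr, x_prev, beta, grad_f, prox_g, sigma=1.0, theta=0.5, delta=0.1):
    """One illustrative inertial forward-backward step with backtracking.

    Not the exact update of Algorithm 7; it only combines the two ingredients
    discussed in the text: an inertial extrapolation and a line-search-determined
    forward-backward step.
    """
    w = x_curr + beta * (x_curr - x_prev)                     # inertial extrapolation
    gamma, x_next = backtracking_step(w, grad_f, prox_g,      # step size from the line search
                                      sigma=sigma, theta=theta, delta=delta)
    return x_next, gamma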
The diagram of Algorithm 7 can be seen in Figure 1. Next, we prove the following lemma, which plays a crucial role in our main theorems.

Lemma 8. Let γ_n := Line Search 3(z_n, δ, σ, θ). Then, for all n ∈ N and x ∈ dom(g), the following hold:

where v_n = prox_{γ_n g}(w_n − γ_n ∇f(w_n)).
Proof. First, we show that (I) is true. From (4), we know that

Moreover, it follows from the definition of ∂g that, for all n ∈ N,
From Lemma 3, we have

and hence, for all n ∈ N,

Then, it follows that, for any x ∈ dom(g),

and (I) is proven. Next, we show (II). From (4), we have that

Then,

The above inequalities imply

for all x ∈ dom(g) and n ∈ N. Hence,

Moreover, from Lemma 3, we have, for all n ∈ N,

and

As a result, we obtain

for all x ∈ dom(g) and n ∈ N. Therefore,

for all x ∈ dom(g) and n ∈ N, and hence, (II) is proven.
Next, we prove the weak convergence result of Algorithm 7.
Theorem 9. Let {x_n} be a sequence generated by Algorithm 7. Suppose that the following hold:

B2. There exists γ > 0 such that γ_n ≥ γ, for all n ∈ N.
Then, {x n } converges weakly to some point in S * .
Proof. Let x* ∈ S*; obviously, x* ∈ dom(g). The following are direct consequences of Lemma 8:

and

where v_n = prox_{γ_n g}(w_n − γ_n ∇f(w_n)). Then, we have

Next, we show that lim_{n→+∞} ‖x_n − x*‖ exists. Since P_{dom(g)} is nonexpansive, we have

By using Lemma 4, we have that {x_n} is bounded. Consequently, the corresponding sum is finite and, by (10) together with Lemma 5, we conclude that lim_{n→+∞} ‖x_n − x*‖ exists. Since x_n ∈ dom(g) for all n ∈ N, we obtain ‖y_n − z_n‖ ≤ ‖y_n − x_n‖ for all n ∈ N, lim_{n→+∞} ‖w_n − z_n‖ = 0, and hence, lim_{n→+∞} ‖x_n − w_n‖ = 0.
We claim that every weak cluster point of {x_n} belongs to S*. To prove this claim, let w be a weak cluster point of {x_n}. Then, there exists a subsequence {x_{n_k}} of {x_n} such that x_{n_k} ⇀ w, and hence, w_{n_k} ⇀ w. Next, we show that w ∈ S*. From A2, we know that ∇f is uniformly continuous, so lim_{k→+∞} ‖∇f(w_{n_k}) − ∇f(z_{n_k})‖ = 0. From (4), we also have

Hence,

By letting k → +∞ in the above inequality, we can conclude from (1) that 0 ∈ ∂(f + g)(w), and hence, w ∈ S*. It follows directly from Lemma 6 that {x_n} converges weakly to a point in S*, and the proof is now complete.
We also introduce Algorithm 8, a variant of the proposed method that uses Line Search 3 without the inertial step. The diagram of Algorithm 8 can be seen in Figure 2. We next prove the complexity of Algorithm 8.
Theorem 10. Let {x_n} be a sequence generated by Algorithm 8. Suppose that there exists γ > 0 such that γ_n ≥ γ for all n ∈ N; then {x_n} converges weakly to a point in S*. In addition, if δ ∈ (0, 1/16), then the following also holds:

for all n ∈ N.
Proof. The weak convergence of {x_n} is guaranteed by Theorem 9. It remains to show that (11) is true. Let v_n = prox_{γ_n g}(w_n − γ_n ∇f(w_n)) and x* ∈ S*. We first show that (f + g)(x_{k+1}) ≤ (f + g)(x_k) for all k ∈ N. We know that x_k = z_k in Lemma 8, so for any x ∈ dom(g) and k ∈ N, we have:

and

Putting x = x_k in (12) and (13), we obtain

and

respectively. Substituting x with w_k in (13), we obtain

By summing (15) and (16), we obtain

It follows from (14) and (17) that

respectively, for all k ∈ N. Hence,

for all k ∈ N. Hence, {(f + g)(x_k)} is a non-increasing sequence. Now, put x = x* in (12) and (13); then we obtain

and

Inequalities (19) and (20) imply that

for all k ∈ N. Summing the above inequality over k = 1, 2, 3, ..., n − 1, we obtain

for all n ∈ N. Hence,

Since x* is arbitrarily chosen from S*, we obtain

for all n ∈ N, and the proof is now complete.

Some Applications to Data Classification
In this section, we apply Algorithms 3, 5, 7, and 8 to solve some classification problems based on a learning technique called the extreme learning machine (ELM) introduced by Huang et al. [28]. It is formulated as follows: Let {(x_k, t_k) : k = 1, ..., N} be a set of N samples, where x_k is an input and t_k is a target. A simple mathematical model for the output of ELM for single-hidden-layer feedforward networks (SLFNs) with M hidden nodes and activation function G is defined by

o_j = Σ_{i=1}^{M} η_i G(w_i · x_j + b_i), j = 1, ..., N,

where w_i is the weight that connects the i-th hidden node and the input node, η_i is the weight connecting the i-th hidden node and the output node, and b_i is the bias. The hidden layer output matrix H is defined by

H = [G(w_i · x_j + b_i)]_{j=1,...,N; i=1,...,M}.

The main objective of ELM is to calculate an optimal weight η = [η_1^T, ..., η_M^T]^T such that Hη = T, where T = [t_1^T, ..., t_N^T]^T is the training target. If the Moore-Penrose generalized inverse H† of H exists, then η = H†T is a solution. However, in general, H† may not exist or may be difficult to compute. Thus, in order to avoid such difficulties, we transform the problem into a convex minimization problem and use our proposed algorithm to find the solution η without computing H†.
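The following Python sketch shows how the hidden layer output matrix H can be formed with random input weights and biases and a sigmoid activation, as in the ELM setting described above; the array shapes and the random initialization are assumptions made for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_hidden_matrix(X, M, rng=None):
    """Build the N x M hidden layer output matrix H = [G(w_i . x_j + b_i)].

    X : (N, d) array of input samples; M : number of hidden nodes.
    The input weights w_i and biases b_i are drawn at random, as in ELM.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, M))   # input-to-hidden weights w_i
    b = rng.uniform(-1.0, 1.0, size=M)        # hidden node biases b_i
    return sigmoid(X @ W + b), (W, b)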
In machine learning, a model can overfit in the sense that it is very accurate on a training set but inaccurate on a testing set; in other words, it cannot be used to predict unknown data. In order to prevent overfitting, the least absolute shrinkage and selection operator (LASSO) [29] is used. It can be formulated as follows:

min_η ‖Hη − T‖²_2 + λ‖η‖_1,   (23)

where λ is a regularization parameter. If we set f(x) := ‖Hx − T‖²_2 and g(x) := λ‖x‖_1, then problem (23) reduces to problem (1). Hence, we can use our algorithm as a learning method to find the optimal weight η and solve classification problems.
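Putting the pieces together, here is a hedged end-to-end sketch: the ELM output weights η are trained by solving the LASSO problem (23) with the fista routine sketched earlier (standing in for the paper's line search algorithms), and prediction takes the class with the largest output. One-hot targets T_train, the helpers sigmoid, elm_hidden_matrix, and fista, and all parameter values are assumptions for illustration.

import numpy as np

def train_elm_lasso(X_train, T_train, M=30, lam=0.1, n_iter=200, rng=0):
    """Train ELM output weights by solving min ||H eta - T||^2 + lam*||eta||_1."""
    H, params = elm_hidden_matrix(X_train, M, rng=rng)
    # One column of eta per class: solve the LASSO problem column by column.
    eta = np.column_stack([fista(H, T_train[:, c], lam, n_iter=n_iter)
                           for c in range(T_train.shape[1])])
    return eta, params

def predict_elm(X, eta, params):
    W, b = params
    H = sigmoid(X @ W + b)
    return np.argmax(H @ eta, axis=1)   # predicted class index per sample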
In the experiments, we aim to classify three data sets from https://archive.ics.uci.edu (accessed on 15 November 2021):
Iris data set [30]. Each sample in this data set has four attributes, and the set contains three classes with 50 samples for each class.
Heart disease data set [31]. This data set contains 303 samples, each of which has 13 attributes. In this data set, we classified two classes of data.
Wine data set [32]. In this data set, we classified three classes of 178 samples. Each sample contains 13 attributes.
In all experiments, we used the sigmoid as the activation function and set the number of hidden nodes to M = 30. We calculate the accuracy of the output data by:

accuracy = (correctly predicted data / all data) × 100.
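A small hedged sketch of this accuracy computation (the array names are assumptions):

import numpy as np

def accuracy(y_true, y_pred):
    # accuracy = correctly predicted data / all data * 100
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))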
We chose the control parameters for each algorithm as shown in Table 1. In our experiments, the inertial parameters β_n for Algorithm 7 were chosen as follows:

In the first experiment, we chose the regularization parameter λ = 0.1 for all algorithms and data sets. Then, we used 10-fold cross-validation and utilized Average ACC and ERR% to evaluate the performance of each algorithm.
Average ACC = (1/N) Σ_{i=1}^{N} (x_i / y_i) × 100%,

where N is the number of folds (N = 10), x_i is the number of data correctly predicted at fold i, and y_i is the number of all data at fold i.
Let err_Lsum = the sum of errors in all 10 training sets, err_Tsum = the sum of errors in all 10 testing sets, Lsum = the sum of all data in all 10 training sets, and Tsum = the sum of all data in all 10 testing sets. Then, ERR% = (err_L% + err_T%)/2, where err_L% = (err_Lsum / Lsum) × 100% and err_T% = (err_Tsum / Tsum) × 100%. With these evaluation tools, we obtained the results for each data set as seen in Tables 2-4. As seen in Tables 2-4, with the same regularization parameter λ = 0.1, Algorithms 7 and 8 perform better than Algorithms 3 and 5 in terms of accuracy, while the computation times are relatively close among the four algorithms.
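A hedged sketch of these cross-validation metrics, given per-fold counts of correct predictions, errors, and fold sizes (all names are assumptions for illustration):

import numpy as np

def average_acc(correct_per_fold, size_per_fold):
    # Average ACC = (1/N) * sum_i (x_i / y_i) * 100
    x = np.asarray(correct_per_fold, dtype=float)
    y = np.asarray(size_per_fold, dtype=float)
    return np.mean(x / y) * 100.0

def err_percent(train_errors, train_sizes, test_errors, test_sizes):
    # ERR% = (err_L% + err_T%) / 2, where each term is summed errors / summed data * 100
    err_L = 100.0 * np.sum(train_errors) / np.sum(train_sizes)
    err_T = 100.0 * np.sum(test_errors) / np.sum(test_sizes)
    return (err_L + err_T) / 2.0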
In the second experiment, the regularization parameter λ for each algorithm and data set was chosen using 10-fold cross-validation. We compared the error of each model and data set for various λ and then chose the λ that gives the lowest error (ERR%) for the particular model and data set. Hence, the parameter λ varies depending on the algorithm and data set. The chosen values of λ can be seen in Table 5. With the chosen λ, we evaluated the performance of each algorithm using 10-fold cross-validation and the same evaluation tools as in the first experiment. The results can be seen in Tables 6-8. With the regularization parameters λ chosen as in Table 5, we see that the ERR% of each algorithm in Tables 6-8 is lower than that in Tables 2-4. We can also see that Algorithms 7 and 8 perform better than Algorithms 3 and 5 in terms of accuracy in all experiments conducted.
In Figure 3, we show the graph of ERR% for each algorithm in the second experiment. As we can see, Algorithms 7 and 8 have lower ERR%, which means they perform better than Algorithms 3 and 5. From Tables 6-8, we notice that the computational time of Algorithms 7 and 8 is about 30% longer than that of Algorithm 3 at the same number of iterations. However, from Figure 3, we see that at the 120th iteration, both Algorithms 7 and 8 already have a lower ERR% than Algorithm 3 at the 200th iteration. Therefore, the time needed for Algorithms 7 and 8 to achieve the same or higher accuracy than Algorithm 3 is actually lower, because 120 iterations can be computed much faster than 200 iterations.

Conclusions
We introduced a new line search technique and employed it to construct new algorithms, namely Algorithms 7 and 8. Algorithm 7 additionally utilizes an inertial step to accelerate its convergence behavior. Both algorithms converge weakly to a solution of (1) without the Lipschitz assumption on ∇f. The complexity of Algorithm 8 was also analyzed. We then applied the proposed algorithms to the classification of the Iris, Heart disease, and Wine data sets, and their performance was evaluated and compared with other line search algorithms, namely Algorithms 3 and 5. We observed from our experiments that Algorithm 7 achieved the highest accuracy on all data sets under the same number of iterations. Moreover, Algorithm 8, which is not an inertial algorithm, also performed better than Algorithms 3 and 5. Furthermore, from Figure 3, we see that the proposed algorithms, at a lower number of iterations, were already more accurate than the other algorithms at a higher number of iterations.
Based on the experiments on various data sets, we conclude that the proposed algorithms perform better than the previously established algorithms. Therefore, in future work, we would like to apply the proposed algorithms to predict and classify data of patients with non-communicable diseases (NCDs) collected from the Sriphat Medical Center, Faculty of Medicine, Chiang Mai University, Thailand. We aim to develop an innovation for screening and preventing non-communicable diseases, which will be used in hospitals in Chiang Mai, Thailand.

Figure 3. ERR% of each algorithm and data set in the second experiment.

Table 1. Chosen parameters of each algorithm.

Table 2. The performance of each algorithm in the first experiment at the 200th iteration with 10-fold cross-validation on the Iris data set.

Table 3. The performance of each algorithm in the first experiment at the 200th iteration with 10-fold cross-validation on the Heart disease data set.

Table 4. The performance of each algorithm in the first experiment at the 200th iteration with 10-fold cross-validation on the Wine data set.

Table 5. Chosen λ of each algorithm.

Table 6. The performance of each algorithm in the second experiment at the 200th iteration with 10-fold cross-validation on the Iris data set.

Table 7. The performance of each algorithm in the second experiment at the 200th iteration with 10-fold cross-validation on the Heart disease data set.

Table 8. The performance of each algorithm in the second experiment at the 200th iteration with 10-fold cross-validation on the Wine data set.