A New Forward–Backward Algorithm with Line Search and Inertial Techniques for Convex Minimization Problems with Applications

Abstract: For the past few decades, various algorithms have been proposed to solve convex minimization problems in the form of the sum of two lower semicontinuous and convex functions. The convergence of these algorithms was guaranteed under the L-Lipschitz condition on the gradient of the objective function. In recent years, the inertial technique has been widely used to accelerate the convergence behavior of such algorithms. In this work, we introduce a new forward–backward splitting algorithm using a new line search and inertial technique to solve convex minimization problems in the form of the sum of two lower semicontinuous and convex functions. Weak convergence of our proposed method is established without assuming L-Lipschitz continuity of the gradient of the objective function. Moreover, a complexity theorem is also given. As applications, we employed our algorithm to solve data classification and image restoration problems by conducting experiments on them. The performance of our algorithm was evaluated using various evaluation tools, and compared with that of other algorithms. Based on the experiments, we found that the proposed algorithm performed better than the other algorithms mentioned in the literature.


Introduction
The convex minimization problem has been studied intensively for the past few decades due to its wide range of applications in various real-world problems. Some major problems in physics, economics, data science, engineering, and medical science can be viewed as convex minimization problems; for instance, a reaction-diffusion equation, which is a mathematical model describing physical phenomena such as chemical reactions, heat flow models, and population dynamics. The problem of finding an unknown reaction term of such an equation can be formulated as a coefficient inverse problem (CIP) for a partial differential equation (PDE). Numerical methods for solving CIPs for PDEs, as well as their applications to various subjects, have been widely studied and analyzed; for more comprehensive information on this topic, see [1][2][3][4][5]. To obtain a globally convergent method for solving CIPs for PDEs, many authors have employed the convexification technique, which reformulates CIPs as convex minimization problems; for a more in-depth development and discussion, we refer to [6][7][8]. A numerical method for convex minimization problems can therefore be applied. Further examples of convex minimization problems arise in signal and image processing, compressed sensing, and machine learning tasks such as regression and classification; see [9][10][11][12][13][14][15][16] and the references therein for more information.
The problem is formulated, in the form of the summation of two convex functions, as follows:

min_{x∈H} f(x) + g(x), (1)

where f, g : H → R ∪ {+∞} are proper, lower semicontinuous convex functions and H is a Hilbert space. If f is differentiable, then x solves (1) if and only if:

x = prox_{αg}(I − α∇f)(x), (2)

where α > 0, prox_{αg}(x) = J_α^{∂g}(x) = (I + α∂g)^{−1}(x), I is the identity mapping, and ∂g is the subdifferential of g. In other words, x is a fixed point of prox_{αg}(I − α∇f). Under some conditions, the Picard iteration of this mapping converges to the solution. As a result, the forward–backward algorithm [17], which is defined as follows:

x_{n+1} = prox_{α_n g}(I − α_n ∇f)(x_n), for all n ∈ N, (3)

where α_n is a suitable step size, converges to a solution of (1). Due to its simplicity, this method has been intensively studied and improved by many researchers; see [10,16,18] and the references therein for more information. One well-known method, which notably improves the convergence rate of (3), is the fast iterative shrinkage-thresholding algorithm (FISTA) introduced by Beck and Teboulle [19]. To the best of our knowledge, most of these works assumed that ∇f is Lipschitz continuous. However, in general, such an assumption might be too strong, and finding a Lipschitz constant of ∇f is sometimes difficult. To relax this strong assumption, Cruz and Nghia [20] proposed a line search technique and replaced the Lipschitz continuity assumption on ∇f with the following weaker assumptions: Assumption 1. f, g are proper lower semicontinuous convex functions with dom g ⊆ dom f; Assumption 2. f is differentiable on an open set containing dom g, and ∇f is uniformly continuous on any bounded subset of dom g and maps any bounded subset of dom g to a bounded set in H.
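For concreteness, iteration (3) can be sketched for the common special case f(x) = ‖Ax − b‖₂², g(x) = λ‖x‖₁, where prox_{αg} reduces to the soft-thresholding operator. The fixed step size and the problem data below are illustrative assumptions for this sketch, not part of the algorithms studied in this paper.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau*||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def forward_backward(A, b, lam, alpha, n_iter=500):
    """Iteration (3) for f(x) = ||Ax - b||_2^2, g(x) = lam*||x||_1.
    alpha is a fixed step size (e.g. alpha < 1/L with L = 2*||A||_2^2)."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)                      # forward (gradient) step
        x = soft_threshold(x - alpha * grad, alpha * lam)   # backward (proximal) step
    return x
```

Each iteration costs two matrix-vector products plus a componentwise shrinkage, which is what makes the forward–backward scheme attractive for large problems.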
Note that if ∇f is L-Lipschitz continuous, then Assumption 2 holds. They also established an algorithm using a line search technique and obtained a weak convergence result under Assumptions 1 and 2. In the same work, they also proposed an accelerated algorithm based on FISTA and a line search technique, and showed that this accelerated algorithm performed better than the other introduced algorithm.
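The backtracking idea behind such line searches can be sketched as follows: starting from a trial step α, shrink it by a factor θ until an acceptance test of the type α‖∇f(y) − ∇f(x)‖ ≤ δ‖y − x‖ holds, which requires no Lipschitz constant of ∇f. This is a simplified sketch in the spirit of [20], not the exact line search of [20] or [9]; all parameter values are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fb_step_with_linesearch(x, grad_f, prox_g, alpha0=1.0, theta=0.5,
                            delta=0.3, max_backtracks=50):
    """One forward-backward step whose step size is found by backtracking:
    accept alpha once alpha*||grad_f(y) - grad_f(x)|| <= delta*||y - x||."""
    alpha = alpha0
    for _ in range(max_backtracks):
        y = prox_g(x - alpha * grad_f(x), alpha)
        if alpha * np.linalg.norm(grad_f(y) - grad_f(x)) <= delta * np.linalg.norm(y - x):
            return y, alpha
        alpha *= theta   # shrink the trial step and try again
    return y, alpha
```

Note that the accepted step adapts to the local behavior of ∇f: where the gradient changes slowly, larger steps pass the test, and no global constant L ever enters the computation.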
Recently, inspired by [20], Kankam et al. [9] proposed a new line search technique and a new algorithm to solve (1). They conducted some experiments on signal processing and found that their method performed better than that of Cruz and Nghia [20].
In recent years, many authors have utilized the inertial technique in order to accelerate their algorithms. This was first introduced by Polyak [21] to solve smooth convex minimization problems. After that, many inertial-type algorithms were proposed and studied by many authors; see [22][23][24][25] and the references therein.
The algorithms introduced in [9,20] are easy to employ, since a Lipschitz constant of the gradient of the objective function f is not required to exist. Although the convergence of these algorithms is guaranteed, some improvements are still welcome, specifically utilizing an inertial technique to improve their performance. Hence, in this work, motivated by the works mentioned above, our main objective was to propose a new algorithm that utilizes both line search and inertial techniques, which not only guarantees convergence to a solution of (1) without assuming L-Lipschitz continuity of ∇f, but also improves performance by means of an inertial technique. We established a weak convergence theorem under Assumptions 1 and 2. Moreover, a complexity theorem for this new algorithm was studied. Furthermore, we employed our proposed algorithm to solve an image restoration problem, as well as data classification problems. Then, we evaluated its performance and compared it with various other algorithms. The proposed method is also of great interest for solving nonlinear coefficient inverse problems for partial differential equations, along with other possible applications of convex minimization problems.
This work is organized as follows: In Section 2, we recall important definitions and lemmas that will be used in later sections, as well as various methods introduced in [9,19,20,22]. In Section 3, we introduce a new algorithm using new line search and inertial techniques and establish its weak convergence to a solution of (1). Moreover, the complexity of this method is also analyzed. In Section 4, in order to compare the performance of the studied algorithms, some numerical experiments on data classification problems and an image restoration problem are conducted and discussed. In the last section, a brief conclusion is presented.

Preliminaries
Throughout this work, we denote by x_n → x and x_n ⇀ x that {x_n} converges strongly and weakly to x, respectively. For h : H → R ∪ {+∞}, we also denote dom h := {x ∈ H : h(x) < +∞}.
First, we recall some methods for solving (1) introduced by many authors mentioned in Section 1.
In 2016, Cruz and Nghia [20] introduced a line search step as follows (Algorithm 2).
They showed that a sequence generated by Algorithm 3 converges weakly to a solution of (1) under Assumptions 1 and 2.
Recently, Kankam et al. [9] proposed a new line search technique as follows (Algorithm 4).
They established a weak convergence theorem for Algorithm 5 under Assumptions 1 and 2. As we can see, Algorithms 3 and 5 do not utilize an inertial step.
They showed that Algorithm 6 has better complexity than Algorithm 3. However, its convergence to a solution of (1) is not guaranteed under the inertial parameter β_n = (t_n − 1)/t_{n+1}. In 2019, Attouch and Cabot [22] analyzed the convergence rate of a method called the relaxed inertial proximal algorithm (RIPA) for solving monotone inclusion problems, which are closely related to convex minimization problems. This method utilizes an inertial step x_n + β_n(x_n − x_{n−1}) to improve its performance. It was defined as follows (Algorithm 7).
where A is a maximal monotone operator and J_{μ_n}^A(y_n) = (I + μ_n A)^{−1}(y_n).
Under mild restrictions of the control parameters, they showed that Algorithm 7 (RIPA) gives a fast convergence rate.
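The inertial step x_n + β_n(x_n − x_{n−1}) can be grafted onto a forward–backward iteration as in the following sketch. The fixed step size, the summable choice β_n = 1/n², and the ℓ1-regularized least-squares test problem are illustrative assumptions for this sketch, not the exact parameters or setting of RIPA.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def inertial_forward_backward(grad_f, prox_g, x0, alpha, beta, n_iter=500):
    """Forward-backward iteration with the inertial (extrapolation) step
    y_n = x_n + beta(n) * (x_n - x_{n-1}) applied before the FB update."""
    x_prev, x = x0.copy(), x0.copy()
    for n in range(1, n_iter + 1):
        y = x + beta(n) * (x - x_prev)                        # inertial step
        x_prev, x = x, prox_g(y - alpha * grad_f(y), alpha)   # FB step at y
    return x
```

The extrapolated point y_n reuses the direction of the previous displacement, which is what produces the acceleration effect; the choice of {β_n} governs whether convergence is preserved.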
Next, we recall some important tools that will be used in the later sections.
Definition 1. For x ∈ H, the subdifferential of h at x is defined as follows:

∂h(x) := {u ∈ H : h(y) ≥ h(x) + ⟨u, y − x⟩, for all y ∈ H}.

The following can be found in [26].
The graph of ∂h, Gph(∂h) := {(u, v) ∈ H × H : v ∈ ∂h(u)}, is demiclosed; that is, for any sequence (u_n, v_n) ∈ Gph(∂h) such that {u_n} converges weakly to u and {v_n} converges strongly to v, we have (u, v) ∈ Gph(∂h).
The proximal operator, prox_g : H → dom g with prox_g(x) = (I + ∂g)^{−1}(x), is single-valued with full domain, and the following holds:

x − prox_g(x) ∈ ∂g(prox_g(x)), for all x ∈ H.

The following lemmas are crucial for the main results.
Then, the following holds:

Lemma 5 ([28]). Let H be a Hilbert space and {x_n} a sequence in H such that there exists a nonempty subset Ω of H satisfying the following: (i) for every x ∈ Ω, lim_{n→∞} ‖x_n − x‖ exists; (ii) every weak-cluster point of {x_n} belongs to Ω. Then, {x_n} converges weakly to an element in Ω.

Main Results
In this section, we assume that a solution of (1) exists and denote by S* the set of all such solutions. We modify Line Search 2 as follows (Algorithm 8).
Next, we establish our first theorem.
Then, a sequence {x n } generated by Algorithm 9 converges weakly to a point in S * .
Proof. For any x ∈ dom g and n ∈ N, we claim that: To prove our claim, we know, from (4) and the definition of ∂g, that: Then, From the definition of γ_n and the above inequalities, we have, for all x ∈ dom g and n ∈ N, Moreover, the following also hold, for all n ∈ N, As a result, we obtain: for all x ∈ dom g and n ∈ N. Therefore, for all x ∈ dom g and n ∈ N. Furthermore, putting x = x* ∈ S*, we obtain: Next, we show that lim_{n→∞} ‖x_n − x*‖ exists. From (6), we have: This implies by Lemma 3 and C2 that {x_n} is bounded.
By (7) and Lemma 4, we obtain that lim n→+∞ x n − x * exists. By the definitions of z n−1 and x n , we see x n ∈ dom g, for all n ∈ N. As a result, we have: x n − y n ≤ x n − x n , for all n ∈ N, so, lim n→+∞ x n − y n = 0.
Thus, we obtain from (7) that lim_{n→+∞} ‖x_n − x*‖ = lim_{n→+∞} ‖y_n − x*‖. Moreover, it follows from (6) that lim_{n→+∞} ‖z_n − y_n‖ = 0, and hence, lim_{n→+∞} ‖z_n − x_n‖ = 0. Now, we prove that every weak-cluster point of {x_n} belongs to S*. To accomplish this, we first let w be a weak-cluster point of {x_n}. Then, there exists a subsequence {x_{n_k}} of {x_n} such that x_{n_k} ⇀ w, and hence, z_{n_k} ⇀ w. Next, we prove that w belongs to S*. Since ∇f is uniformly continuous, we have lim_{k→+∞} ‖∇f(z_{n_k}) − ∇f(y_{n_k})‖ = 0.
From (4), we obtain:

(y_{n_k} − z_{n_k})/γ_{n_k} − ∇f(y_{n_k}) + ∇f(z_{n_k}) ∈ ∂g(z_{n_k}) + ∇f(z_{n_k}) = ∂(f + g)(z_{n_k}), for all k ∈ N.
We derive from Lemma 5 that {x_n} converges weakly to w* ∈ S*. Therefore, {x_n} converges weakly to a solution of (1), and the proof is complete.
In the next theorem, we provide the complexity theorem of Algorithm 9. First, we introduce the control sequence {t_n} defined in [22] by:

t_n := 1 + Σ_{i=n}^{+∞} Π_{j=n}^{i} β_j, for all n ∈ N. (8)

This sequence is well defined if the following assumption holds: Σ_{i=n}^{+∞} Π_{j=n}^{i} β_j < +∞, for all n ∈ N. Hence, from (8), we can see that: β_n t_{n+1} = t_n − 1, for all n ∈ N.
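If, as in [22], the sequence is t_n = 1 + Σ_{i≥n} Π_{j=n}^{i} β_j (our reading of (8); the original display is garbled in our copy), the identity β_n t_{n+1} = t_n − 1 can be checked numerically. The choice β_j = 2^{−j} below is an illustrative summable sequence, and the infinite sum is truncated.

```python
def t_sequence(beta, n, tail=200):
    """t_n = 1 + sum_{i>=n} prod_{j=n}^{i} beta_j, truncated after `tail` terms.
    Well defined whenever the series converges (e.g. summable beta)."""
    total, prod = 0.0, 1.0
    for i in range(n, n + tail):
        prod *= beta(i)      # running product prod_{j=n}^{i} beta_j
        total += prod
    return 1.0 + total

beta = lambda j: 0.5 ** j    # illustrative summable inertial parameters
```

The identity follows by shifting the index of the inner sum, and the truncation error is of the order of the first omitted product, which is negligible for rapidly decaying β_j.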
Next, we establish the following theorem.
Theorem 2. Given x 0 = x 1 ∈ dom g and letting {x n } be a sequence generated by Algorithm 9, assume that all assumptions in Theorem 1 are satisfied. Furthermore, suppose that the following conditions hold, for all n ∈ N: Then, , for all n ∈ N. In other words, ), for all n ∈ N.
Proof. For any x ∈ dom g, we know that: for all n ∈ N. Since x ∈ dom g, we obtain ‖x_n − x‖ ≥ ‖y_n − x‖. Thus, we conclude that: Let x* be an element in S*. We know that x_n, x ∈ dom g and t_{n+1} ≥ 1, for all n ∈ N. From D1 and D2, we know that t_n² ≥ 2t_{n+1}² − 2t_{n+1} and γ_n ≥ γ_{n+1}, so: , for all n ∈ N.
Moreover, we also have: for all n ∈ N. By (11)-(13), we obtain: for all n ∈ N. It follows that: for all n ∈ N. Moreover, we can inductively prove, from (14), that: for all n ∈ N. From C1, we know that γ n+1 ≥ γ. Therefore, we obtain, for all n ∈ N, . Since x * is arbitrarily chosen from S * , we have: , for all n ∈ N, and the proof is complete.

Applications to Data Classification and Image Restoration Problems
In this section, the proposed algorithm is used to solve classification and image restoration problems. The performance of Algorithm 9 is evaluated and compared with Algorithms 3, 5, and 6.

Data Classification
Data classification is a major branch of machine learning, an application of artificial intelligence (AI) in which systems learn and improve from experience without being explicitly programmed. In this work, we focused on one particular learning technique called the extreme learning machine (ELM), introduced by Huang et al. [29]. It is defined as follows.
Let S := {(x_k, t_k) : x_k ∈ R^n, t_k ∈ R^m, k = 1, 2, ..., N} be a training set of N samples, where x_k is an input and t_k is a target. The output of ELM with M hidden nodes and activation function G is defined by:

o_k = Σ_{i=1}^{M} η_i G(⟨w_i, x_k⟩ + b_i),

where w_i is the weight vector connecting the i-th hidden node and the input nodes, η_i is the weight vector connecting the i-th hidden node and the output nodes, and b_i is the bias of the i-th hidden node. The hidden layer output matrix H is the N × M matrix whose (k, i) entry is G(⟨w_i, x_k⟩ + b_i). The main goal of ELM is to find an optimal weight η = [η_1^T, ..., η_M^T]^T such that Hη = T, where T = [t_1^T, ..., t_N^T]^T is the target matrix. If the Moore–Penrose generalized inverse H† of H exists, then η = H†T is the desired solution. However, in general, H† may not exist or may be challenging to find. Hence, to avoid such difficulties, we applied the concept of convex minimization to find η without relying on H†.
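A minimal sketch of building the ELM hidden-layer output matrix H with the sigmoid activation, under the standard ELM convention (our assumption here) that H[k, i] = G(⟨w_i, x_k⟩ + b_i) with randomly drawn weights w_i and biases b_i; all names and shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_hidden_matrix(X, M, rng=None):
    """Build the N x M hidden-layer output matrix H for ELM:
    H[k, i] = sigmoid(<w_i, x_k> + b_i) with random weights and biases."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, M))   # w_i stored as columns
    b = rng.standard_normal(M)                 # biases b_i
    return sigmoid(X @ W + b), W, b
```

Because W and b are drawn once at random and never trained, fitting an ELM reduces to the linear problem Hη = T, which is exactly what the LASSO formulation below regularizes.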
To prevent overfitting, we used the least absolute shrinkage and selection operator (LASSO) [30], formulated as follows:

min_η ‖Hη − T‖_2² + λ‖η‖_1,

where λ is a regularization parameter. In the setting of convex minimization (1), we set f(x) = ‖Hx − T‖_2² and g(x) = λ‖x‖_1. In the experiment, we aimed to classify the following three datasets from https://archive.ics.uci.edu (accessed on 1 May 2021).
Iris dataset [31]: Each sample in this dataset has four attributes, and the set contains three classes with fifty samples for each type.
Heart disease dataset [32]: This dataset contains 303 samples, each of which has 13 attributes. In this dataset, we classified two classes of data.
Wine dataset [33]: In this dataset, we classified three classes of 178 samples. Each sample has 13 attributes.
In all experiments, we used the sigmoid as the activation function with the number of hidden nodes M = 30. The accuracy of the output is calculated by: accuracy = (correctly predicted data / all data) × 100.
We also utilized 10-fold cross-validation to evaluate the performance of each algorithm and used the average accuracy as the evaluation tool. It is defined as follows:

average accuracy = (1/N) Σ_{i=1}^{N} (x_i / y_i) × 100,

where N is the number of folds considered during cross-validation (N = 10), x_i is the number of correctly predicted data at fold i, and y_i is the number of all data at fold i. We used 10-fold cross-validation to split the data into training sets and testing sets; more information can be seen in Table 1. All parameters of Algorithms 3, 5, 6 and 9 were chosen as in Table 2. The inertial parameters β_n of Algorithm 9 may vary depending on the dataset, since some β_n work well on specific datasets. We used the following two choices of β_n in our experiments. The regularization parameter λ for each dataset and algorithm was chosen to prevent overfitting, i.e., the situation in which a model obtained from the algorithm achieves high accuracy on the training set but comparatively low accuracy on the testing set, so it cannot be used to predict unknown data. It is known that when λ is too large, the model tends to underfit, i.e., it has low accuracy even on the training set and cannot be used to predict future data. On the other hand, if λ is too small, then it may not be enough to prevent the model from overfitting. In our experiment, for each algorithm, we chose a set of candidate λ satisfying |Acc_train − Acc_test| < 2%, where Acc_train and Acc_test are the average accuracy on the training set and testing set, respectively. Under this criterion, we can prevent the studied models from overfitting. Then, from these candidates, we chose the λ that yields high Acc_test for each algorithm. Therefore, the models obtained from Algorithms 3, 5, 6 and 9 can be effectively used to predict unknown data.
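The 10-fold evaluation described above can be sketched as follows; the classifier is abstracted as a fit/predict pair, and the average accuracy is taken, per the definition above, as the mean over folds of (x_i / y_i) × 100. The fold-splitting details are illustrative assumptions.

```python
import numpy as np

def kfold_average_accuracy(X, y, fit, predict, n_folds=10, seed=0):
    """Average accuracy over n_folds folds: mean over folds of
    (#correct in fold / fold size) * 100."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train], y[train])                      # train on 9 folds
        correct = np.sum(predict(model, X[test]) == y[test]) # test on the held-out fold
        accs.append(100.0 * correct / len(test))
    return float(np.mean(accs))
```

Averaging over held-out folds gives a less optimistic estimate of generalization than a single train/test split, which is why it is used here to compare the algorithms.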
By this process, the regularization parameters λ for Algorithms 5, 6, and 9 were as in Table 3. We assessed the performance of each algorithm at the 300th iteration with the average accuracy. The results can be seen in Table 4. Table 4. Average accuracy of each algorithm at the 300th iteration with 10-fold cross-validation.

As we see from Table 4, with this choice of λ, all models obtained from Algorithms 3, 5, 6 and 9 had reasonably high average accuracy on both the training and testing sets for all datasets. Moreover, we observed that the model from Algorithm 9 performed better than the models from the other algorithms in terms of accuracy in all experiments conducted.

Image Restoration
We first recall that an image restoration problem can be formulated as the following simple mathematical model:

b = Ax + w, (16)

where x ∈ R^{n×1} is the original image, A ∈ R^{m×n} is a blurring matrix, b is the observed image, and w is noise. The main objective of image restoration is to recover x from the given observed image b, blurring matrix A, and noise w.
In order to solve (16), one can implement LASSO [30] and reformulate the problem in the following form:

min_x ‖Ax − b‖_2² + λ‖x‖_1,

where λ is a regularization parameter. Hence, it can be viewed as a convex minimization problem, and Algorithms 3, 5, 6 and 9 can therefore be used to solve the image restoration problem.
In our experiment, we used a 256 × 256 color image as the original image. We applied a Gaussian blur of size 9 × 9 with standard deviation four to the original image to obtain the blurred image. In order to assess the performance of each algorithm, we used the peak signal-to-noise ratio (PSNR) [34], defined by: PSNR(x_n) = 10 log_10(255²/MSE).
For the original image x and a deblurred image x_n, the mean squared error (MSE) is calculated by MSE = (1/M)‖x_n − x‖², where M is the number of pixels of x. We also note that an algorithm with a higher PSNR performs better than one with a lower PSNR.
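The PSNR/MSE evaluation above can be computed directly; this sketch assumes 8-bit images with peak value 255, as in the formula above.

```python
import numpy as np

def psnr(x_restored, x_original, peak=255.0):
    """Peak signal-to-noise ratio: 10*log10(peak^2 / MSE), where
    MSE = (1/M) * ||x_restored - x_original||^2 over all M pixels."""
    diff = np.asarray(x_restored, dtype=float) - np.asarray(x_original, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Since PSNR is a decreasing function of the MSE, comparing algorithms by PSNR at a fixed iteration is equivalent to comparing their reconstruction errors.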
The control parameters of each algorithm were chosen as δ = θ = σ = 0.1. As the inertial parameter β_n of Algorithm 9, we used the following:
As for the regularization parameter λ, we experimented with λ varying from zero to one for each algorithm. In Figure 1, we show the PSNR of Algorithms 3 and 5 with respect to λ at the 200th iteration; in Figure 2, we show the PSNR of Algorithms 6 and 9 with respect to λ at the 200th iteration. We observe from Figures 1 and 2 that the PSNRs of Algorithms 3, 5, 6 and 9 increased as λ became smaller. Based on this, for the next two experiments, we chose a small λ to obtain a high PSNR for all algorithms. Next, we observed the PSNR of each algorithm when λ was small (λ < 10^{−4}). In Figure 3, we show the PSNR of each algorithm with respect to λ < 10^{−4} at the 200th iteration. We see from Figure 3 that Algorithm 9 offered a higher PSNR than the other algorithms. In the next experiment, we chose λ = 5 × 10^{−5} for each algorithm and evaluated the performance of each algorithm at the 200th iteration; see Table 5 for the results.
In Figure 4, we show the PSNR of each algorithm at each iteration. In Figure 5, we show the original test image, the blurred image, and the deblurred images obtained from Algorithms 3, 5, 6 and 9. As we see from Table 5 and Figure 4, Algorithm 9 achieved the highest PSNR.

Conclusions
We introduced a new line search technique inspired by [9,19] and used it along with an inertial step to construct a new accelerated algorithm for solving convex minimization problems in the form of the sum of two lower semicontinuous convex functions. We proved its weak convergence to a solution of (1), which does not require L-Lipschitz continuity of ∇f, as well as its complexity theorem. We note that many forward–backward-type algorithms require ∇f to be L-Lipschitz continuous in order to obtain a convergence theorem; see [10,16,18] for examples. Our proposed algorithm is easy to employ, since a Lipschitz constant of ∇f does not need to be calculated. Moreover, we also utilized an inertial technique to improve the convergence behavior of our proposed algorithm. In order to show that our algorithm performs better than other line search algorithms mentioned in the literature, we conducted experiments on classification and image restoration problems. In these experiments, we evaluated the performance of each algorithm after selecting suitable parameters, especially the regularization parameter. The results showed that our proposed algorithm performed better than the other algorithms on both data classification and image restoration problems. Moreover, we observed from [19,20] that the inertial parameter β_n = (t_n − 1)/t_{n+1} of FISTA and Algorithm 6 satisfies sup_n β_n = 1, while our algorithm requires the stricter condition Σ_{n=1}^{+∞} β_n < +∞ to ensure its convergence to a solution of (1), which is a limitation of our algorithm. We also note that the convergence of FISTA and Algorithm 6 cannot be obtained under the condition sup_n β_n = 1. Therefore, it would be very interesting to find a weaker condition on β_n that still ensures the convergence of the algorithm to a solution of (1).
Author Contributions: Writing-original draft preparation, P.S.; software and editing, D.C.; supervision, S.S. All authors read and agreed to the published version of the manuscript.