A New Machine Learning Algorithm Based on Optimization Method for Regression and Classification Problems

Abstract: A convex minimization problem in the form of the sum of two proper lower-semicontinuous convex functions has received much attention from the optimization community due to its broad applications in many disciplines, such as machine learning, regression and classification problems, image and signal processing, compressed sensing and optimal control. Many methods have been proposed to solve such problems, but most of them rely on a Lipschitz continuity assumption on the derivative of one of the two functions. In this work, we introduce a new accelerated algorithm for solving the mentioned convex minimization problem by using a linesearch technique together with a viscosity inertial forward–backward algorithm (VIFBA). A strong convergence result for the proposed method is obtained under some control conditions. As applications, we apply our proposed method to solve regression and classification problems by using an extreme learning machine model. Moreover, we show that our proposed algorithm is more efficient and has better convergence behavior than some algorithms mentioned in the literature.


Introduction
In this work, we deal with the convex minimization problem

min_{x ∈ H} f(x) + g(x),   (1)

where f, g : H → R ∪ {+∞} are proper, lower-semicontinuous convex functions and H is a Hilbert space. Many real world problems, such as signal processing, image reconstruction and compressed sensing, can be described using this model [1][2][3][4]. Moreover, data classification can also be formulated as (1); for more information about the importance and development of data classification and its methods see [5][6][7][8]. Therefore, a convex minimization problem has a wide range of applications, some of which will be studied in this research. If f is differentiable, then it is well known that an element x ∈ H is a solution of (1) if and only if

x = prox_{αg}(x − α∇f(x)),   (2)

where α > 0, prox_{αg}(x) = J_α^{∂g}(x) = (I + α∂g)^{−1}(x), I is the identity mapping, and ∂g is the subdifferential of g. In addition, if ∇f is L-Lipschitz continuous, then the classical forward–backward algorithm [9] can be used to solve (1). It is defined by

x_{n+1} = prox_{α_n g}(x_n − α_n ∇f(x_n)),   (3)

where α_n is a suitable stepsize. This method has been used extensively due to its simplicity; as a result, it has been improved by many works, as seen in [2,[10][11][12]. One well-known method that has improved the convergence rate of (3) significantly is the fast iterative shrinkage-thresholding algorithm, or FISTA. It was proposed by Beck and Teboulle [13], as seen in Algorithm 1.
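To make iteration (3) concrete, the following is a minimal Python sketch of the forward–backward step together with the standard FISTA acceleration of [13], assuming for illustration that g(x) = λ‖x‖₁ (so that prox_{αg} reduces to componentwise soft-thresholding) and that ∇f is available as a function; it is a sketch of the generic updates rather than a reproduction of Algorithm 1.

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t * ||.||_1: componentwise shrinkage toward zero
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def forward_backward(grad_f, lam, x0, alpha, n_iter=500):
    """Classical forward-backward iteration (3) for min f(x) + lam * ||x||_1.

    grad_f : callable returning the gradient of the smooth part f
    alpha  : fixed stepsize (a suitable choice is alpha < 2/L when grad_f is L-Lipschitz)
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        # forward (gradient) step on f, then backward (proximal) step on g
        x = soft_threshold(x - alpha * grad_f(x), alpha * lam)
    return x

def fista(grad_f, lam, x0, L, n_iter=500):
    """Standard FISTA updates: a forward-backward step taken at an extrapolated point y."""
    x_prev = np.asarray(x0, dtype=float).copy()
    y, t = x_prev.copy(), 1.0
    for _ in range(n_iter):
        x = soft_threshold(y - grad_f(y) / L, lam / L)     # forward-backward step at y with stepsize 1/L
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum parameter update
        y = x + ((t - 1.0) / t_next) * (x - x_prev)        # inertial extrapolation
        x_prev, t = x, t_next
    return x_prev
```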
Beck and Teboulle proved that FISTA has a better convergence rate than (3); however, a convergence theorem for the method itself was not given. Recently, Liang and Schönlieb [14] modified FISTA by setting t_n = (p + √(q + r t_{n−1}²))/2, where p, q > 0 and 0 < r ≤ 4, and proved a weak convergence theorem for the resulting method. In the case that H is an infinite-dimensional Hilbert space, weak convergence results may not be enough; consequently, modifications of such algorithms are needed to obtain strong convergence results. There are several ways to modify the methods for this purpose; for more information see [15][16][17][18]. One method that caught our attention is the viscosity-based inertial forward–backward algorithm (VIFBA) proposed by Verma et al. [19], as seen in Algorithm 2.
They proved strong convergence of this algorithm provided that certain control conditions on the parameter sequences (including the conditions A2 and A3 referred to below) are satisfied for all n ∈ N. Note that all the methods mentioned above require ∇f to be L-Lipschitz continuous, an assumption that is quite difficult to verify in general. Therefore, some improvements are still desirable.
Very recently, Cruz and Nghia [20] proposed a linesearch technique which can be used to eliminate the L-Lipschitz continuity assumption on ∇f and replace it with weaker assumptions. In their work, the following conditions are needed instead: B1. f, g are proper lower-semicontinuous convex functions with dom g = dom f; B2. f is differentiable on an open set containing dom g, and ∇f is uniformly continuous on any bounded subset of dom g and maps any bounded subset of dom g to a bounded set in H.
The linesearch step, Linesearch 1, is defined in Algorithm 3 as follows.

(Algorithm 3: steps 1–5 follow the linesearch of [20]; step 6: return x_{n+1}.)
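As an illustration of this type of linesearch, the following Python sketch backtracks from σ by a factor θ ∈ (0, 1) until a test governed by δ holds at the forward–backward point. The acceptance condition shown here is assumed for illustration, in the spirit of [20], and may differ in detail from Linesearch 1 as used later.

```python
import numpy as np

def linesearch_stepsize(x, grad_f, prox_g, sigma, theta, delta):
    """Backtracking linesearch in the spirit of [20] (a sketch).

    Starting from gamma = sigma, shrink gamma by theta until the forward-backward
    point p = prox_{gamma*g}(x - gamma*grad_f(x)) satisfies
        gamma * ||grad_f(p) - grad_f(x)|| <= delta * ||p - x||.
    Returns the accepted stepsize gamma and the corresponding point p.
    """
    gamma = sigma
    gx = grad_f(x)
    while True:
        p = prox_g(x - gamma * gx, gamma)
        if gamma * np.linalg.norm(grad_f(p) - gx) <= delta * np.linalg.norm(p - x):
            return gamma, p
        gamma *= theta
```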
Cruz and Nghia also proved a weak convergence theorem for their method. Again, weak convergence may not be enough in the context of an infinite-dimensional space.
As we know, most of the work related to convex minimization problems assumes that ∇f is Lipschitz continuous. This restriction can be relaxed using a linesearch technique. We are therefore motivated to establish a novel accelerated algorithm for solving the convex minimization problem (1), which employs the linesearch technique introduced by Cruz and Nghia [20] together with VIFBA [19]. The novelty of our proposed method is a suitable combination of the two methods that yields a fast and efficient method for solving (1). We improve Algorithm 4 by adding an inertial step, which enhances the performance of the algorithm. We also prove a strong convergence theorem under weaker assumptions on the control conditions than those of VIFBA. More precisely, we can eliminate assumption A2 and replace A3 with a weaker assumption. As applications, we apply our main result to solve a data classification problem and the regression of a sine function. We then compare the performance of our algorithm with FISTA, VIFBA, and Algorithm 4.
This work is organized as follows: In Section 2, we recall some useful concepts related to the topic. In Section 3, we provide a new algorithm and prove its strong convergence to a solution of (1). In Section 4, we conduct some numerical experiments with a data classification problem and a regression of a sine function and compare the performance of each algorithm (FISTA, VIFBA, Algorithms 4 and 5). Finally, the conclusion of this work is given in Section 5.

Preliminaries
In this section, we review some important tools which will be used in the later sections. Throughout this paper, we denote by x_n → x and x_n ⇀ x the strong and weak convergence of {x_n} to x, respectively.
A mapping T : H → H is said to be L-Lipschitz continuous if there exists L > 0 such that ‖Tx − Ty‖ ≤ L‖x − y‖ for all x, y ∈ H.
For x ∈ H, the subdifferential of h at x is defined by ∂h(x) = {u ∈ H : h(y) ≥ h(x) + ⟨u, y − x⟩ for all y ∈ H}. It is known from [21] that the subdifferential ∂h is maximal monotone. Moreover, the graph of ∂h, Gph(∂h) := {(x, u) ∈ H × H : u ∈ ∂h(x)}, is demiclosed, i.e., for any sequence {(u_n, v_n)} in Gph(∂h) such that {u_n} converges weakly to u and {v_n} converges strongly to v, we have (u, v) ∈ Gph(∂h).
The proximal operator prox_g : H → dom g, with prox_g(x) = (I + ∂g)^{−1}(x), is single-valued with full domain. Moreover, the following is satisfied for all x ∈ H and α > 0: (x − prox_{αg}(x))/α ∈ ∂g(prox_{αg}(x)). The following lemmas are crucial for the main results.

Lemma 2 ([23]). Let H be a real Hilbert space. Then the following holds for all x, y ∈ H and α ∈ [0, 1]: ‖αx + (1 − α)y‖² = α‖x‖² + (1 − α)‖y‖² − α(1 − α)‖x − y‖².

Lemma 3 ([24]). Let {a_n} be a sequence of real numbers such that there exists a subsequence {a_{m_j}} of {a_n} with a_{m_j} < a_{m_j+1} for all j ∈ N. Then there exists a nondecreasing sequence {n_k} of N such that lim_{k→∞} n_k = ∞ and the following holds for all sufficiently large k ∈ N: a_{n_k} ≤ a_{n_k+1} and a_k ≤ a_{n_k+1}.

Lemma 4 ([25]). Let {a_n} be a sequence of nonnegative real numbers, {α_n} a sequence in (0, 1) with Σ_{n=1}^∞ α_n = ∞, and {b_n} a sequence of real numbers with lim sup_{n→∞} b_n ≤ 0. If a_{n+1} ≤ (1 − α_n)a_n + α_n b_n for all n ∈ N, then lim_{n→∞} a_n = 0.

Main Results
In this section, we assume the existence of a solution of (1) and denote by S* the set of all such solutions. It is known that S* is closed and convex. We propose a new algorithm by combining a linesearch technique (Linesearch 1) with VIFBA, as seen in Algorithm 5. A diagram of this algorithm can be seen in Figure 1.
(Algorithm 5: for n = 1 to k, compute y_n and z_n via the inertial and viscosity steps of VIFBA, obtain the stepsize γ_n from Linesearch 1, and set x_{n+1} = prox_{γ_n g}(z_n − γ_n ∇f(z_n)).) We prove a strong convergence result for Algorithm 5 in Theorem 1 as follows.
Theorem 1. Let H be a Hilbert space, g : H → R ∪ {+∞} a proper lower-semicontinuous convex function, and f : H → R ∪ {+∞} a proper convex differentiable function such that ∇f is uniformly continuous on any bounded subset of H. Suppose that the control conditions on the parameter sequences (in particular, condition C2 below) hold. Then the sequence {x_n} generated by Algorithm 5 converges strongly to x* = P_{S*}F(x*).
Proof. Since S* is closed and convex, the mapping P_{S*}F has a fixed point. Let x* = P_{S*}F(x*). By the definitions of x_n, y_n and z_n, we obtain the following, for all n ∈ N: for some The following also holds: Next, we prove the following. Indeed, from (4) and the definition of ∂g, we obtain Hence, by using the above inequalities and the definition of γ_n, we have Furthermore, it follows from x* ∈ S* that Next, we show that {x_n} is bounded. Indeed, from (7) and (9), we obtain Inductively, we have and hence {x_n} is bounded. Furthermore, by using (5) and (6), {y_n} and {z_n} are also bounded. To show the convergence of {x_n}, we divide the proof into two cases.
Case 1. There exists (5) and (6), and the fact that z_{n_k} ⇀ w for some w ∈ H, and the following holds: lim sup We claim that w ∈ S*. In order to prove this, we need to consider two cases for {z_{n_k}}. The first case: γ_{n_k} = σ for all but finitely many k. Then, without loss of generality, we can assume that γ_{n_k} = σ for all k ∈ N. From the definition of γ_{n_k}, we have The uniform continuity of ∇f implies that lim Since Gph(∂(f + g)) is demiclosed, we obtain 0 ∈ ∂(f + g)(w), and hence w ∈ S*. The second case: there exists a subsequence {z_{n_k_j}} of {z_{n_k}} such that γ_{n_k_j} ≤ σθ for all j ∈ N.
From the definition of γ_{n_k_j}, we have Moreover, from Lemma 1 we have which implies that x_{n_k_j} ⇀ w. Since ∇f is uniformly continuous, we also have Combining this with (10), we obtain
Case 2. There exists a subsequence {x_{m_j}} of {x_n} such that ‖x_{m_j} − x*‖ < ‖x_{m_j+1} − x*‖ for all j ∈ N. From Lemma 3, there exists a nondecreasing sequence {n_k} of N such that lim_{k→∞} n_k = ∞ and the following holds for all sufficiently large k ∈ N: From the definition of z_{n_k} and (8) we have, for all k ∈ N, Combining this with (9), we obtain Since {z_{n_k}} is bounded, there exists a subsequence {z_{n_k_j}} such that z_{n_k_j} ⇀ w, for some w ∈ H, and lim sup Using the same argument as in Case 1, we have that w ∈ S* and lim sup Moreover, it follows from (11) that Thus, we can conclude that {x_n} converges strongly to x*, and the proof is complete.
Remark 1. We observe that we can prove our main result, Theorem 1, without condition A2 and with the weaker condition C2 in place of A3, whereas VIFBA requires all of these conditions.
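To illustrate how the inertial, viscosity, linesearch and forward–backward ingredients fit together, the following Python sketch (reusing linesearch_stepsize from the sketch above) performs an inertial extrapolation with parameter β_n, a viscosity step with a c-contraction F and weight α_n, and a forward–backward update with the linesearch stepsize. The ordering of the steps shown here is an assumption made for illustration; the authoritative description of our method is Algorithm 5 itself.

```python
def viscosity_inertial_fb(grad_f, prox_g, F, x0, beta, alpha, sigma, theta, delta, n_iter=200):
    """Sketch of a viscosity inertial forward-backward scheme with a linesearch stepsize.

    beta(n), alpha(n) : inertial and viscosity parameter sequences
    F                 : a c-contraction (e.g., F(x) = 0.999 * x as in Section 4)
    The step ordering below is assumed for illustration only.
    """
    x_prev = x0.copy()
    x = x0.copy()
    for n in range(1, n_iter + 1):
        y = x + beta(n) * (x - x_prev)               # inertial extrapolation
        z = alpha(n) * F(x) + (1 - alpha(n)) * y     # viscosity step with the contraction F
        gamma, p = linesearch_stepsize(z, grad_f, prox_g, sigma, theta, delta)
        x_prev, x = x, p                             # x_{n+1} = prox_{gamma*g}(z - gamma*grad_f(z))
    return x
```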

Applications to Data Classification and Regression Problems
As mentioned in the literature, many real world problems can be formulated as a convex minimization problem. In this section, we therefore illustrate how some problems in machine learning, namely classification and regression problems, can be reformulated as a convex minimization problem, and we apply our proposed algorithm to solve them. We also show that our proposed method is more efficient than some methods mentioned in the literature.
First, we give a brief overview of the extreme learning machine for data classification and regression problems; then we apply our main result to these two problems by conducting numerical experiments. We also compare the performance of FISTA, VIFBA, and Algorithms 4 and 5.

Extreme learning machine (ELM). Let {(x_k, t_k) : k = 1, . . ., N} be a training set of N distinct samples, where x_k is the input data and t_k is the target. For a single hidden layer of an ELM, the output of the i-th hidden node is G(⟨a_i, x⟩ + b_i), where G is an activation function and a_i, b_i are the parameters of the i-th hidden node. The output function of an ELM for SLFNs with M hidden nodes is o(x) = Σ_{i=1}^M β_i G(⟨a_i, x⟩ + b_i), where β_i is the output weight of the i-th hidden node. The hidden layer output matrix H is the N × M matrix with entries H_{ki} = G(⟨a_i, x_k⟩ + b_i). The main goal of ELM is to find β = [β_1^T, . . ., β_M^T]^T such that Hβ = T, where T = [t_1^T, . . ., t_N^T]^T is the matrix of training targets. In some cases, finding β = H†T, where H† is the Moore–Penrose generalized inverse of H, may be a difficult task, for instance when H† does not exist. Thus, finding such a solution β by means of convex minimization can overcome this difficulty.
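The following is a minimal Python sketch of how the hidden layer output matrix H can be formed for a single hidden layer with the sigmoid activation; the random choice of a_i, b_i and the additive node form G(⟨a_i, x_k⟩ + b_i) are assumptions made for illustration.

```python
import numpy as np

def elm_hidden_matrix(X, M, seed=0):
    """Hidden layer output matrix H of an ELM: H[k, i] = G(<a_i, x_k> + b_i).

    X : (N, d) array of input samples x_k; M : number of hidden nodes.
    The hidden parameters a_i, b_i are drawn at random and then kept fixed
    (as in ELM), and G is the sigmoid activation function.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.standard_normal((d, M))   # columns are the hidden weights a_i
    b = rng.standard_normal(M)        # biases b_i
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))
```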
In this section, we conduct experiments on regression and classification problems, formulated as the following convex minimization problem: Minimize ‖Hβ − T‖²₂ + λ‖β‖₁, where λ is a regularization parameter. This problem is known as the least absolute shrinkage and selection operator (LASSO) [26]. In this case, f(x) = ‖Hx − T‖²₂ and g(x) = λ‖x‖₁. We note that, in our experiments, FISTA and VIFBA can be used to solve these problems, since the corresponding Lipschitz constants of ∇f exist. However, FISTA and VIFBA fail on problems for which such Lipschitz constants do not exist, while Algorithms 4 and 5 succeed.
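Given H and the target matrix T, the two parts of the LASSO objective and the quantities needed by the algorithms above can be set up as in the following sketch; the gradient 2Hᵀ(Hβ − T) corresponds to the smooth part f(x) = ‖Hx − T‖²₂.

```python
import numpy as np

def lasso_parts(H, T, lam):
    """f, grad_f, g and prox_g for the LASSO problem: minimize ||H b - T||_2^2 + lam * ||b||_1."""
    f = lambda b: np.sum((H @ b - T) ** 2)
    grad_f = lambda b: 2.0 * H.T @ (H @ b - T)       # gradient of the smooth part
    g = lambda b: lam * np.sum(np.abs(b))
    prox_g = lambda v, gamma: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # prox_{gamma*g}
    return f, grad_f, g, prox_g
```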

Regression of a Sine Function
Throughout Sections 4.1 and 4.2, all parameters are chosen to satisfy the hypotheses of Theorem 1. All experiments were performed on an Intel Core i5-7500 CPU with 16 GB RAM and a GeForce GTX 1060 6 GB GPU.
As described in Table 1, we randomly create 10 distinct points x_1, x_2, . . ., x_10 with values in [−4, 4]; we then create the training set S := {sin x_n : n = 1, . . ., 10} and use the graph of the sine function on [−4, 4] as the target. The activation function is the sigmoid, the number of hidden nodes is M = 100, and the regularization parameter is λ = 1 × 10⁻⁵. We use FISTA, VIFBA, and Algorithms 4 and 5 to predict the sine function from the 10 training points. The first experiment compares the performance of Algorithm 5 with different c-contractive mappings F, so that we can observe whether F affects the performance of Algorithm 5. We use the mean square error, MSE = (1/N) Σ_{i=1}^N (O_i − T_i)², as the measure, where O is the output and T the target. By setting σ = 0.49, δ = 0.1, θ = 0.1, the inertial parameter β_n = 1/(‖x_n − x_{n−1}‖³ + n³), and MSE ≤ 5 × 10⁻³ as the stopping criterion, we obtain the results seen in Table 2. We observe that Algorithm 5 performs better when c is closer to 1.
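For reference, the data generation and evaluation pipeline used in this subsection can be sketched as follows (reusing elm_hidden_matrix and lasso_parts from the sketches above; the random seed is illustrative and the choice of solver is left open).

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-4.0, 4.0, size=(10, 1))   # 10 random training points in [-4, 4]
t_train = np.sin(x_train).ravel()                # training targets sin(x_n)

H = elm_hidden_matrix(x_train, M=100)            # hidden layer output matrix (100 hidden nodes, sigmoid)
f, grad_f, g, prox_g = lasso_parts(H, t_train, lam=1e-5)

beta = np.zeros(H.shape[1])
# ... update beta with FISTA, VIFBA, Algorithm 4 or Algorithm 5 until the stopping criterion is met ...
mse = np.mean((H @ beta - t_train) ** 2)         # mean square error of the output against the target
```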
In the second experiment, we compare the performance of Algorithm 5 with different inertial parameters β_n in Theorem 1, namely β¹_n, β²_n, β³_n and β⁴_n. It can be shown that β¹_n, β²_n, β³_n and β⁴_n satisfy C2. By setting σ = 0.49, δ = 0.1, θ = 0.1, F(x) = 0.999x, and MSE ≤ 5 × 10⁻³ as the stopping criterion, we obtain the results seen in Table 3. We can clearly see that β⁴_n significantly improves the performance of Algorithm 5. Although β⁴_n converges to 0 as n → ∞, the behavior of β⁴_n differs from that of β²_n and β³_n during the first few iterations, i.e., β⁴_n is extremely close to 1 while β²_n and β³_n are far away from 1. Based on this experiment, we choose β⁴_n as the default inertial parameter for the later experiments. In the third experiment, we compare the performance of FISTA, VIFBA, and Algorithms 4 and 5; the parameters chosen for each algorithm are listed in Table 4.

Table 4. Chosen parameters of each algorithm.
By setting MSE ≤ 5 × 10⁻³ as the stopping criterion (with the parameters of Table 4 for FISTA, VIFBA, Algorithm 4 and Algorithm 5), we obtain the results seen in Table 5. We observe that Algorithm 5 takes only 129 iterations, while FISTA, VIFBA and Algorithm 4 take more iterations, and that Algorithm 5 requires less training time than Algorithm 4.
Next, we compare each algorithm at the 3000th iteration with different kinds of measures, namely the mean absolute error MAE = (1/N) Σ_{i=1}^N |O_i − T_i| and the root mean squared error RMSE = √((1/N) Σ_{i=1}^N (O_i − T_i)²). The results can be seen in Table 6. We observe from Table 6 that Algorithm 5 has the lowest MAE and RMSE, but takes the longest training time. In Figure 2, we observe that Algorithm 5 outperforms the other algorithms in the regression of the graph of a sine function when the number of iterations is small. In Figure 3, it is shown that Algorithm 5, FISTA and VIFBA achieve a better regression of the graph than Algorithm 4 when the number of iterations is higher.
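The error measures used above can be computed directly from the output O and the target T, for example:

```python
import numpy as np

def regression_metrics(O, T):
    """MSE, MAE and RMSE between the output O and the target T (arrays of equal length)."""
    err = np.asarray(O, dtype=float) - np.asarray(T, dtype=float)
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    return mse, mae, rmse
```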

Data Classification
In this experiment, we classify the type of Iris plant from the Iris dataset created by Fisher [27]. As shown in Table 7, this dataset contains 3 classes of 50 instances each, and each sample has four attributes.
We would also like to thank https://archive.ics.uci.edu for providing the dataset. With this dataset, we use the sigmoid as the activation function, M = 100 hidden nodes, and the regularization parameter λ = 1 × 10⁻⁵. We use FISTA, VIFBA, and Algorithms 4 and 5 as the training algorithms to estimate the optimal weight β. The output data O of the training and testing data are obtained by O = Hβ; see Table 8 for more detail.
In the first experiment, we use the first 35 instances of each class as training data and the last 15 instances of each class as testing data; see Table 9 for details.
The accuracy of the output data is calculated by accuracy = (correctly predicted data / all data) × 100%.
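A small sketch of how this accuracy can be computed from the network output O = Hβ; assigning each sample to the nearest class label in {1, 2, 3} is an assumption made for illustration.

```python
import numpy as np

def classification_accuracy(O, labels):
    """Accuracy (%) of the output O against the true labels (values in {1, 2, 3}).

    Each output value is assigned to the nearest class label; this decision
    rule is assumed for illustration.
    """
    predicted = np.clip(np.rint(np.asarray(O, dtype=float)), 1, 3)
    return float(np.mean(predicted == np.asarray(labels)) * 100.0)
```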
To compare the performance of FISTA, VIFBA, and Algorithms 4 and 5, we choose the same parameters for each algorithm as in Table 4.
We first compare the accuracy of each method at the 700th iteration, and obtain the following results, as seen in Table 10.
As we see from Table 10, Algorithm 5 obtains the highest accuracy at the 700th iteration. We use acc.train and acc.test for the accuracy on the training data set and the testing data set, respectively.
Next, we compare each method with the stopping criteria acc.train > 90 and acc.test > 90; the results can be seen in Table 11. We also use Average ACC and ERR% to evaluate the performance of each algorithm, where Average ACC = (1/N) Σ_{i=1}^N (x_i / y_i) × 100%, N is the number of folds considered during cross-validation (N = 10), x_i is the number of correctly predicted data at fold i, and y_i is the number of all data at fold i.
Let err_Lsum be the sum of the errors over all 10 training sets, err_Tsum the sum of the errors over all 10 testing sets, Lsum the total number of data in the 10 training sets, and Tsum the total number of data in the 10 testing sets. Define ERR% = (err_L% + err_T%)/2, where err_L% = (err_Lsum / Lsum) × 100% and err_T% = (err_Tsum / Tsum) × 100%. We choose the same parameters as in Table 4. We compare the accuracy at the 1000th iteration of each fold and obtain the results seen in Table 13. We observe from Table 13 that Algorithm 5 has a higher average accuracy than Algorithm 4.
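Following the definitions above, Average ACC and ERR% can be computed from the fold-wise counts as in this sketch (the variable names are illustrative).

```python
def cv_scores(correct, total, train_err, train_total, test_err, test_total):
    """Average ACC and ERR% over N folds.

    correct[i], total[i]         : correctly predicted data and all data at fold i (Average ACC)
    train_err[i], train_total[i] : errors and data counts on the i-th training set (err_L%)
    test_err[i], test_total[i]   : errors and data counts on the i-th testing set (err_T%)
    """
    N = len(correct)
    avg_acc = sum(c / t for c, t in zip(correct, total)) / N * 100.0
    err_L = sum(train_err) / sum(train_total) * 100.0    # err_L%
    err_T = sum(test_err) / sum(test_total) * 100.0      # err_T%
    return avg_acc, (err_L + err_T) / 2.0                # ERR% = (err_L% + err_T%) / 2
```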

Conclusions
In this work, algorithms for solving the convex minimization problem (1) are studied. Many effective algorithms for solving this problem have been proposed, but most of them require a Lipschitz continuity assumption on ∇f. By combining the linesearch technique introduced by Cruz and Nghia [20] with the iterative method VIFBA of Verma et al. [19], we establish a new algorithm that does not require this Lipschitz continuity assumption. As a result, it can be applied to problems for which Lipschitz constants do not exist, while VIFBA and FISTA cannot. Moreover, by using viscosity approximation together with the inertial technique, our proposed algorithm has better convergence behavior than Algorithm 4. Strong convergence of our proposed method is also proven under control conditions that are weaker than those of VIFBA.
Our algorithm can be used to solve many real world problems, such as image and signal processing and machine learning, especially regression and classification problems. To compare the performance of FISTA, VIFBA, Algorithm 4 and our proposed algorithm (Algorithm 5), we conducted numerical experiments on the latter problems. We observe from these experiments that Algorithms 4 and 5 take longer computational time than FISTA and VIFBA for the same number of iterations, because the linesearch step (Linesearch 1) takes a long time to compute. In the experiments with stopping criteria (Tables 5 and 11), Algorithm 5 converges to a solution in fewer iterations than Algorithm 4 and hence performs better in terms of speed. We can also observe that Algorithm 5 performs well in terms of accuracy, especially when compared with Algorithm 4.
Since FISTA performs better than Algorithm 5 in terms of speed, our future research aims to find a new linesearch technique that takes less computational time than Linesearch 1 and hence decreases the computational time of Algorithm 5, so that it can compete with FISTA.

Figure 2 .
A regression of a sine function at the 130th iteration.

Figure 3 .
A regression of a sine function at the 3000th iteration.

Table 1 .
Detail about the regression of a sine function experiment.
Training set: Create a training matrix R = [x_1 x_2 . . . x_10]^T, where x_1, x_2, . . ., x_10 ∈ [−4, 4] are randomly generated. Create the training target matrix S = [sin x_1 sin x_2 . . . sin x_10]^T.
Learning process: Find the optimal weight β* of this problem using Algorithm 5 with a certain number of iterations.
Testing process: Generate the hidden layer output matrix H_2 of the testing matrix V with 100 hidden nodes using the sigmoid as the activation function. Calculate the output O = H_2 β*. Calculate the MSE, MAE and RMSE of the output O and the target matrix T.

Table 2 .
Numerical results of c-contractive mapping.

Table 3 .
Numerical results of each inertial parameter.

Table 5 .
Numerical results of a regression of a sine function with the stopping criteria.

Table 6 .
Numerical results of a regression of a sine function at the 3000th iteration.

Table 8 .
Details about the classification of the Iris dataset experiment. Let 1, 2 and 3 represent Iris setosa, Iris versicolor and Iris virginica, respectively.
Training set: Create the N × 1 training matrix S of the numbers 1, 2 and 3 according to the training set, where N is the number of training samples. Create the N × 4 training attribute matrix R according to S and the attribute data.
Testing set: Create the M × 1 testing matrix T of the numbers 1, 2 and 3 according to the testing set, where M is the number of testing samples. Create the M × 4 testing attribute matrix V according to T and the attribute data.
Learning process: Generate the hidden layer output matrix H_1 of the training attribute matrix R with 100 hidden nodes using the sigmoid as the activation function. Choose λ = 1 × 10⁻⁵ and formulate the convex minimization problem: Minimize ‖H_1 β − S‖²₂ + λ‖β‖₁. Find the optimal weight β* of this problem using Algorithm 5 as the learning algorithm with a certain number of iterations.
Testing process: Calculate the output O_1 = H_1 β*. Calculate the number of correctly predicted samples between the output O_1 and the training matrix S. Generate the hidden layer output matrix H_2 of the testing attribute matrix V with 100 hidden nodes using the sigmoid as the activation function. Calculate the output O_2 = H_2 β*. Calculate the number of correctly predicted samples between the output O_2 and the testing matrix T. Calculate acc.train, acc.test, Average ACC and ERR%.

Table 9 .
Training and testing sets of the Iris dataset.

Table 10 .
The performance of each algorithm at the 700th iteration.

Table 11 .
The performance of each algorithm with the stopping criteria. We observe from Table 11 that Algorithm 5 performs better than Algorithm 4. In the next experiment, we use 10-fold stratified cross-validation to set up the training and testing data; see Table 12 for details.

Table 12 .
Training and testing sets for 10-fold stratified cross-validation.

Table 13 .
The performance of each algorithm at the 1000th iteration with a 10-fold stratified cross-validation.