Article

A New Accelerated Forward–Backward Splitting Algorithm for Monotone Inclusions with Application to Data Classification

1 PhD Degree Program in Mathematics, Department of Mathematics, Faculty of Science, Chiang Mai University, under the CMU Presidential Scholarship, Chiang Mai 50200, Thailand
2 Department of Statistics, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand
3 Research Center in Optimization and Computational Intelligence for Big Data Prediction, Department of Mathematics, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(17), 2783; https://doi.org/10.3390/math13172783
Submission received: 18 July 2025 / Revised: 20 August 2025 / Accepted: 28 August 2025 / Published: 29 August 2025
(This article belongs to the Special Issue Variational Analysis, Optimization, and Equilibrium Problems)

Abstract

This paper proposes a new accelerated fixed-point algorithm based on a double-inertial extrapolation technique for solving structured variational inclusion and convex bilevel optimization problems. The underlying framework leverages fixed-point theory and operator splitting methods to address inclusion problems of the form 0 ∈ (A + B)(x), where A is a cocoercive operator and B is a maximally monotone operator defined on a real Hilbert space. The algorithm incorporates two inertial terms and a relaxation step via a contractive mapping, resulting in improved convergence properties and numerical stability. Under mild conditions on the step sizes and inertial parameters, we establish strong convergence of the proposed algorithm to a point in the solution set that satisfies a variational inequality with respect to a contractive mapping. Beyond the theoretical development, we demonstrate the practical effectiveness of the proposed algorithm by applying it to data classification tasks using Deep Extreme Learning Machines (DELMs). In particular, the training processes of Two-Hidden-Layer ELM (TELM) models are reformulated as convex regularized optimization problems, enabling robust learning without requiring direct matrix inversions. Experimental results on benchmark and real-world medical datasets, including breast cancer and hypertension prediction, confirm the superior performance of our approach in terms of evaluation metrics and convergence. This work unifies and extends existing inertial-type forward–backward schemes, offering a versatile and theoretically grounded optimization tool for both fundamental research and practical applications in machine learning and data science.

1. Introduction

The convex bilevel optimization problem plays an important role in real-world applications such as image and signal processing, data classification, medical imaging, machine learning, and so on. Recently, deep learning has become an important tool in many areas, such as image classification, speech recognition, and medical data analysis.
The convex bilevel optimization problem consists of the following two levels.
The outer-level problem is
min_{x ∈ Γ} ϕ(x), (1)
where ϕ : H → ℝ is a strongly convex and differentiable function over a real Hilbert space H and Γ is the set of solutions to the inner-level problem:
arg min_{x ∈ H} { f(x) + g(x) }, (2)
where f is convex and differentiable and g ∈ Γ_0(H), the class of proper, lower semicontinuous, convex functions on H. The implicit nature of the constraint set Γ makes the bilevel problem particularly challenging and well suited to operator-theoretic approaches.
Various algorithms have been developed to solve Problems (1) and (2). Among these, the Bilevel Gradient Sequential Averaging Method (BiG-SAM, Algorithm 1) was proposed by Sabach and Shtern [1] as follows:
Algorithm 1 BiG-SAM
  • Input: x_1 ∈ ℝ^m, α_n ∈ (0, 1), γ_n ∈ (0, 1/L_f) and s ∈ (0, 2/(L_ϕ + σ)), where L_f and L_ϕ are the Lipschitz constants of ∇f and ∇ϕ, respectively.
  • For n ≥ 1:
  • Compute:
    y_n = prox_{γ_n g}(x_n − γ_n ∇f(x_n)),  x_{n+1} = α_n (x_n − s∇ϕ(x_n)) + (1 − α_n) y_n,
    where ∇f and ∇ϕ are the gradients of f and ϕ, respectively.
They showed that x_n → x ∈ Ω, where Ω is the set of all solutions to Problem (1).
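To illustrate the structure of Algorithm 1, the following Python sketch applies BiG-SAM to a simple instance in which the inner problem is a LASSO problem, f(x) = ½‖Ax − b‖² and g(x) = λ‖x‖₁, and the outer objective is ϕ(x) = ½‖x‖². This instance, the helper names (`soft_threshold`, `big_sam`), and the parameter choices are our illustrative assumptions, not part of the source.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1 (componentwise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def big_sam(A, b, lam, s, n_iter=500):
    """Sketch of Algorithm 1 for min ||x||^2/2 over argmin ||Ax - b||^2/2 + lam*||x||_1."""
    Lf = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad f(x) = A^T (Ax - b)
    gamma = 0.9 / Lf                          # step size in (0, 1/Lf)
    x = np.zeros(A.shape[1])
    for n in range(1, n_iter + 1):
        alpha = 1.0 / (n + 1)                 # alpha_n -> 0 with sum alpha_n = infinity
        y = soft_threshold(x - gamma * (A.T @ (A @ x - b)), gamma * lam)  # prox-gradient step
        x = alpha * (x - s * x) + (1 - alpha) * y   # grad phi(x) = x for phi(x) = ||x||^2 / 2
    return x
```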
The inertial technique for accelerating the convergence behavior of the algorithms was first proposed by Polyak [2]. Since then, this technique has been continuously employed in various algorithms.
Shehu et al. [3] designed the inertial Bilevel Gradient Sequential Averaging Method (iBiG-SAM, Algorithm 2) as an extension of Algorithm 1, incorporating an inertial technique to improve its convergence rate.
Algorithm 2 iBiG-SAM
  • Input: x_0, x_1 ∈ ℝ^m, α ≥ 3, α_n ∈ (0, 1), γ_n ∈ (0, 2/L_f), s ∈ (0, 2/(L_ϕ + σ)], where L_f and L_ϕ are the Lipschitz constants of ∇f and ∇ϕ, respectively.
  • For n ≥ 1:
  • Choose: θ_n ∈ [0, θ̄_n], where θ̄_n is defined by
    θ̄_n := min{ (n − 1)/(n + α − 1), ξ_n/‖x_n − x_{n−1}‖ } if x_n ≠ x_{n−1};  (n − 1)/(n + α − 1) otherwise.
  • Compute:
    y_n = x_n + θ_n (x_n − x_{n−1}),  t_n = prox_{γ_n g}(y_n − γ_n ∇f(y_n)),  w_n = y_n − s∇ϕ(y_n),  x_{n+1} = α_n w_n + (1 − α_n) t_n,
    where ∇f and ∇ϕ are the gradients of f and ϕ, respectively.
They further demonstrated that the sequence {x_n} converges to some x ∈ Ω when the control sequence {α_n} satisfies the following criteria:
lim_{n→∞} α_n = 0 and Σ_{n=1}^∞ α_n = ∞. (3)
To further improve the convergence performance of Algorithm 2, Duan and Zhang [4] developed three related methods, namely, the alternated inertial Bilevel Gradient Sequential Averaging Method (aiBiG-SAM, Algorithm 3), the multi-step inertial Bilevel Gradient Sequential Averaging Method (miBiG-SAM, Algorithm 4), and the alternated multi-step inertial Bilevel Gradient Sequential Averaging Method (amiBiG-SAM, Algorithm 5), which were defined as follows:
Algorithm 3 aiBiG-SAM
  • Input: x_0, x_1 ∈ ℝ^m, α ≥ 3, ξ > 0, γ_n ∈ (0, 2/L_f), ξ_n ∈ (0, 1), s ∈ (0, 2/(L_ϕ + σ)], where L_f and L_ϕ are the Lipschitz constants of ∇f and ∇ϕ, respectively. Let {α_n} be a sequence in (0, 1) satisfying (3).
  • For n ≥ 1:
  • Step 1. Compute:
    y_n = x_n + θ_n (x_n − x_{n−1}) if n is odd;  y_n = x_n otherwise.
    When n is odd, choose θ_n such that 0 ≤ |θ_n| ≤ θ̄_n, with θ̄_n defined by
    θ̄_n := min{ n/(n + α − 1), ξ_n/‖x_n − x_{n−1}‖ } if x_n ≠ x_{n−1};  n/(n + α − 1) otherwise.
    When n is even, θ_n = 0.
  • Step 2. Compute:
    t_n = prox_{γ_n g}(y_n − γ_n ∇f(y_n)),  w_n = y_n − s∇ϕ(y_n),  x_{n+1} = α_n w_n + (1 − α_n) t_n,
    where ∇f and ∇ϕ are the gradients of f and ϕ, respectively.
  • Step 3. If ‖x_n − x_{n−1}‖ < ξ, then stop. Otherwise, set n = n + 1 and go to Step 1.
Algorithm 4 miBiG-SAM
  • Input: x_0, x_1 ∈ ℝ^m, α ≥ 3, ξ > 0, γ_n ∈ (0, 2/L_f), ξ_n ∈ (0, 1), s ∈ (0, 2/(L_ϕ + σ)], where L_f and L_ϕ are the Lipschitz constants of ∇f and ∇ϕ, respectively. Let {α_n} be a sequence in (0, 1) satisfying (3).
  • For n ≥ 1:
  • Step 1. Given x_n, x_{n−1}, …, x_{n−q+1}, compute
    y_n = x_n + Σ_{i ∈ Q} θ_{i,n} (x_{n−i} − x_{n−1−i}),
    where Q = {0, 1, …, q − 1}. Choose θ_{i,n} such that 0 ≤ |θ_{i,n}| ≤ θ̄_n, with θ̄_n defined by
    θ̄_n := min{ n/(n + α − 1), ξ_n/Σ_{i ∈ Q} ‖x_{n−i} − x_{n−1−i}‖ } if Σ_{i ∈ Q} ‖x_{n−i} − x_{n−1−i}‖ ≠ 0;  n/(n + α − 1) otherwise.
  • Step 2. Compute:
    t_n = prox_{γ_n g}(y_n − γ_n ∇f(y_n)),  w_n = y_n − s∇ϕ(y_n),  x_{n+1} = α_n w_n + (1 − α_n) t_n,
    where ∇f and ∇ϕ are the gradients of f and ϕ, respectively.
  • Step 3. If ‖x_n − x_{n−1}‖ < ξ, then stop. Otherwise, set n = n + 1 and go to Step 1.
Algorithm 5 amiBiG-SAM
  • Input: x_0, x_1 ∈ ℝ^m, α ≥ 3, ξ > 0, γ_n ∈ (0, 2/L_f), ξ_n ∈ (0, 1), s ∈ (0, 2/(L_ϕ + σ)], where L_f and L_ϕ are the Lipschitz constants of ∇f and ∇ϕ, respectively. Let {α_n} be a sequence in (0, 1) satisfying (3).
  • For n ≥ 1:
  • Step 1. Given x_n, x_{n−1}, …, x_{n−q+1}, compute
    y_n = x_n + Σ_{i ∈ Q} θ_{i,n} (x_{n−i} − x_{n−1−i}) if n is odd;  y_n = x_n otherwise,
    where Q = {0, 1, …, q − 1}. When n is odd, choose θ_{i,n} such that 0 ≤ |θ_{i,n}| ≤ θ̄_n, with θ̄_n defined by
    θ̄_n := min{ n/(n + α − 1), ξ_n/Σ_{i ∈ Q} ‖x_{n−i} − x_{n−1−i}‖ } if Σ_{i ∈ Q} ‖x_{n−i} − x_{n−1−i}‖ ≠ 0;  n/(n + α − 1) otherwise.
    When n is even, θ_{i,n} = 0.
  • Step 2. Compute:
    t_n = prox_{γ_n g}(y_n − γ_n ∇f(y_n)),  w_n = y_n − s∇ϕ(y_n),  x_{n+1} = α_n w_n + (1 − α_n) t_n,
    where ∇f and ∇ϕ are the gradients of f and ϕ, respectively.
  • Step 3. If ‖x_n − x_{n−1}‖ < ξ, then stop. Otherwise, set n = n + 1 and go to Step 1.
The convergence analysis revealed that Algorithms 3–5 achieve better performance than Algorithms 1 and 2 (see more details in Duan and Zhang [4]).
We note that Algorithms 1–5 were developed based on fixed-point techniques. Subsequently, viscosity approximation methods combined with the fixed-point method and inertial technique were employed to develop accelerated algorithms for solving convex bilevel optimization problems (see [5,6,7]).
The convex minimization problem (2) is one of the most fundamental and crucial problems in applied mathematics, medical imaging, data science, data classification, and computer science.
It is well known that x* is a solution to Problem (2) if, and only if,
0 ∈ ∇f(x*) + ∂g(x*). (4)
This leads to the more general framework of variational inclusion problems, which unifies many classes of problems by seeking a point x* such that
0 ∈ A(x*) + B(x*), (5)
where A : H → H is a single-valued monotone operator and B : H → 2^H is a maximally monotone operator. The solution set of Problem (5) is denoted by (A + B)⁻¹(0). Variational inclusion problems generalize fixed-point problems, monotone equations, and variational inequalities, and provide a flexible structure for handling nonsmooth terms and constraints.
The variational inclusion Problem (5) can be reformulated as the fixed-point equation
x* = T x*,
where T := J_{γB}(I − γA) and J_{γB}(x) := (I + γB)⁻¹(x) for some γ > 0.
Over the years, various iterative schemes have been proposed for solving the variational inclusion Problem (5). A well-known and extensively studied one is the forward–backward method (FBM), defined by
x_1 ∈ H,  x_{n+1} = J_{γ_n B}(I − γ_n A) x_n,  n ≥ 1,
where γ_n is a positive step size.
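As an illustration, when A is the gradient of a least-squares term and B = ∂(λ‖·‖₁), the resolvent J_{γB} reduces to componentwise soft-thresholding and the FBM becomes the classical proximal gradient method. A minimal Python sketch under these assumptions (the variable names are ours):

```python
import numpy as np

def forward_backward(A_mat, b, lam, n_iter=500):
    """Sketch of the FBM iteration x_{n+1} = J_{gamma B}(I - gamma A) x_n, assuming
    A x = grad(||A_mat x - b||^2 / 2) and B = subdifferential of lam * ||.||_1,
    so that the resolvent J_{gamma B} is componentwise soft-thresholding."""
    L = np.linalg.norm(A_mat, 2) ** 2          # A is (1/L)-cocoercive, so gamma in (0, 2/L)
    gamma = 1.0 / L
    x = np.zeros(A_mat.shape[1])
    for _ in range(n_iter):
        forward = x - gamma * (A_mat.T @ (A_mat @ x - b))                        # forward step
        x = np.sign(forward) * np.maximum(np.abs(forward) - gamma * lam, 0.0)    # backward step
    return x
```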
To improve the performance of the classical forward–backward method in solving monotone inclusion problems, Moudafi and Oliny (2003) [8] proposed an inertial variant known as the Inertial Forward–Backward Algorithm (IFBA). This method incorporates a momentum-like term inspired by inertial techniques and is designed for finding a zero of the sum of two monotone operators. The iterative scheme was given by
x_0, x_1 ∈ H,  y_n = x_n + θ_n (x_n − x_{n−1}),  x_{n+1} = J_{γ_n B}(y_n − γ_n A x_n),  n ≥ 1,
where γ_n ∈ (0, 2/L_A) and L_A denotes the Lipschitz constant of the monotone operator A. The inclusion of the extrapolation parameter θ_n aims to accelerate the convergence rate of the algorithm. Under suitable assumptions on θ_n, they proved that the generated sequence converges weakly to a solution of the inclusion problem.
Recently, Peeyada [9] proposed the Inertial Mann Forward–Backward Splitting Algorithm (IMFBSA, Algorithm 6) as a refined approach that combines Mann-type iterations with inertial extrapolation, as follows:
Algorithm 6 IMFBSA
  • Input: x_0, x_1 ∈ H, {η_n} ⊂ (0, 1), {γ_n} ⊂ (0, 2β), {θ_n} ⊂ [0, ∞) satisfying the following conditions:
    0 < lim inf_{n→∞} γ_n ≤ lim sup_{n→∞} γ_n < 2β and Σ_{n=1}^∞ θ_n ‖x_n − x_{n−1}‖ < ∞.
  • Step 1. Compute:
    y_n = x_n + θ_n (x_n − x_{n−1}),  z_n = y_n + η_n (x_n − y_n),  x_{n+1} = J_{γ_n B}(I − γ_n A) z_n.
  • Step 2. Set n = n + 1 and go to Step 1.
Moreover, several authors have developed the algorithms by using multi-inertial forward–backward schemes for solving variational inclusion problems, which ensure convergence and demonstrate efficiency in practical applications such as image deblurring (see [10,11]). These developments illustrate the increasing interest in designing inertial-type iterative methods for monotone inclusion and convex optimization problems.
Building upon and inspired by the above-mentioned studies, in this work we propose a new accelerated fixed-point algorithm designed to address variational inclusion and convex bilevel optimization problems, prove its strong convergence, and compare its effectiveness in data classification with that of the existing algorithms.
The structure of this paper is as follows. In Section 2, we introduce some fundamental definitions and key lemmas used in the later sections. The main theoretical contributions of our study are presented in Section 3. In Section 4, we apply the proposed algorithm to data classification problems using breast cancer and hypertension datasets, and compare its performance with other existing methods. Finally, the conclusion of our work is given in Section 5.

2. Preliminaries

Throughout this work, let H be a real Hilbert space and T : H → H a mapping. We write x_n → x for strong convergence and x_n ⇀ x for weak convergence.
Definition 1.
A mapping T : H → H is called Lipschitzian if
‖Tx − Ty‖ ≤ L‖x − y‖, ∀x, y ∈ H,
for some L ≥ 0. If L = k ∈ [0, 1), then T is a k-contraction, and if L = 1, then T is nonexpansive.
Definition 2.
Let D ⊆ H and let β ∈ ℝ₊₊. Then, T is β-cocoercive if βT is firmly nonexpansive, i.e.,
⟨x − y, Tx − Ty⟩ ≥ β‖Tx − Ty‖², ∀x, y ∈ D.
Definition 3.
Let x ∈ H and let C ⊆ H be closed and convex. Then, there is a unique point x* ∈ C such that
‖x* − x‖ ≤ ‖y − x‖, ∀y ∈ C.
The mapping P_C : H → C defined by P_C x = x* is called the metric projection onto C.
We conclude this section with several auxiliary lemmas and propositions essential for supporting the main results.
Lemma 1
([12]). Let x, y ∈ H and λ ∈ [0, 1]. Then, the following identities and inequality hold:
(1)
‖x ± y‖² = ‖x‖² ± 2⟨x, y⟩ + ‖y‖²;
(2)
‖x + y‖² ≤ ‖x‖² + 2⟨y, x + y⟩;
(3)
‖λx + (1 − λ)y‖² = λ‖x‖² + (1 − λ)‖y‖² − λ(1 − λ)‖x − y‖².
Proposition 1.
Let C ⊆ H be convex and let x ∈ H, x* ∈ C. Then,
x* = P_C x ⇔ ⟨x − x*, y − x*⟩ ≤ 0, ∀y ∈ C.
Proposition 2
([13]). Suppose ϕ : H → ℝ is strongly convex with parameter σ > 0 and continuously differentiable, such that ∇ϕ is Lipschitz continuous with constant L_ϕ. Then, the mapping I − s∇ϕ is a k-contraction for all 0 < s ≤ 2/(L_ϕ + σ), where k = √(1 − 2σs L_ϕ/(σ + L_ϕ)) and I is the identity mapping.
Lemma 2
([14]). Let T : H → H be a nonexpansive mapping with Fix(T) ≠ ∅. If there exists a sequence {x_n} ⊂ H such that x_n ⇀ x ∈ H and ‖x_n − Tx_n‖ → 0, then x ∈ Fix(T).
Lemma 3
([15]). Let A : H → H be a β-cocoercive mapping and B : H → 2^H a maximal monotone mapping. Then, we have
(1)
Fix(J_{γB}(I − γA)) = (A + B)⁻¹(0), ∀γ > 0;
(2)
‖x − J_{γB}(I − γA)x‖ ≤ 2‖x − J_{γ̄B}(I − γ̄A)x‖, ∀x ∈ H, 0 < γ ≤ γ̄.
Lemma 4
([16]). Let {a_n} ⊂ ℝ₊, {b_n} ⊂ ℝ, and {ξ_n} ⊂ (0, 1) satisfy Σ_{n=1}^∞ ξ_n = ∞ and
a_{n+1} ≤ (1 − ξ_n) a_n + ξ_n b_n, ∀n ∈ ℕ.
If for every subsequence {a_{n_i}} of {a_n} such that lim inf_{i→∞} (a_{n_i+1} − a_{n_i}) ≥ 0 we have lim sup_{i→∞} b_{n_i} ≤ 0, then lim_{n→∞} a_n = 0.

3. Main Results

In this section, we introduce the Double Inertial Viscosity Forward–Backward Algorithm (DIVFBA), which is a modification of Algorithm 6, introduced by Peeyada [9], designed to accelerate its convergence by using a double inertial step at the first step and the viscosity approximation method at the final step. It is worth mentioning that Algorithm 6 enjoys only weak convergence, while we prove strong convergence of our proposed algorithm. Moreover, we compare the performance of our algorithm with that of the others in data classification in the next section.
Throughout this section, let S : H → H be a k-contraction mapping with k ∈ [0, 1), let A be a β-cocoercive mapping on H, and let B be a maximal monotone operator from H into 2^H such that (A + B)⁻¹(0) ≠ ∅.
We are now ready to present our accelerated fixed-point algorithm (Algorithm 7).
Algorithm 7 DIVFBA
  • Initialization: Choose x_{−1}, x_0, x_1 ∈ H, {γ_n} ⊂ (0, 2β) and {α_n}, {β_n}, {η_n}, {τ_n} ⊂ (0, 1). Take {ξ_n}, {μ_n} ⊂ (0, ∞) and {ρ_n} ⊂ (−∞, 0).
  • Iterative steps: For n ≥ 1, calculate x_{n+1} as follows:
  • Step 1. Compute the inertial step:
    θ_n = min{ μ_n, ξ_n/‖x_n − x_{n−1}‖ } if x_n ≠ x_{n−1};  μ_n otherwise, (6)
    and
    δ_n = max{ ρ_n, −ξ_n/‖x_{n−1} − x_{n−2}‖ } if x_{n−1} ≠ x_{n−2};  ρ_n otherwise. (7)
  • Step 2. Compute
    y_n = x_n + θ_n (x_n − x_{n−1}) + δ_n (x_{n−1} − x_{n−2}), (8)
    z_n = y_n + η_n (x_n − y_n), (9)
    v_n = (1 − β_n) x_n + β_n J_{γ_n B}(I − γ_n A) z_n, (10)
    w_n = (1 − τ_n) J_{γ_n B}(I − γ_n A) y_n + τ_n J_{γ_n B}(I − γ_n A) v_n, (11)
    x_{n+1} = α_n S(x_n) + (1 − α_n) w_n. (12)
    Set n := n + 1 and return to Step 1.
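To make Algorithm 7 concrete, the following Python sketch instantiates it for the inclusion 0 ∈ A(x) + B(x) with A = ∇(½‖Hx − T‖²) (which is β-cocoercive with β = 1/‖H‖²) and B = ∂(λ‖·‖₁), so that J_{γB} is soft-thresholding. The contraction S and the parameter sequences below are illustrative choices of ours and are not prescribed by Theorem 1.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def divfba(H, T, lam, n_iter=1000):
    """Sketch of Algorithm 7 (DIVFBA) for the LASSO-type inclusion:
    A = grad(||Hx - T||^2 / 2), B = subdifferential of lam*||.||_1."""
    L = np.linalg.norm(H, 2) ** 2
    gamma = 1.0 / L                                   # gamma_n in (0, 2*beta) = (0, 2/L)
    S = lambda x: 0.5 * x                             # illustrative contraction mapping
    def JB_fwd(x):                                    # J_{gamma B}(I - gamma A) x
        return soft_threshold(x - gamma * (H.T @ (H @ x - T)), gamma * lam)

    x_prev2 = x_prev = x = np.zeros(H.shape[1])
    for n in range(1, n_iter + 1):
        alpha, beta_n, eta, tau = 1 / (n + 1), 0.5, 1 / (n + 2), 0.5
        xi, mu, rho = 1e-3 / n**2, 1 - 1 / (n + 1), -1 / n
        dx1, dx2 = np.linalg.norm(x - x_prev), np.linalg.norm(x_prev - x_prev2)
        theta = min(mu, xi / dx1) if dx1 > 0 else mu          # first inertial parameter
        delta = max(rho, -xi / dx2) if dx2 > 0 else rho       # second inertial parameter
        y = x + theta * (x - x_prev) + delta * (x_prev - x_prev2)
        z = y + eta * (x - y)
        v = (1 - beta_n) * x + beta_n * JB_fwd(z)
        w = (1 - tau) * JB_fwd(y) + tau * JB_fwd(v)
        x_prev2, x_prev = x_prev, x
        x = alpha * S(x) + (1 - alpha) * w                    # viscosity/relaxation step
    return x
```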
Theorem 1.
Let {x_n} be a sequence generated by Algorithm 7, such that the following additional conditions hold:
(i)
lim_{n→∞} ξ_n/α_n = 0;
(ii)
lim_{n→∞} α_n = 0 and Σ_{n=1}^∞ α_n = ∞;
(iii)
0 < lim inf_{n→∞} τ_n ≤ lim sup_{n→∞} τ_n < 1;
(iv)
0 < lim inf_{n→∞} β_n ≤ lim sup_{n→∞} β_n < 1;
(v)
0 < lim inf_{n→∞} γ_n ≤ lim sup_{n→∞} γ_n < 2β.
Then, x_n → p ∈ (A + B)⁻¹(0), where p = P_{(A+B)⁻¹(0)} S(p).
Proof. 
Let p ( A + B ) 1 ( 0 ) . Firstly, we prove that the sequence { x n } is bounded.
From (8), we have
y n p = x n + θ n ( x n x n 1 ) + δ n ( x n 1 x n 2 ) p x n p + θ n x n x n 1 + | δ n | · x n 1 x n 2 .
From (9) and (13), we have
z n p = y n + η n ( x n y n ) p η n x n p + ( 1 η n ) y n p η n x n p + ( 1 η n ) x n p + ( 1 η n ) [ θ n x n x n 1 + | δ n | · x n 1 x n 2 ] x n p + θ n x n x n 1 + | δ n | · x n 1 x n 2 .
From (10) and the nonexpansiveness of J γ n B ( I γ n A ) , we have
v n p = ( 1 β n ) x n + β n J γ n B ( I γ n A ) z n p ( 1 β n ) x n p + β n J γ n B ( I γ n A ) z n J γ n B ( I γ n A ) p ( 1 β n ) x n p + β n z n p ( 1 β n ) x n p + β n x n p + β n [ θ n x n x n 1 + | δ n | · x n 1 x n 2 ] x n p + θ n x n x n 1 + | δ n | · x n 1 x n 2 .
From (11), (13), and (15), we have
w n p = ( 1 τ n ) J γ n B ( I γ n A ) y n + τ n J γ n B ( I γ n A ) v n p ( 1 τ n ) J γ n B ( I γ n A ) y n J γ n B ( I γ n A ) p + τ n J γ n B ( I γ n A ) v n J γ n B ( I γ n A ) p ( 1 τ n ) y n p + τ n v n p x n p + θ n x n x n 1 + | δ n | · x n 1 x n 2 .
From (12), we obtain
x n p = α n S ( x n ) + ( 1 α n ) w n p α n S ( x n ) p + ( 1 α n ) w n p α n S ( x n ) S ( p ) + α n S ( p ) p + ( 1 α n ) w n p α n k x n p + α n S ( p ) p + ( 1 α n ) w n p α n k x n p + α n S ( p ) p + ( 1 α n ) [ x n p + θ n x n x n 1 + | δ n | · x n 1 x n 2 ( 1 α n ( 1 k ) ) x n p + α n S ( p ) p + θ n x n x n 1 + | δ n | · x n 1 x n 2 = ( 1 α n ( 1 k ) ) x n p + α n ( 1 k ) [ 1 1 k ( θ n α n x n x n 1 + | δ n | α n · x n 1 x n 2 + S ( p ) p ) ] max { x n p , 1 1 k ( θ n α n x n x n 1 + | δ n | α n · x n 1 x n 2 + S ( p ) p ) } .
By (6) and the condition (i), we have
θ n α n x n x n 1 0   as   n .
Hence, there is M 1 > 0 , such that
θ n α n x n x n 1 < M 1 , n N .
Similarly, we also have that
| δ n | α n · x n 1 x n 2 0   as   n ,
and
| δ n | α n · x n 1 x n 2 < M 2 , M 2 > 0 n N .
From (17), we obtain
x n + 1 p max { x n p , 1 1 k ( M 1 + M 2 + S ( p ) p ) } max { x 1 p , 1 1 k ( M 1 + M 2 + S ( p ) p ) } .
This implies that { x n } is bounded and so are { y n } , { z n } , { w n } , { v n } , and { S ( x n ) } .
By (8), we have
y n p 2 = x n + θ n ( x n x n 1 ) + δ n ( x n 1 x n 2 ) p 2 = x n p 2 + 2 θ n x n p , x n x n 1 + 2 δ n x n p , x n 1 x n 2 + θ n ( x n x n 1 ) + δ n ( x n 1 x n 2 ) 2 = x n p 2 + 2 θ n x n p , x n x n 1 + 2 δ n x n p , x n 1 x n 2 + θ n 2 x n x n 1 2 + 2 θ n δ n x n x n 1 , x n 1 x n 2 + δ n 2 x n 1 x n 2 2 x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2 + θ n 2 x n x n 1 2   +   2 θ n | δ n | · x n x n 1 · x n 1 x n 2 + δ n 2 x n 1 x n 2 2 .
From (9) and (20), we have
z n p 2 = y n + η n ( x n y n ) p 2 η n x n p 2 + ( 1 η n ) y n p 2 η n x n p 2 + ( 1 η n ) [ x n p 2 + 2 θ n x n p · x n x n 1   +   2 | δ n | · x n p · x n 1 x n 2 + θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2 + δ n 2 x n 1 x n 2 2 ] x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2   +   θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2   +   δ n 2 x n 1 x n 2 2 .
By the nonexpansiveness of J γ n B ( I γ n A ) and (21), we have
v n p 2 = ( 1 β n ) x n + β n J γ n B ( I γ n A ) z n p 2 = ( 1 β n ) ( x n p ) + β n ( J γ n B ( I γ n A ) z n J γ n B ( I γ n A ) p ) 2 ( 1 β n ) x n p 2 + β n z n p 2 β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 ( 1 β n ) x n p 2 + β n [ x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2   +   θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2   +   δ n 2 x n 1 x n 2 2 ] β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2   +   θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2   +   δ n 2 x n 1 x n 2 2 β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 .
Using (20) and (22), we have
w n p 2 = ( 1 τ n ) J γ n B ( I γ n A ) y n + τ n J γ n B ( I γ n A ) v n p 2 ( 1 τ n ) J γ n B ( I γ n A ) y n J γ n B ( I γ n A ) p 2 + τ n J γ n B ( I γ n A ) v n J γ n B ( I γ n A ) p 2 ( 1 τ n ) y n p 2 + τ n v n p 2 x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2   +   θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2   +   δ n 2 x n 1 x n 2 2 τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 .
It follows from Lemma 1 and (23), that
x n + 1 p 2 = α n S ( x n ) + ( 1 α n ) w n p 2 = α n ( S ( x n ) S ( p ) ) + ( 1 α n ) ( w n p ) + α n ( S ( p ) p ) 2 α n ( S ( x n ) S ( p ) ) + ( 1 α n ) ( w n p ) 2 + 2 α n S ( p ) p , x n + 1 p α n S ( x n ) S ( p ) 2 + ( 1 α n ) w n p 2 + 2 α n S ( p ) p , x n + 1 p α n k 2 x n p 2 + ( 1 α n ) w n p 2 + 2 α n S ( p ) p , x n + 1 p α n k x n p 2 + ( 1 α n ) [ x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2   +   θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2   +   δ n 2 x n 1 x n 2 2 ] + 2 α n S ( p ) p , x n + 1 p ( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 = ( 1 α n ( 1 k ) ) x n p 2 + 2 θ n x n p · x n x n 1 + 2 | δ n | · x n p · x n 1 x n 2   +   θ n 2 x n x n 1 2 + 2 θ n | δ n | · x n x n 1 · x n 1 x n 2   +   δ n 2 x n 1 x n 2 2 + 2 α n S ( p ) p , x n + 1 p ( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 = ( 1 α n ( 1 k ) ) x n p 2 + θ n x n x n 1 [ 2 x n p   +   θ n x n x n 1   +   2 | δ n | · x n 1 x n 2 ] + | δ n | · x n 1 x n 2 [ 2 x n p   +   | δ n | · x n 1 x n 2 ] + 2 α n S ( p ) p , x n + 1 p ( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 .
Since
θ_n ‖x_n − x_{n−1}‖ = α_n · (θ_n/α_n) ‖x_n − x_{n−1}‖ → 0 as n → ∞,
and
|δ_n| · ‖x_{n−1} − x_{n−2}‖ = α_n · (|δ_n|/α_n) ‖x_{n−1} − x_{n−2}‖ → 0 as n → ∞,
there exist the positive constants M 3 and M 4 , such that for all n 1 ,
θ n x n x n 1 M 3 ,
| δ n | · x n 1 x n 2     M 4 .
We may deduce from (24) that for all n 1 ,
x n + 1 p 2 ( 1 α n ( 1 k ) ) x n p 2 + 5 M 5 θ n x n x n 1 2 + 3 M 5 | δ n | · x n 1 x n 2 + 2 α n S ( p ) p , x n + 1 p ( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 = ( 1 α n ( 1 k ) ) x n p 2 + α n ( 1 k ) [ 5 M 5 1 k · θ n α n x n x n 1 2 + 3 M 5 1 k · | δ n | α n · x n 1 x n 2 + 2 1 k S ( p ) p , x n + 1 p ] ( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 ,
where M 5 = max { sup n x n p , M 3 , M 4 } .
From (27), we set
a n : = x n p 2 ,
b n : = 1 1 k [ 5 M 5 θ n α n x n x n 1 2 + 3 M 5 | δ n | α n x n 1 x n 2 + 2 S ( p ) p , x n + 1 p ]
and
ε n : = α n ( 1 k ) .
Hence, we obtain
a_{n+1} ≤ (1 − ε_n) a_n + ε_n b_n.
Suppose there is a subsequence { a n i } of { a n } , satisfying
lim inf i ( a n i + 1 a n i ) 0 .
It follows from (27) that
( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 a n a n + 1 + α n ( 1 k ) M ,
where M = sup n b n .
By conditions (ii), (iii), and (iv), we obtain
lim sup i ( 1 α n ) τ n β n ( 1 β n ) x n J γ n B ( I γ n A ) z n 2 lim sup i ( a n i a n i + 1 + ( 1 k ) M lim i α n i = lim inf i ( a n i a n i + 1 ) 0 ,
which implies
lim_{i→∞} ‖x_{n_i} − J_{γ_{n_i} B}(I − γ_{n_i} A) z_{n_i}‖² = 0.
From (8), we have
y n i x n i   +   θ n i x n i x n i 1 + | δ n i | · x n i 1 x n i 2 0 ,
as i .
From (9) and (31), we have
z n i y n i   +   η n i x n i y n i 0 ,
and
z n i x n i   +   ( 1 η n i ) y n i x n i 0 ,
as i .
From (10), (30), and (32), we have
v n i z n i = ( 1 β n i ) x n i + β n i J γ n i B ( I γ n i A ) z n i z n i ( 1 β n i ) x n i z n i   +   β n i J γ n i B ( I γ n i A ) z n i z n i 0 ,
and
v n i x n i = ( 1 β n i ) x n i + β n i J γ n i B ( I γ n i A ) z n i x n i β n i x n i J γ n i B ( I γ n i A ) z n i 0 ,
as i .
Using (30), (32), (34), and (35), we obtain
w n i v n i = ( 1 τ n i ) J γ n i B ( I γ n i A ) y n i + τ n i J γ n i B ( I γ n i A ) v n i v n i ( 1 τ n i ) J γ n i B ( I γ n i A ) y n i v n i + τ n i J γ n i B ( I γ n i A ) v n i v n i ( 1 τ n i ) J γ n i B ( I γ n i A ) y n i J γ n i B ( I γ n i A ) z n i + ( 1 τ n i ) J γ n i B ( I γ n i A ) z n i v n i + τ n i J γ n i B ( I γ n i A ) v n i J γ n i B ( I γ n i A ) z n i + τ n i J γ n i B ( I γ n i A ) z n i v n i ( 1 τ n i ) y n i z n i   +   J γ n i B ( I γ n i A ) z n i v n i + τ n i v n i z n i ( 1 τ n i ) y n i z n i   +   J γ n i B ( I γ n i A ) z n i x n i + x n i v n i   +   τ n i v n i z n i 0 ,
as i .
Using condition (ii), (31), (32), (34), and (36), we obtain
x n i + 1 x n i = α n i S ( x n i ) + ( 1 α n i ) w n i x n i α n i S ( x n i ) x n i   +   ( 1 α n i ) w n i x n i α n i S ( x n i ) x n i   +   w n i y n i   +   y n i x n i α n i S ( x n i ) x n i   +   w n i z n i   +   z n i y n i + y n i x n i α n i S ( x n i ) x n i   +   w n i v n i   +   v n i z n i + z n i y n i   +   y n i x n i 0 ,
as i .
We next show that lim sup_{i→∞} ⟨S(p) − p, x_{n_i+1} − p⟩ ≤ 0.
Let { x n i j } be a subsequence of { x n i } , such that
lim j S ( p ) p , x n i j + 1 p = lim sup i S ( p ) p , x n i + 1 p .
Since {x_{n_{i_j}}} is bounded, there exists a subsequence {x_{n_{i_{j_k}}}} of {x_{n_{i_j}}} such that x_{n_{i_{j_k}}} ⇀ x ∈ H. Without loss of generality, we may assume that x_{n_{i_j}} ⇀ x.
From γ n i j ( 0 , 2 β ) , we know that the mapping J γ n i j B ( I γ n i j A ) is nonexpansive. Due to (30) and (33), the following result is obtained:
x n i j J γ n i j B ( I γ n i j A ) x n i j x n i j J γ n i j B ( I γ n i j A ) z n i j + J γ n i j B ( I γ n i j A ) z n i j J γ n i j B ( I γ n i j A ) x n i j x n i j J γ n i j B ( I γ n i j A ) z n i j   +   z n i j x n i j 0 as   j .
Using Lemmas 2 and 3, we obtain x ∈ Fix(J_{γ_{n_{i_j}} B}(I − γ_{n_{i_j}} A)) = (A + B)⁻¹(0).
Since lim_{i→∞} ‖x_{n_i+1} − x_{n_i}‖ = 0, we have x_{n_{i_j}+1} ⇀ x. Since p = P_{(A+B)⁻¹(0)} S(p), it follows from Proposition 1 that
lim sup_{i→∞} ⟨S(p) − p, x_{n_i+1} − p⟩ = lim_{j→∞} ⟨S(p) − p, x_{n_{i_j}+1} − p⟩ = ⟨S(p) − p, x − p⟩ ≤ 0.
By Lemma 4, we can conclude that x_n → p.    □
Remark 1.
Note that Algorithm 7 is a modification of Algorithm 6 in order to accelerate its convergence by using a double inertial step at the first step and the viscosity approximation method at the final step. Moreover, our proposed algorithm has a strong convergence result while Algorithm 6 obtained only weak convergence. Furthermore, Algorithm 7 can be applied to solve convex bilevel optimization problems, as seen in Theorem 2, while Algorithm 6 cannot be used to solve such problems.
Next, we employ the Bilevel Double Inertial Forward–Backward Algorithm (BDIFBA, Algorithm 8) to solve the convex bilevel optimization Problem (1), obtained by replacing A and B in Algorithm 7 with ∇f and ∂g, respectively, and taking S = I − s∇ϕ.
Algorithm 8 BDIFBA
  • Initialization: Choose x_{−1}, x_0, x_1 ∈ H, {γ_n} ⊂ (0, 2β) and {α_n}, {β_n}, {η_n}, {τ_n} ⊂ (0, 1). Take {ξ_n}, {μ_n} ⊂ (0, ∞) and {ρ_n} ⊂ (−∞, 0).
  • Iterative steps: For n ≥ 1, calculate x_{n+1} as follows:
  • Step 1. Compute the inertial step:
    θ_n = min{ μ_n, ξ_n/‖x_n − x_{n−1}‖ } if x_n ≠ x_{n−1};  μ_n otherwise,
    and
    δ_n = max{ ρ_n, −ξ_n/‖x_{n−1} − x_{n−2}‖ } if x_{n−1} ≠ x_{n−2};  ρ_n otherwise.
  • Step 2. Compute
    y_n = x_n + θ_n (x_n − x_{n−1}) + δ_n (x_{n−1} − x_{n−2}),
    z_n = y_n + η_n (x_n − y_n),
    v_n = (1 − β_n) x_n + β_n prox_{γ_n g}(I − γ_n ∇f) z_n,
    w_n = (1 − τ_n) prox_{γ_n g}(I − γ_n ∇f) y_n + τ_n prox_{γ_n g}(I − γ_n ∇f) v_n,
    x_{n+1} = α_n (I − s∇ϕ)(x_n) + (1 − α_n) w_n.
    Set n := n + 1 and return to Step 1.
The following result is obtained directly by Theorem 1.
Theorem 2.
Let {x_n} be a sequence generated by Algorithm 8 under the same conditions as in Theorem 1. Then, x_n → p ∈ Γ, where p = P_Γ S(p) with S = I − s∇ϕ; in particular, p ∈ Ω, i.e., p solves Problem (1).
Proof. 
Set S = I − s∇ϕ in Theorem 1. From Proposition 2, we know that I − s∇ϕ is a contraction. By Theorem 1, we obtain that x_n → p ∈ Γ, where p = P_Γ S(p). From Proposition 1, it follows that for any x ∈ Γ,
0 ≥ ⟨S(p) − p, x − p⟩ = ⟨(p − s∇ϕ(p)) − p, x − p⟩ = −s⟨∇ϕ(p), x − p⟩;
hence ⟨∇ϕ(p), x − p⟩ ≥ 0 for all x ∈ Γ, that is, p ∈ Ω.    □

4. Application

In this section, we apply our proposed algorithm to improve the training of deep learning models by reformulating their training tasks as structured convex optimization problems. Our approach is based on fixed-point theory, which provides strong theoretical guarantees for convergence and solution reliability. This makes the training process more stable, efficient, and robust, especially in the presence of noise or ill-conditioned data.
We focus on a class of models called Extreme Learning Machines (ELM) and their deeper extensions, Two-Hidden-Layer ELM (TELM). These models are known for their fast training and competitive accuracy. Unlike traditional neural networks, ELMs randomly assign hidden layer weights and only compute output weights, typically by solving a least-squares problem.
However, when the hidden layer output matrix is ill-conditioned or the data is noisy, direct pseudoinverse computations become unstable and prone to overfitting. To address this, we reformulate the training process as a convex minimization problem with regularization. This structure naturally fits into the framework of fixed-point problems, allowing us to apply our algorithm without relying on explicit matrix inversion.

4.1. Application to ELM

ELM is a neural network model initially proposed by Huang et al. [17]. ELM is well-known for its rapid training capability and strong generalization performance. By integrating our algorithm into the ELM framework, we aim to boost both optimization efficiency and predictive accuracy.
Let us define the training dataset as {(x_i, t_i) ∈ ℝ^n × ℝ^m : i = 1, 2, …, s}, consisting of s input–target pairs, where x_i denotes the input vector and t_i denotes the associated target output.
ELM is designed for Single-Layer Feedforward Networks (SLFNs) and operates based on the following functional form:
o_i = Σ_{j=1}^h η_j G(⟨ω_j, x_i⟩ + b_j), i = 1, …, s,
where o_i is the predicted output, h denotes the number of hidden neurons, G(·) is the activation function, ω_j and η_j are the weight vectors for the input and output connections of the j-th hidden node, and b_j is the corresponding bias term.
Let the hidden layer output matrix H ∈ ℝ^{s×h} be defined as
H_{ij} = G(⟨ω_j, x_i⟩ + b_j), i = 1, …, s, j = 1, …, h.
The training objective is to find a solution that best approximates the target output:
t_i = Σ_{j=1}^h η_j G(⟨ω_j, x_i⟩ + b_j), i = 1, 2, …, s,
which can be compactly written in matrix form as
Hu = T,
where u = [η_1ᵀ, …, η_hᵀ]ᵀ is the output weight vector and T = [t_1ᵀ, …, t_sᵀ]ᵀ is the desired output matrix.
To enhance generalization and reduce overfitting, a LASSO regularization term is introduced. The resulting optimization problem becomes
min_u ‖Hu − T‖₂² + λ‖u‖₁,
where ‖·‖₁ denotes the ℓ₁-norm and λ > 0 is a regularization coefficient that controls sparsity.
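A minimal Python sketch of this ELM setup, assuming a sigmoid activation and leaving the LASSO subproblem to a splitting solver such as the DIVFBA sketch given in Section 3; the toy data and all variable names below are our assumptions.

```python
import numpy as np

def elm_hidden_matrix(X, h, rng):
    """Build the ELM hidden-layer output matrix H[i, j] = G(<w_j, x_i> + b_j)
    with randomly drawn input weights and biases and a sigmoid activation G."""
    W = rng.standard_normal((X.shape[1], h))        # random input weights w_j
    b = rng.standard_normal(h)                      # random biases b_j
    return 1.0 / (1.0 + np.exp(-(X @ W + b))), W, b

# Usage sketch: assemble H for toy data, then solve min ||Hu - T||^2 + lam*||u||_1
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 9))                   # toy inputs (100 samples, 9 features)
T = rng.integers(0, 2, size=100).astype(float)      # toy binary targets
H, W, b = elm_hidden_matrix(X, h=30, rng=rng)
# u = divfba(H, T, lam=1e-5)   # output weights via the splitting solver sketched earlier
```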

4.2. Application to TELM

TELM is an extension of the traditional ELM that improves learning capacity by incorporating two hidden layers. Unlike conventional backpropagation-based multi-layer networks, TELM retains the fast training characteristics of ELM by leveraging analytic solutions in both stages. It is particularly suitable for modeling complex nonlinear relationships in high-dimensional data while avoiding the computational cost of iterative optimization.
A work by Janngam et al. [18] demonstrated that TELM, when trained using their proposed algorithm, not only converges significantly faster than standard ELM but also achieves higher classification accuracy on various medical and benchmark datasets. Additionally, earlier work by Qu et al. [19] showed that TELM consistently outperforms traditional ELM, especially in nonlinear and high-dimensional settings, by yielding better average accuracy with fewer hidden neurons.
These cumulative findings reinforce the choice of TELM as the core learning model for our study, particularly when enhanced with the proposed algorithm.
Let the training set be defined as {(x_i, t_i) ∈ ℝ^n × ℝ^m : i = 1, 2, …, s}, where x_i is the input vector and t_i is the corresponding target output.
 
Stage 1: Initial Feature Transformation and Output Weights.
To simplify the initialization process, TELM begins by temporarily combining the two hidden layers into a single equivalent hidden layer. The combined hidden-layer matrix H is defined as
H = G(XW + B),
where X ∈ ℝ^{s×n} is the input matrix, W ∈ ℝ^{n×h} is the randomly initialized weight matrix for the first hidden layer, B ∈ ℝ^{s×h} is the bias matrix, and G(·) is the activation function.
The output weights u ∈ ℝ^{h×m} connecting the hidden layer to the output layer are determined from the linear system
Hu = T,
where T ∈ ℝ^{s×m} is the target matrix.
We find the optimal weight u by using Algorithms 7 and 8 to solve the convex optimization problem with LASSO regularization:
min_u ‖Hu − T‖₂² + λ‖u‖₁,
where λ > 0 is the regularization parameter that controls model complexity and prevents overfitting.
 
Stage 2: Separation and Refinement of Hidden Layers.
After computing the initial output weights u from the first stage using (52), the two hidden layers are separated to allow independent refinement.
To estimate the expected output of the second hidden layer, denoted as H_1 ∈ ℝ^{s×h}, we require that it satisfies the following equation:
H_1 u = T.
However, rather than computing H_1 directly via matrix inversion, we apply our proposed algorithm to solve the following convex optimization problem with LASSO regularization:
min_{H_1} ‖H_1 u − T‖₂² + λ‖H_1‖₁,
where λ > 0 is the regularization parameter.
Next, TELM updates the weights and biases between the first and second hidden layers, denoted as W_1 ∈ ℝ^{h×h} and B_1 ∈ ℝ^{s×h}, respectively, using the expected output H_1 from (57). Ideally, the following equation describes the connection between the layers:
H_1 = G(H W_1 + B_1).
However, since both W_1 and B_1 are unknown, solving (55) directly is not feasible. To address this, we reformulate the equation as
H_1 = G(H_E W_HE),
where H_E = [1  H] ∈ ℝ^{s×(h+1)} is the extended input matrix and W_HE = [B_1  W_1]ᵀ ∈ ℝ^{(h+1)×h} combines the weights and biases into a single matrix.
To estimate W_HE, we solve the following convex optimization problem with LASSO regularization:
min_{W_HE} ‖H_E W_HE − G⁻¹(H_1)‖₂² + λ‖W_HE‖₁,
where G⁻¹(·) denotes the inverse of the activation function G and λ > 0 is the regularization parameter.
Finally, using the estimated W_HE from (57), the refined output of the second hidden layer H_2 is computed as
H_2 = G(H_E W_HE),
where H_2 ∈ ℝ^{s×h} represents the updated output of the second hidden layer after adjusting the weights and biases.
 
Final Stage: Output Layer Update.
Finally, TELM updates the output weight matrix u_new ∈ ℝ^{h×m}, which connects the second hidden layer to the output layer, by solving
H_2 u_new = T.
To obtain u_new, we solve the following convex optimization problem using the LASSO technique:
min_{u_new} ‖H_2 u_new − T‖₂² + λ‖u_new‖₁,
where λ > 0 is the regularization parameter. Once u_new is obtained, the predicted output matrix Y ∈ ℝ^{s×m} is computed as
Y = H_2 u_new.
This approach enhances numerical stability and improves the model's ability to handle high-dimensional or noisy real-world data.
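Putting the three stages together, a schematic Python outline of TELM training might look as follows; the sigmoid activation (whose inverse is the logit), the generic LASSO solver `lasso_solve` standing in for Algorithm 7, the clipping constants, and all variable names are assumptions of ours rather than the authors' implementation.

```python
import numpy as np

sigmoid = lambda Z: 1.0 / (1.0 + np.exp(-Z))
# inverse of the sigmoid (logit), clipped away from 0 and 1 for numerical safety
logit = lambda Z: np.log(np.clip(Z, 1e-6, 1 - 1e-6) / (1.0 - np.clip(Z, 1e-6, 1 - 1e-6)))

def telm_train(X, T, h, lasso_solve, lam=1e-5, rng=None):
    """Schematic TELM training. X: (s, d) inputs, T: (s, m) targets.
    lasso_solve(A, B, lam) is assumed to return argmin_Z ||A Z - B||^2 + lam*||Z||_1."""
    rng = rng or np.random.default_rng(0)
    s = X.shape[0]
    # Stage 1: combined hidden layer and initial output weights u
    W = rng.standard_normal((X.shape[1], h))
    B = rng.standard_normal((s, h))
    H = sigmoid(X @ W + B)
    u = lasso_solve(H, T, lam)                       # min ||H u - T||^2 + lam*||u||_1
    # Stage 2: expected second-layer output H1, then combined weights/biases W_HE
    H1 = lasso_solve(u.T, T.T, lam).T                # min ||H1 u - T||^2, solved via transposes
    HE = np.hstack([np.ones((s, 1)), H])             # extended matrix [1  H]
    WHE = lasso_solve(HE, logit(H1), lam)            # min ||HE W_HE - G^{-1}(H1)||^2 + lam*||.||_1
    H2 = sigmoid(HE @ WHE)                           # refined second hidden layer
    # Final stage: new output weights and predictions Y = H2 u_new
    u_new = lasso_solve(H2, T, lam)
    return H2 @ u_new
```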

4.2.1. Experiments: Data Classification for Minimization Problems

Data classification is a fundamental task in machine learning, where the objective is to assign each input sample to one of several predefined categories. Common applications include medical diagnosis, object recognition, and fraud detection. In this work, we apply our proposed algorithm to train TELM for practical classification tasks.
To evaluate classification performance, we conducted experiments on three benchmark datasets and one real-world medical dataset. Each dataset was divided into 70% training and 30% testing sets. The details of the datasets are summarized in Table 1.
  • Breast Cancer Dataset: A widely used dataset containing features extracted from digitized images of breast masses, used to classify tumors as benign or malignant.
  • Heart Disease Dataset: A standard dataset used to predict the presence of heart disease based on clinical attributes.
  • Diabetes Dataset: Contains diagnostic data for predicting the onset of diabetes in patients.
  • Hypertension Dataset: A real-world dataset collected by Sripat Medical Center, Faculty of Medicine, Chiang Mai University.
Table 2 summarizes the parameter settings for each algorithm compared in our experiments.
In addition, the following settings were consistently applied across all experimental setups:
  • Regularization parameter: λ = 10^{−5}.
  • Activation function: sigmoid, g(x) = 1/(1 + e^{−x}).
  • Number of hidden nodes: h = 30.
  • Contraction mapping: S(x) = (1/3)x + 1.
  • In Algorithm 6, θ_n is defined by
    θ_n = θ̄_n/(n² ‖x_n − x_{n−1}‖) if x_n ≠ x_{n−1};  θ̄_n otherwise.
To assess and compare the classification performance of each algorithm, we employed four widely used evaluation metrics: accuracy, precision, recall, and F1-score.
Accuracy measures the proportion of correctly classified samples, both positive and negative, relative to the total number of samples. It is computed as
Accuracy (acc) = (TP + TN)/(TP + TN + FP + FN) × 100,
where TP and TN are the true positives and true negatives, respectively; FP is the number of false positives (incorrectly predicting a patient as diseased); and FN is the number of false negatives (failing to detect a diseased patient).
Precision reflects the proportion of true positives among all instances predicted as positive:
Precision (pre) = TP/(TP + FP).
Recall, or sensitivity, represents the proportion of actual positive cases that are correctly identified:
Recall (rec) = TP/(TP + FN).
F1-score is the harmonic mean of precision and recall, providing a balanced measure of model performance, particularly on imbalanced datasets:
F1-score (F1) = 2 × (pre × rec)/(pre + rec).
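These four quantities follow directly from the confusion-matrix counts; the small Python helper below (our illustration) mirrors the formulas above.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy (%), precision, recall, and F1-score from binary labels (0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn) * 100
    pre = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec > 0 else 0.0
    return acc, pre, rec, f1
```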
The performance of each algorithm is analyzed at the 1000th iteration, as presented in Table 3. Four datasets, breast cancer, heart disease, diabetes, and hypertension, were utilized to evaluate and compare the effectiveness of Algorithms 6 and 7 using standard classification metrics including accuracy, precision, recall, and F1-score on both training and testing data.
The results indicate that Algorithm 7 consistently performs well across all datasets. In particular, in the hypertension dataset, which reflects real-world conditions, Algorithm 7 achieves high accuracy and balanced precision–recall performance. This demonstrates its strong generalization capability and suitability for real-world medical applications that require reliable predictions and low error sensitivity.
To evaluate model performance with respect to both goodness-of-fit and model complexity, we utilize the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These criteria are defined as follows:
  • Akaike Information Criterion (AIC):
    AIC = 2k − 2 ln(L̂),
    where k is the number of estimated parameters in the model and L̂ is the maximum value of the likelihood function.
  • Bayesian Information Criterion (BIC):
    BIC = k ln(n) − 2 ln(L̂),
    where n is the number of observations, k is the number of parameters, and L̂ is the maximum likelihood of the model.
Lower AIC and BIC values indicate better models in terms of balancing accuracy and simplicity.
To assess the consistency of model performance across multiple trials or datasets, we compute the mean and standard deviation (std) of the AIC and BIC values.
  • Mean AIC and BIC:
    Mean AIC = (1/m) Σ_{i=1}^m AIC_i,  Mean BIC = (1/m) Σ_{i=1}^m BIC_i.
  • Standard deviation of AIC and BIC:
    Std AIC = √[ (1/(m − 1)) Σ_{i=1}^m (AIC_i − Mean AIC)² ],
    Std BIC = √[ (1/(m − 1)) Σ_{i=1}^m (BIC_i − Mean BIC)² ].
These statistics indicate the central tendency and dispersion of the AIC and BIC scores, where smaller standard deviations imply more stable model performance across different experiments.
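As an illustration of how these criteria might be computed over m repeated runs, consider the Python sketch below; the Gaussian likelihood derived from the residual sum of squares is our assumption, since the article does not state which likelihood model was used.

```python
import numpy as np

def aic_bic(residuals, k):
    """AIC = 2k - 2 ln(L), BIC = k ln(n) - 2 ln(L) under a Gaussian likelihood (assumption)."""
    n = residuals.size
    rss = float(np.sum(residuals ** 2))
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)   # maximized Gaussian log-likelihood
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

def summarize(values):
    """Mean and sample standard deviation over m trials."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1)
```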
To understand how well each algorithm fits the data without too much complexity, we compare their AIC and BIC values, as shown in Table 4. Both AIC and BIC are commonly used to measure how good a model is; lower values mean that the model is more efficient and avoids overfitting.
The results show that Algorithm 7 gives lower AIC and BIC values than Algorithm 6 for all datasets. This means that Algorithm 7 is simpler and better at handling the data. The difference is most noticeable in the hypertension dataset, which comes from real-world health data. These results confirm that Algorithm 7 is a strong choice for real-world applications, where the model needs to be both accurate and not too complicated.

4.2.2. Application to Convex Bilevel Optimization Problems

The TELM model can also be formulated within the framework of convex bilevel optimization to better capture hierarchical learning structures. In this setting, we interpret the output weight learning (final step of TELM) as the solution to a lower-level convex problem, and the optimization of the hidden transformation weights (e.g., W H E ) as the upper-level objective.
In our TELM-based learning problem, this bilevel formulation arises naturally:
  • The inner problem corresponds to learning the output weights u given the fixed transformation W_HE, and can be cast as a LASSO-type convex minimization:
    f(u) = ‖H_2 u − T‖₂²,  g(u) = λ_1 ‖u‖₁,
    where H_2 is the second hidden layer output and T is the target.
  • The outer problem focuses on optimizing the hidden transformation weights W_HE based on the optimal solution u*(W_HE) from the inner problem. The upper-level loss is given by
    ϕ(W_HE) = (1/2) ‖W_HE‖₂².
Solving this bilevel problem directly is challenging due to the implicit constraint Γ . However, by leveraging our proposed algorithm and proximal operator techniques, we can solve both levels efficiently and with guaranteed convergence under mild assumptions. This makes TELM highly suitable for structured learning tasks where the learning objectives are nested and interdependent.
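In code, the ingredients of this bilevel formulation would look roughly as follows: the gradient of the smooth inner term f, the proximal operator of the nonsmooth inner term g, and the viscosity-type step I − s∇ϕ for the outer objective. This is a sketch under the objectives stated above; the step size s and the function names are our illustrative choices.

```python
import numpy as np

def inner_grad_f(u, H2, T):
    # gradient of f(u) = ||H2 u - T||_2^2
    return 2.0 * H2.T @ (H2 @ u - T)

def inner_prox_g(u, gamma, lam1):
    # prox of gamma * g(u) with g(u) = lam1 * ||u||_1 (componentwise soft-thresholding)
    return np.sign(u) * np.maximum(np.abs(u) - gamma * lam1, 0.0)

def outer_viscosity_step(W_HE, s=1e-3):
    # one step of I - s*grad(phi) with phi(W) = ||W||_2^2 / 2, i.e. grad(phi)(W) = W
    return W_HE - s * W_HE
```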
To assess the performance of Algorithm 8 in solving convex bilevel optimization problems, we conducted experiments on the same datasets used in the convex optimization setting (see Section 4.2.1). These include the breast cancer, heart disease, diabetes, and hypertension datasets, with a 70%/30% split for training and testing, respectively.
We evaluated classification performance using the same metrics—accuracy, precision, recall, and F1-score—to ensure consistency across experiments.
In this bilevel setting, we compared our method against Algorithm 1 (BiGSAM), Algorithm 2 (iBiGSAM), Algorithm 3 (aiBiGSAM), Algorithm 4 (miBiGSAM), and Algorithm 5 (amiBiGSAM).
All algorithms were configured according to the parameter settings summarized in Table 5, ensuring fair and reproducible evaluation across all methods.
In addition, the following settings were consistently applied across all experimental setups:
  • Regularization parameter: λ = 10^{−5}.
  • Activation function: sigmoid, g(x) = 1/(1 + e^{−x}).
  • Number of hidden nodes: h = 30.
To evaluate the effectiveness of the proposed algorithm (Algorithm 8), we conducted experiments on four datasets. Each algorithm was trained for 1000 iterations, and performance was measured in terms of accuracy, precision, recall, and F1-score for both the training and testing phases. The comparative results of all algorithms are summarized in Table 6.
As shown in Table 6, the proposed algorithm (Algorithm 8) consistently outperforms other methods across all datasets in both training and testing phases. In particular, for the breast cancer and diabetes datasets, Algorithm 8 achieves the highest test accuracy and F1-scores, demonstrating its strong generalization capability and classification performance.
Notably, in the hypertension dataset, which represents real-world medical data with high variability and complexity, the proposed method maintains superior accuracy and F1-score compared to baseline algorithms. This highlights the robustness and practical applicability of Algorithm 8 in real-world clinical settings.
Overall, the results support the effectiveness and stability of the proposed algorithm, making it a promising approach for medical classification tasks across diverse domains.
To statistically evaluate the performance of each algorithm, we computed the AIC and BIC values, including their mean and standard deviation, for both the training and testing phases. The experiments were conducted on four datasets: breast cancer, heart disease, diabetes, and hypertension. The summarized results presented in Table 7 serve to compare the statistical efficiency of each algorithm.
According to the results in Table 7, the proposed algorithm (Algorithm 8) consistently shows lower AIC and BIC values across several datasets. This means that the model fits well and is less likely to overfit the data. In particular, in the hypertension dataset, which contains real and complex medical data, Algorithm 8 achieves the lowest and most consistent scores. This shows that the algorithm can handle real-world situations effectively and gives reliable results.
From Table 6 and Table 7, it is evident that Algorithm 8 consistently outperforms all variants of BiG-SAM, including the improved versions (Algorithms 2–5).
In all datasets considered in this work (see Table 6 and Table 7), Algorithm 8 achieves the highest classification performance and also yields the lowest AIC and BIC scores, suggesting a better model fit with lower complexity. Moreover, its standard deviations are relatively small, indicating robustness and stability across different runs. Therefore, Algorithm 8 can be considered the most effective and reliable algorithm among those evaluated.

5. Conclusions

We proposed the Double Inertial Viscosity Forward–Backward Algorithm (DIVFBA) and the Bilevel Double Inertial Forward–Backward Algorithm (BDIFBA), both modifications of Algorithm 6, to solve variational inclusion and convex bilevel optimization problems, respectively. The proposed algorithms ensure strong convergence and achieve higher accuracy and stability than existing algorithms. We applied them to train TELM models, where they consistently outperformed other existing algorithms in terms of the evaluation metrics and the AIC/BIC statistics. These results confirm the effectiveness of the proposed algorithms in practical applications.

Author Contributions

Software, E.P.; writing—original draft, P.S.-j.; writing—review and editing, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by the Fundamental Fund 2025, Chiang Mai University.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This research was partially supported by Chiang Mai University and the Fundamental Fund 2025, Chiang Mai University. The first author would like to thank the CMU Presidential Scholarship for the financial support.

Conflicts of Interest

The authors declare no conflicts of interest and the funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Sabach, S.; Shtern, S. A first order method for solving convex bilevel optimization problems. SIAM J. Optim. 2017, 27, 640–660.
  2. Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
  3. Shehu, Y.; Vuong, P.T.; Zemkoho, A. An inertial extrapolation method for convex simple bilevel optimization. Optim. Methods Softw. 2021, 36, 1–19.
  4. Duan, P.; Zhang, Y. Alternated and multi-step inertial approximation methods for solving convex bilevel optimization problems. Optimization 2023, 72, 2517–2545.
  5. Wattanataweekul, R.; Janngam, K.; Suantai, S. A novel two-step inertial viscosity algorithm for bilevel optimization problems applied to image recovery. Mathematics 2023, 11, 3518.
  6. Sae-jia, P.; Suantai, S. A new two-step inertial algorithm for solving convex bilevel optimization problems with application in data classification problems. AIMS Math. 2024, 9, 8476–8496.
  7. Sae-jia, P.; Suantai, S. A novel accelerated fixed point algorithm for convex bilevel optimization problems with applications to machine learning for data classification. J. Nonlinear Funct. Anal. 2025, 2025, 1–21.
  8. Moudafi, A.; Oliny, M. Convergence of a splitting inertial proximal method for monotone operators. J. Comput. Appl. Math. 2003, 155, 447–454.
  9. Peeyada, P.; Suparatulatorn, R.; Cholamjiak, W. An inertial Mann forward-backward splitting algorithm of variational inclusion problems and its applications. Chaos Solitons Fractals 2022, 158, 112048.
  10. Kesornprom, S.; Peeyada, P.; Cholamjiak, W.; Ngamkhum, T.; Jun-on, N. New iterative method with inertial technique for split variational inclusion problem to classify tpack level of pre-service mathematics teachers. Thai J. Math. 2023, 21, 351–365.
  11. Inkrong, P.; Cholamjiak, P. Multi-inertial forward-backward methods for solving variational inclusion problems and applications in image deblurring. Thai J. Math. 2025, 23, 263–277.
  12. Takahashi, W. Introduction to Nonlinear and Convex Analysis; Yokohama Publishers: Yokohama, Japan, 2009.
  13. Hanjing, A.; Thongpaen, P.; Suantai, S. A new accelerated algorithm with a linesearch technique for convex bilevel optimization problems with applications. AIMS Math. 2024, 9, 22366–22392.
  14. Goebel, K.; Kirk, W.A. Topics in Metric Fixed Point Theory; No. 28; Cambridge University Press: Cambridge, UK, 1990.
  15. López, G.; Martín-Márquez, V.; Wang, F.; Xu, H.K. Forward-backward splitting methods for accretive operators in Banach spaces. Abstr. Appl. Anal. 2012, 2012, 109236.
  16. Saejung, S.; Yotkaew, P. Approximation of zeros of inverse strongly monotone operators in Banach spaces. Nonlinear Anal. Theory Methods Appl. 2012, 75, 742–750.
  17. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
  18. Janngam, K.; Suantai, S.; Wattanataweekul, R. A novel fixed-point based two-step inertial algorithm for convex minimization in deep learning data classification. AIMS Math. 2025, 10, 6209–6232.
  19. Qu, B.Y.; Lang, B.F.; Liang, J.J.; Qin, A.K.; Crisalle, O.D. Two-hidden-layer extreme learning machine for regression and classification. Neurocomputing 2016, 175, 826–834.
Table 1. Summary of datasets used in the experiments.
Dataset | Total Samples | Training | Testing | Features | Classes
Breast Cancer | 683 | 478 | 205 | 9 | 2
Heart Disease | 303 | 212 | 91 | 13 | 2
Diabetes | 436 | 305 | 131 | 8 | 2
Hypertension | 6108 | 4275 | 1833 | 9 | 2
Table 2. Parameter settings for each algorithm.
ParameterAlgorithm 7Algorithm 6
μ n 1 1 n + 1 -
ρ n 1 n -
ξ n 33 × 10 20 n 2 -
β n n 2 n + 1 -
τ n 0.5 + 1 2 n -
α n n 2 n + 1 -
η n 1 n + 2 1 n + 2
γ n 1 L f 1 L f
θ n ¯ - 1 x n x n 1 5 + n 5
Table 3. Performance comparison between Algorithms 6 and 7 on each dataset.
Datasets | Algorithm | Accuracy (%) Train | Accuracy (%) Test | Precision Train | Precision Test | Recall Train | Recall Test | F1 Train | F1 Test
Breast Cancer | Algorithm 7 | 97.2995 | 96.7775 | 0.9576 | 0.9552 | 0.9656 | 0.9540 | 0.9616 | 0.9539
Breast Cancer | Algorithm 6 | 96.8277 | 96.9246 | 0.9414 | 0.9464 | 0.9698 | 0.9705 | 0.9553 | 0.9570
Heart Disease | Algorithm 7 | 84.4153 | 81.4946 | 0.8363 | 0.8141 | 0.8875 | 0.8717 | 0.8611 | 0.8376
Heart Disease | Algorithm 6 | 77.4129 | 77.1828 | 0.7980 | 0.7941 | 0.7886 | 0.8007 | 0.7925 | 0.7931
Diabetes | Algorithm 7 | 92.2782 | 91.7495 | 0.9320 | 0.9279 | 0.9340 | 0.9280 | 0.9330 | 0.9266
Diabetes | Algorithm 6 | 87.6405 | 87.6321 | 0.9534 | 0.9602 | 0.8256 | 0.8208 | 0.8848 | 0.8811
Hypertension | Algorithm 7 | 89.0435 | 88.9163 | 0.8509 | 0.8501 | 0.9241 | 0.9221 | 0.8860 | 0.8846
Hypertension | Algorithm 6 | 87.8447 | 87.4593 | 0.8259 | 0.8225 | 0.9328 | 0.9282 | 0.8761 | 0.8721
Table 4. Statistical comparison of Algorithms 6 and 7 based on AIC and BIC values.
Datasets | Algorithm | Train AIC Mean | Train AIC Std | Train BIC Mean | Train BIC Std | Test AIC Mean | Test AIC Std | Test BIC Mean | Test BIC Std
Breast Cancer | Algorithm 7 | 545.90 | 57.4635 | 1872.24 | 57.4886 | 591.21 | 46.2082 | 1258.38 | 46.0285
Breast Cancer | Algorithm 6 | 1848.79 | 31.1606 | 3175.13 | 31.2089 | 737.36 | 5.8845 | 1404.53 | 5.8996
Heart Disease | Algorithm 7 | 1306.05 | 13.1770 | 2821.57 | 12.8471 | 895.66 | 11.8465 | 1488.29 | 11.3720
Heart Disease | Algorithm 6 | 1579.11 | 5.8859 | 3094.63 | 6.1726 | 921.98 | 1.3402 | 1514.61 | 6.2564
Diabetes | Algorithm 7 | 930.65 | 25.1995 | 2003.17 | 25.2196 | 584.11 | 20.1605 | 1063.36 | 20.7454
Diabetes | Algorithm 6 | 1600.26 | 35.6280 | 2672.77 | 35.7123 | 657.04 | 6.4499 | 1136.29 | 7.1045
Hypertension | Algorithm 7 | 8142.36 | 63.5788 | 10,126.00 | 63.5817 | 1438.92 | 74.3951 | 2763.35 | 74.4419
Hypertension | Algorithm 6 | 14,998.70 | 334.1825 | 16,982.34 | 334.1837 | 2200.95 | 38.0503 | 3525.38 | 38.0505
Table 5. The setting of parameters for each algorithm.
ParametersAlgorithm 8Algorithm 1Algorithm 2Algorithm 3Algorithm 4Algorithm 5
μ n 1 1 n + 1 -----
ρ n 1 n -----
η n 1 n + 2 -----
β n n 2 n + 1 -----
τ n 0.5 + 1 2 n -----
ξ n 33 × 10 20 n 2 - 1 ( n + 1 ) 2 1 ( n + 1 ) 2 1 ( n + 1 ) 2 1 ( n + 1 ) 2
α n n 2 n + 1 1 n + 1 1 n + 1 1 n + 1 1 n + 1 1 n + 1
γ n 1 L f n · 10 5 ( n + 1 ) · L F n · 10 5 ( n + 1 ) · L F 1 L F 1 L F 1 L F
s0.0010.0010.0010.0010.0010.001
α --3333
q----44
Table 6. Performance comparison of algorithms on each dataset.
DatasetsAlgorithmAccuracy (%)PrecisionRecallF1
Train Test Train Test Train Test Train Test
Breast
Cancer
Algorithm 897.423497.36360.94550.94540.98280.98330.96380.9633
Algorithm 196.193395.45610.94490.94950.94700.92070.94530.9320
Algorithm 296.892896.34270.95330.95480.95820.94150.95570.9473
Algorithm 397.055596.77960.94900.94800.96790.96250.95830.9546
Algorithm 497.332097.07160.94850.94470.97680.97500.96240.9591
Algorithm 597.250896.92460.94880.94820.97400.96670.96120.9567
Heart
Disease
Algorithm 887.348684.79570.86270.850.91310.88460.88710.8635
Algorithm 183.205181.16130.85310.84110.83570.81140.84410.8213
Algorithm 286.615083.47310.86290.84190.89700.86580.87940.8496
Algorithm 386.651783.47310.86340.84190.89700.86580.87970.8496
Algorithm 486.285383.79570.85320.83760.90370.87790.87760.8540
Algorithm 586.431183.80650.86370.84280.89160.87210.87740.8532
DiabetesAlgorithm 894.903194.28120.93320.92540.98190.98000.95690.9517
Algorithm 187.334686.96090.95940.95880.81450.80850.88100.8730
Algorithm 294.036793.58880.91860.91770.98360.97600.95000.9458
Algorithm 394.036793.58880.91860.91770.98360.97600.95000.9458
Algorithm 493.501191.52750.93410.91470.95440.94020.94410.9264
Algorithm 594.164193.81610.92080.91820.98320.98000.95100.9480
HypertensionAlgorithm 889.054488.99800.87140.86980.89450.89550.88270.8822
Algorithm 165.100865.14160.60710.60750.58950.59530.59340.5964
Algorithm 288.017588.01570.83130.83110.92820.92890.87710.8772
Algorithm 388.017588.01570.83130.83110.92820.92890.87710.8772
Algorithm 487.928487.90110.82790.82780.93150.93140.87670.8765
Algorithm 587.939387.96660.82880.82900.93030.93100.87660.8770
Table 7. Comparison of AIC and BIC scores (mean and standard deviation) for all algorithms.
DatasetsAlgorithmTrainTest
AIC Mean AIC Std BIC Mean BIC Std AIC Mean AIC Std BIC Mean BIC Std
Breast
Cancer
Algorithm 8519.387756.38681845.727956.3719584.622733.49391251.788933.2445
Algorithm 12238.5854200.31853564.9257200.2692782.986219.96211450.152420.5936
Algorithm 2635.470458.16491961.810658.1312605.955633.98591273.121834.3455
Algorithm 3635.688156.01381962.028355.9914605.955633.98591273.121834.3455
Algorithm 4540.884550.30971867.224750.3208589.642534.45051256.808734.2693
Algorithm 5558.577335.30921884.917535.3339598.352834.69551265.519134.1824
Heart
Disease
Algorithm 81238.630516.53342754.146216.5096887.819010.75771480.453411.6136
Algorithm 11609.63903.45813125.15483.8379925.78431.27511518.41877.5333
Algorithm 21253.481711.98952768.997411.8911890.070010.71551482.704411.1995
Algorithm 31253.481711.98952768.997411.8911890.396610.74701483.031011.5066
Algorithm 41261.573412.52412777.089212.2814889.831411.55171482.465811.8088
Algorithm 51258.980313.12102774.496112.7994890.080510.73641482.714911.6097
DiabetesAlgorithm 8756.023321.23321828.539221.1804562.442625.21201041.690926.6428
Algorithm 11479.645316.50492552.161216.7075643.34546.25911122.59387.2984
Algorithm 2818.298726.20851890.814626.2338567.968824.26711047.217225.1373
Algorithm 3818.298726.20851890.814626.2338567.968824.26711047.217225.1373
Algorithm 4852.571129.20811925.086929.0782579.119019.46711058.367320.9319
Algorithm 5818.423024.73341890.938824.8507567.241922.92621046.490323.5958
HypertensionAlgorithm 87848.5782.949832.1782.941403.9375.192728.3675.19
Algorithm 114,416.872736.2316400.472736.242145.51286.743469.94286.68
Algorithm 28350.1663.6810333.7663.681457.4471.742781.8771.73
Algorithm 38348.9361.9410332.5361.941457.4471.742781.8771.73
Algorithm 48414.8661.8410398.4661.841464.7469.332789.1769.34
Algorithm 58402.4762.2910386.0762.291467.4163.792791.8463.80