Proximal Gradient Method for Solving Bilevel Optimization Problems

Abstract: In this paper, we consider a bilevel optimization problem as a task of finding the optimum of the upper-level problem subject to the solution set of the split feasibility problem of fixed point problems and optimization problems. Based on proximal and gradient methods, we propose a strongly convergent iterative algorithm with an inertial effect for solving the bilevel optimization problem under consideration. Furthermore, we present a numerical example to illustrate the applicability of our algorithm.


Introduction
Let H be a real Hilbert space and consider the constrained minimization problem:

$$\min_{x \in C} h(x), \tag{1}$$

where C is a nonempty closed convex subset of H and $h : H \to \mathbb{R}$ is a convex and continuously differentiable function. The gradient-projection algorithm (GPA, for short) is usually applied to solve the minimization problem (1) and has been studied extensively by many authors; see, for instance, [1–3] and the references therein. This algorithm generates a sequence $\{x_n\}$ through the recursion:

$$x_{n+1} = P_C\big(x_n - \gamma \nabla h(x_n)\big), \quad n \ge 0, \tag{2}$$

where $\nabla h$ is the gradient of h, $x_0$ is the initial guess chosen arbitrarily from C, $\gamma$ is a stepsize which may be chosen in different ways, and $P_C$ is the metric projection from H onto C. By the optimality condition for problem (1), it follows that $\bar{x} \in C$ solves (1) if and only if $\langle \nabla h(\bar{x}), y - \bar{x} \rangle \ge 0$ for all $y \in C$.
If $\nabla h$ is Lipschitz continuous and strongly monotone, i.e., there exist $L_h > 0$ and $\sigma > 0$ such that for all $x, y \in H$, $\|\nabla h(x) - \nabla h(y)\| \le L_h \|x - y\|$ and $\langle \nabla h(x) - \nabla h(y), x - y \rangle \ge \sigma \|x - y\|^2$, then the operator $T_\gamma = P_C(I - \gamma \nabla h)$ is a contraction provided that $0 < \gamma < \frac{2\sigma}{L_h^2}$. Therefore, for such $\gamma$, we can apply Banach's contraction principle to conclude that the sequence $\{x_n\}$ defined by (2) converges strongly to the unique fixed point of $T_\gamma$ (that is, the unique solution of the minimization problem (1)). Moreover, if we set C = H in (1), then we have an unconstrained optimization problem, and hence the gradient algorithm $x_{n+1} = x_n - \gamma \nabla h(x_n)$ generates a sequence $\{x_n\}$ strongly convergent to the global minimizer of h.
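To make the recursion (2) concrete, here is a minimal sketch of the GPA, assuming an illustrative quadratic h and a box constraint C; none of these data come from the paper.

```python
import numpy as np

# Minimal gradient-projection sketch: minimize h(x) = 0.5*||x - b||^2
# over the box C = [0, 1]^3. Both h and C are illustrative choices.
b = np.array([2.0, -1.0, 0.5])
grad_h = lambda x: x - b                 # nabla h; here L_h = sigma = 1
proj_C = lambda x: np.clip(x, 0.0, 1.0)  # metric projection onto the box

gamma = 1.0   # any gamma in (0, 2*sigma/L_h^2) = (0, 2) gives a contraction
x = np.zeros(3)
for n in range(100):
    x = proj_C(x - gamma * grad_h(x))    # recursion (2)
print(x)      # converges to P_C(b) = [1, 0, 0.5]
```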
Consider another well-known problem, the unconstrained minimization problem:

$$\min_{x \in H} g(x), \tag{3}$$

where H is a real Hilbert space and $g : H \to \mathbb{R} \cup \{+\infty\}$ is a proper, convex, lower semicontinuous function. An analogous method for solving (3) with better properties is based on the notion of the proximal mapping introduced by Moreau [4], i.e., the proximal operator of the function g with scaling parameter $\lambda > 0$ is the mapping $\mathrm{prox}_{\lambda g} : H \to H$ given by

$$\mathrm{prox}_{\lambda g}(x) = \arg\min_{y \in H} \Big\{ g(y) + \frac{1}{2\lambda} \|x - y\|^2 \Big\}.$$
Proximal operators are firmly nonexpansive, and the optimality condition for (3) is: $\bar{x} \in H$ solves (3) if and only if $\mathrm{prox}_{\lambda g}(\bar{x}) = \bar{x}$.
Many properties of the proximal operator can be found in [5] and the references therein. The so-called proximal point algorithm, i.e., $x_{n+1} = \mathrm{prox}_{\lambda g}(x_n)$, is the most popular method for solving the optimization problem (3) (introduced by Martinet [6,7] and later studied by Rockafellar [8]). The split inverse problem (SIP) [9] is formulated by linking problems installed in two different spaces X and Y connected by a linear transformation, i.e., the SIP is a problem of finding a point in the space X that solves a problem IP1 installed in X and whose image under the linear transformation solves a problem IP2 installed in another space Y. A step-size choice dependent on the operator norm is not recommended in iterative methods for solving SIPs, since it is not always easy to estimate the norm of an operator; see, for example, the theorem of Hendrickx and Olshevsky in [10]. For instance, in early studies of iterative methods for solving the split feasibility problem [11–13], the determination of the step size depends on the operator norm (or at least an estimate of it), and this is not an easy task. To overcome this difficulty, Lopez et al. [14] introduced a new way of selecting the step sizes for which knowledge of the operator norm is not necessary for solving the split feasibility problem (SFP):

find $\bar{x} \in C$ such that $A\bar{x} \in Q$,

where C and Q are closed convex subsets of real Hilbert spaces $H_1$ and $H_2$, respectively, and $A : H_1 \to H_2$ is a bounded linear operator. To be precise, Lopez et al. [14] introduced an iterative algorithm that generates a sequence $\{x_n\}$ by

$$x_{n+1} = P_C\big(x_n - \gamma_n A^*(I - P_Q)Ax_n\big), \tag{4}$$

where the parameter $\gamma_n$ in (4) is given by $\gamma_n = \frac{\rho_n\, l(x_n)}{\|\nabla l(x_n)\|^2}$ for all $n \ge 1$, with $\rho_n \in (0, 4)$, $l(x_n) = \frac{1}{2}\|(I - P_Q)Ax_n\|^2$, and $\nabla l(x_n) = A^*(I - P_Q)Ax_n$.
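To illustrate the norm-free step-size rule of Lopez et al. [14], here is a minimal sketch assuming a toy SFP in which C and Q are boxes and A is a random matrix; all data are illustrative choices of ours, not from the paper.

```python
import numpy as np

# Sketch of the self-adaptive step size for the SFP: find x in C with Ax in Q.
# C = [0,1]^3, Q = [0,1]^4 and the matrix A are illustrative choices.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
proj_C = lambda x: np.clip(x, 0.0, 1.0)
proj_Q = lambda y: np.clip(y, 0.0, 1.0)

x = rng.standard_normal(3)
for n in range(200):
    r = A @ x - proj_Q(A @ x)          # (I - P_Q)Ax_n
    l = 0.5 * np.dot(r, r)             # l(x_n)
    grad_l = A.T @ r                   # A*(I - P_Q)Ax_n
    if l < 1e-12:                      # Ax_n already (nearly) in Q
        break
    rho = 2.0                          # rho_n in (0, 4)
    gamma = rho * l / np.dot(grad_l, grad_l)   # no ||A|| needed
    x = proj_C(x - gamma * grad_l)
print(x, A @ x)   # approximates a solution if the SFP is consistent
```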
A bilevel problem is a two-level hierarchical problem such that the solution of the lower-level problem determines the feasible set of the upper-level problem. In general, Yimer et al. [15] presented a bilevel problem as an archetypal model given by:

find $\bar{x} \in S \subset X$ that solves problem P1 installed in a space X, (5)

where S is the solution set of the problem:

find $x^* \in Y \subset X$ that solves problem P2 installed in the space X. (6)
According to [16], the bilevel problem (problems (5) and (6)) is a hierarchical game of two players, as decision makers who make their decisions according to a hierarchical order. It is also called the leader's and follower's problem, where problem (5) is the leader's problem and (6) is the follower's problem; that is, the first player (called the leader) makes his selection first and communicates it to the second player (the so-called follower). There are many studies of several types of bilevel problems; see, for example, [15,17–24]. The bilevel optimization problem is a bilevel problem in which the hierarchical structure involves optimization problems. Bilevel optimization problems have become an increasingly important class of optimization problems during the last few years and decades due to their vast applicability in solving real-life problems, for example, the toll-setting problem [25], chemical engineering [26], electricity markets [27], and supply chain problems [28]. Motivated by the above theoretical results and inspired by the applicability of the bilevel problem, we consider the following bilevel optimization problem:

$$\min h(x) \quad \text{subject to } x \in \bigcap_{i=1}^{N} \mathrm{Fix}(U_i) \ \text{ and } \ Ax \in \bigcap_{j=1}^{M} \arg\min g_j, \tag{7}$$

where $h : H_1 \to \mathbb{R}$ is convex and continuously differentiable, $A : H_1 \to H_2$ is a bounded linear operator, each $g_j : H_2 \to \mathbb{R} \cup \{+\infty\}$ is a proper, convex, lower semicontinuous function, and each $U_i : H_1 \to H_1$ is an $\omega_i$-demimetric mapping, i.e., $\mathrm{Fix}(U_i) \ne \emptyset$, $\omega_i \in (-\infty, 1)$, and

$$\langle x - p, J(x - U_i x) \rangle \ge \frac{1 - \omega_i}{2} \|x - U_i x\|^2, \quad \forall x \in H_1, \ p \in \mathrm{Fix}(U_i). \tag{8}$$

The demimetric mapping was introduced by Takahashi [29] in a smooth, strictly convex, and reflexive Banach space, where J denotes the duality mapping. For a real Hilbert space H (where J = I), (8) is equivalent to the following:

$$\|Ux - p\|^2 \le \|x - p\|^2 + \omega \|x - Ux\|^2, \quad \forall x \in H, \ p \in \mathrm{Fix}(U),$$

and $\mathrm{Fix}(U)$ is a closed and convex subset of H [29]. The class of demimetric mappings contains the classes of strict pseudocontractions, firmly quasi-nonexpansive mappings, and quasi-nonexpansive mappings; see [29,30] and the references therein. Assume that $\Omega$ is the set of solutions of the lower-level problem of the bilevel optimization problem (7), that is,

$$\Omega = \Big\{ x \in \bigcap_{i=1}^{N} \mathrm{Fix}(U_i) : Ax \in \bigcap_{j=1}^{M} \arg\min g_j \Big\}. \tag{9}$$

Therefore, the bilevel optimization problem (7) is simply $\min_{x \in \Omega} h(x)$, where $\Omega$ is given by (9). If $H_1 = H_2 = H$, $A = I$ (the identity operator), and $g = g_j$ for all $j \in \{1, \ldots, M\}$, then problem (7) reduces to the bilevel optimization problem:

$$\min h(x) \quad \text{subject to } x \in \bigcap_{i=1}^{N} \mathrm{Fix}(U_i) \cap \arg\min g. \tag{10}$$

Bilevel problems like (10) have already been considered in the literature, for example, in [23,31,32] for the case $H = \mathbb{R}^p$. Note that, to the best of our knowledge, the bilevel optimization problem (7), with a finite intersection of fixed point sets of the broadest class of nonlinear mappings and a finite intersection of minimizer sets of nonsmooth functions as the lower level, has not been addressed before.
An inertial algorithm is a two-step iterative method in which the next iterate is defined by making use of the previous two iterates. It was first introduced by Polyak [33] as an acceleration process for solving smooth convex minimization problems. It is well known that combining algorithms with an inertial term speeds up the rate of convergence of the sequence generated by the algorithm. In this paper, we introduce a proximal gradient inertial algorithm with a strong convergence result for approximating a solution of the bilevel optimization problem (7), where our algorithm is designed with a step-size selection rule whose implementation does not need any prior information about the operator norm.
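For intuition, the following sketch implements Polyak's heavy-ball iteration $x_{n+1} = x_n - \gamma \nabla h(x_n) + \theta(x_n - x_{n-1})$ on an illustrative quadratic; the values of $\gamma$ and $\theta$ are ad hoc choices, not the parameter sequences of the algorithm proposed in this paper.

```python
import numpy as np

# Heavy-ball (inertial) gradient sketch:
#   x_{n+1} = x_n - gamma*grad(x_n) + theta*(x_n - x_{n-1}).
# The quadratic objective and parameter values are illustrative only.
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda x: Q @ x                    # minimizer is x* = 0
gamma, theta = 0.25, 0.5
x_prev = x = np.array([4.0, -2.0])
for n in range(100):
    # the tuple assignment uses the previous two iterates, then shifts them
    x, x_prev = x - gamma * grad(x) + theta * (x - x_prev), x
print(x)    # close to the minimizer 0
```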

Preliminaries
Let C be a nonempty closed convex subset of a real Hilbert space H. The metric projection onto C is the mapping $P_C : H \to C$ defined by

$$P_C(x) = \arg\min_{y \in C} \|x - y\|.$$

For $x \in H$ and $z \in C$, we have $z = P_C(x)$ if and only if $\langle x - z, y - z \rangle \le 0$ for all $y \in C$.
Definition 1. Let $T : H \to H$ be a mapping.
(a) T is Lipschitz continuous with constant L > 0 if $\|Tx - Ty\| \le L\|x - y\|$ for all $x, y \in H$. If $L \in (0, 1)$, then we call T a contraction with constant L. If L = 1, then T is called a nonexpansive mapping.
(b) T is strongly monotone if there exists $\sigma > 0$ such that $\langle Tx - Ty, x - y \rangle \ge \sigma \|x - y\|^2$ for all $x, y \in H$. In this case, T is called $\sigma$-strongly monotone.
(c) T is firmly nonexpansive if $\|Tx - Ty\|^2 \le \langle x - y, Tx - Ty \rangle$ for all $x, y \in H$, which is equivalent to $\|Tx - Ty\|^2 \le \|x - y\|^2 - \|(I - T)x - (I - T)y\|^2$ for all $x, y \in H$. If T is firmly nonexpansive, then $I - T$ is also firmly nonexpansive.
Let H be a real Hilbert space. If $G : H \to 2^H$ is a maximal monotone set-valued mapping, then we define the resolvent operator $J_\lambda^G$ associated with G and $\lambda > 0$ by

$$J_\lambda^G(x) = (I + \lambda G)^{-1}(x), \quad x \in H.$$

It is well known that $J_\lambda^G$ is single-valued, nonexpansive, and 1-inverse strongly monotone (firmly nonexpansive). Moreover, $0 \in G(\bar{x})$ if and only if $\bar{x}$ is a fixed point of $J_\lambda^G$ for all $\lambda > 0$; see [34] for more about maximal monotone operators, their associated resolvent operators, and examples of maximal monotone operators.
The subdifferential of a convex function $f : H \to \mathbb{R} \cup \{+\infty\}$ at x is the set

$$\partial f(x) = \{ u \in H : f(y) \ge f(x) + \langle u, y - x \rangle, \ \forall y \in H \}.$$

If f is differentiable at x, then $\partial f(x) = \{\nabla f(x)\}$, i.e., the subdifferential reduces to the gradient of f. If f is a proper, convex, lower semicontinuous function, the subdifferential operator is a maximal monotone operator, and the proximal operator is the resolvent of the subdifferential operator (see, for example, [5]), i.e.,

$$\mathrm{prox}_{\lambda f} = J_\lambda^{\partial f} = (I + \lambda \partial f)^{-1}.$$

Thus, proximal operators are firmly nonexpansive, and a point $\bar{x}$ minimizes f if and only if $\mathrm{prox}_{\lambda f}(\bar{x}) = \bar{x}$.
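As a concrete instance, the sketch below evaluates the proximal operator of $g = \|\cdot\|_1$, whose closed form is componentwise soft thresholding, and checks the fixed-point characterization at the minimizer; the choice $g = \|\cdot\|_1$ is ours for illustration.

```python
import numpy as np

# prox of g(y) = ||y||_1 with parameter lam: componentwise soft thresholding,
#   prox_{lam*g}(x)_t = sign(x_t) * max(|x_t| - lam, 0).
def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([2.0, -0.3, 0.7])
print(prox_l1(x, 1.0))             # [1., -0., 0.]
# Fixed-point check: x_bar minimizes g iff prox_{lam*g}(x_bar) = x_bar;
# the minimizer of ||.||_1 is 0, and indeed prox fixes it.
print(prox_l1(np.zeros(3), 1.0))   # [0., 0., 0.]
```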

Lemma 2.
[35] Let $\{c_n\}$ and $\{\gamma_n\}$ be sequences of nonnegative real numbers and $\{\beta_n\}$ be a sequence of real numbers such that

$$c_{n+1} \le (1 - \alpha_n)c_n + \alpha_n \beta_n + \gamma_n, \quad n \ge 0,$$

where $0 < \alpha_n < 1$ and $\sum \gamma_n < \infty$. If $\sum \alpha_n = \infty$ and $\limsup_{n \to \infty} \beta_n \le 0$, then $\lim_{n \to \infty} c_n = 0$.
Definition 2. Let $\{\Gamma_n\}$ be a real sequence. We say that $\{\Gamma_n\}$ decreases at infinity if there exists $n_0 \in \mathbb{N}$ such that $\Gamma_{n+1} \le \Gamma_n$ for $n \ge n_0$. Conversely, the sequence $\{\Gamma_n\}$ does not decrease at infinity if there exists a subsequence $\{\Gamma_{n_t}\}_{t \ge 1}$ of $\{\Gamma_n\}$ such that $\Gamma_{n_t} < \Gamma_{n_t + 1}$ for all $t \ge 1$.

Lemma 3.
[36] Let $\{\Gamma_n\}$ be a sequence of real numbers that does not decrease at infinity. In addition, consider the sequence of integers $\{\varphi(n)\}_{n \ge n_0}$ defined by

$$\varphi(n) = \max\{ k \in \mathbb{N} : k \le n, \ \Gamma_k < \Gamma_{k+1} \}.$$

Then, $\{\varphi(n)\}_{n \ge n_0}$ is a nondecreasing sequence satisfying $\lim_{n \to \infty} \varphi(n) = \infty$, and, for all $n \ge n_0$, the following two estimates hold:

$$\Gamma_{\varphi(n)} \le \Gamma_{\varphi(n)+1} \quad \text{and} \quad \Gamma_n \le \Gamma_{\varphi(n)+1}.$$

Let D be a closed, convex subset of a real Hilbert space H and $g : D \times D \to \mathbb{R}$ be a bifunction. Then, we say that g satisfies condition CO on D if the following four assumptions are satisfied:
(i) $g(u, u) = 0$ for all $u \in D$;
(ii) g is monotone on D, i.e., $g(u, v) + g(v, u) \le 0$ for all $u, v \in D$;
(iii) for each $u, v, w \in D$, $\limsup_{t \downarrow 0} g(tw + (1 - t)u, v) \le g(u, v)$;
(iv) $g(u, \cdot)$ is convex and lower semicontinuous on D for each $u \in D$.

Lemma 4.
[37] (Lemma 2.12) Let g satisfy condition CO on D. Then, for each $r > 0$ and $u \in H_2$, define a mapping (called the resolvent of g) given by

$$T_r^g(u) = \Big\{ z \in D : g(z, v) + \frac{1}{r} \langle v - z, z - u \rangle \ge 0, \ \forall v \in D \Big\}. \tag{11}$$

Then, the following hold:
(i) $T_r^g$ is single-valued;
(ii) $T_r^g$ is firmly nonexpansive;
(iii) $\mathrm{Fix}(T_r^g) = EP(g)$, where EP(g) denotes the solution set of the equilibrium problem for g;
(iv) EP(g) is closed and convex.

Main Results
Our approach here is based on taking existing algorithms for (1), (3), and the fixed point problem of nonlinear mappings, and determining how they can be used in the setting of the bilevel optimization problem (7) considered in this paper. We present a self-adaptive proximal gradient algorithm with an inertial effect for generating a sequence that converges to the unique solution of the bilevel optimization problem (7) under the following basic assumptions. Assumption 1. Assume that A, h, $g_j$ (j ∈ {1, . . . , M}), and $U_i$ (i ∈ {1, . . . , N}) in the bilevel optimization problem (7) satisfy:
A1. A is a nonzero bounded linear operator;
A2. h is proper, convex, and continuously differentiable, and the gradient $\nabla h$ is a $\sigma$-strongly monotone operator and $L_h$-Lipschitz continuous;
A3. Each $U_i$ is an $\omega_i$-demimetric and demiclosed mapping for all i ∈ {1, . . . , N};
A4. Each $g_j$ is a proper, convex, lower semicontinuous function for all j ∈ {1, . . . , M}.

Assumption 2.
Let $\theta \in [0, 1)$ and $\gamma$ be a real number, and let the real sequences $\{\beta_n\}$, $\{\varepsilon_n\}$, and $\{\rho_n\}$ satisfy conditions (C1)–(C7). Assuming that Assumption 1 is satisfied and the solution set $\Omega$ of the lower-level problem of (7) is nonempty, for each j ∈ {1, . . . , M}, define $l^{(j)}$ by

$$l^{(j)}(x) = \frac{1}{2} \big\| (I - \mathrm{prox}_{\lambda g_j}) Ax \big\|^2.$$

Note that, from Aubin [38], if $g_j$ is an indicator function, then $l^{(j)}$ is convex, weakly lower semicontinuous, and differentiable for each j ∈ {1, . . . , M}, and $\nabla l^{(j)}$ is given by

$$\nabla l^{(j)}(x) = A^* (I - \mathrm{prox}_{\lambda g_j}) Ax.$$

Next, we present and analyze the strong convergence of Algorithm 1 using $l^{(j)}$ and $\nabla l^{(j)}$, assuming that $l^{(j)}$ is differentiable.
Algorithm 1: Self-adaptive proximal gradient algorithm with inertial effect.
Initialization: Let the real number $\gamma$ and the real sequences $\{\beta_n\}$, $\{\varepsilon_n\}$, and $\{\rho_n\}$ satisfy Assumption 2. Choose $x_0, x_1 \in H_1$ arbitrarily and proceed with the following computations:
Step 1. Given the iterates $x_{n-1}$ and $x_n$ ($n \ge 1$), choose $\theta_n$ such that $0 \le \theta_n \le \bar{\theta}_n$, where

$$\bar{\theta}_n = \begin{cases} \min\Big\{ \theta, \dfrac{\varepsilon_n}{\|x_n - x_{n-1}\|} \Big\}, & \text{if } x_n \ne x_{n-1}, \\[2mm] \theta, & \text{otherwise.} \end{cases}$$
Step 6. Set n := n + 1 and go to Step 1.

Remark 1. From Condition (C7) and Step 1 of Algorithm 1, we have that $\frac{\theta_n}{\alpha_n}\|x_n - x_{n-1}\| \to 0$ as $n \to \infty$. Since $\{\alpha_n\}$ is bounded, we also have $\theta_n \|x_n - x_{n-1}\| \to 0$ as $n \to \infty$. Note that Step 1 of Algorithm 1 is easily implemented in numerical computation since the value of $\|x_n - x_{n-1}\|$ is known before choosing $\theta_n$.
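A minimal sketch of how Step 1 might be implemented, assuming the rule $\bar{\theta}_n = \min\{\theta, \varepsilon_n / \|x_n - x_{n-1}\|\}$ for $x_n \ne x_{n-1}$ reconstructed above; the concrete inputs are illustrative, and $\varepsilon_n = 1/(n+1)^2$ is the choice used later in the numerical example.

```python
import numpy as np

# Step 1 of Algorithm 1: choose the inertial parameter theta_n with
# 0 <= theta_n <= min{theta, eps_n/||x_n - x_{n-1}||} whenever x_n != x_{n-1}.
def choose_theta(theta, eps_n, x_n, x_prev):
    gap = np.linalg.norm(x_n - x_prev)
    if gap > 0.0:
        # guarantees theta_n * ||x_n - x_{n-1}|| <= eps_n
        return min(theta, eps_n / gap)
    return theta   # any value in [0, theta] is admissible when x_n = x_{n-1}

theta = 0.5
x_prev, x_n = np.array([1.0, 0.0]), np.array([0.9, 0.1])
n = 10
print(choose_theta(theta, 1.0 / (n + 1) ** 2, x_n, x_prev))
```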
Note that, for all $x, y \in H_1$,

$$\|V_\gamma x - V_\gamma y\|^2 = \|(x - y) - \gamma(\nabla h(x) - \nabla h(y))\|^2 \le \big(1 - \gamma(2\sigma - \gamma L_h^2)\big)\|x - y\|^2 = \mu^2 \|x - y\|^2,$$

where $V_\gamma = I - \gamma \nabla h$ and $\mu = \sqrt{1 - \gamma(2\sigma - \gamma L_h^2)}$. Therefore, for $\gamma \in \big(0, \frac{2\sigma}{L_h^2}\big)$, the mapping $V_\gamma$ is a contraction mapping with constant $\mu$. Consequently, the mapping $P_\Omega V_\gamma$ is also a contraction mapping with constant $\mu$, i.e.,

$$\|P_\Omega V_\gamma x - P_\Omega V_\gamma y\| \le \|V_\gamma x - V_\gamma y\| \le \mu \|x - y\|.$$

Hence, by the Banach contraction principle, there exists a unique element $\bar{x} \in H_1$ such that $\bar{x} = P_\Omega V_\gamma(\bar{x})$. Clearly, $\bar{x} \in \Omega$, and we have

$$\langle \nabla h(\bar{x}), y - \bar{x} \rangle \ge 0, \quad \forall y \in \Omega,$$

that is, $\bar{x}$ is the unique solution of the bilevel optimization problem (7).

Lemma 5. For the sequences $\{s_n\}$, $\{y_n\}$, and $\{z_n\}$ generated by Algorithm 1 and for $\bar{x} \in \Omega$, the estimates (i) and (ii) hold.

Proof. Let $\bar{x} \in \Omega$. Now, since $I - \mathrm{prox}_{\lambda g_j}$ is firmly nonexpansive for each j, and since $A\bar{x}$ is a minimizer of each $g_j$ (so that $(I - \mathrm{prox}_{\lambda g_j})A\bar{x} = 0$), we have, for all $x \in H_1$,

$$\big\langle (I - \mathrm{prox}_{\lambda g_j}) Ax, \ Ax - A\bar{x} \big\rangle \ge \big\| (I - \mathrm{prox}_{\lambda g_j}) Ax \big\|^2. \tag{13}$$

By the definition of $z_n$, we get (14). Using the definition of $y_n$, Lemma 1 (ii), and (13), we have (15). The result (i) follows from (14) and (15), and, in view of (C2)–(C6), the result (ii) also follows from (14) and (15).

Theorem 1.
The sequence {x n } generated by Algorithm 1 converges strongly to the solution of problem (7).

Application to the Bilevel Variational Inequality Problem
Let $H_1$ and $H_2$ be two real Hilbert spaces. Assume that $F : H_1 \to H_1$ is $L_h$-Lipschitz continuous and $\sigma$-strongly monotone on $H_1$, $A : H_1 \to H_2$ is a bounded linear operator, $g_j : H_2 \to \mathbb{R} \cup \{+\infty\}$ is a proper, convex, lower semicontinuous function for all j ∈ {1, . . . , M}, and $U_i : H_1 \to H_1$ is an $\omega_i$-demimetric and demiclosed mapping for all i ∈ {1, . . . , N}. Then, replacing $\nabla h$ by F in Algorithm 1, we obtain a strong convergence result for approximating a solution of the bilevel variational inequality problem

find $\bar{x} \in \Omega$ such that $\langle F\bar{x}, y - \bar{x} \rangle \ge 0$ for all $y \in \Omega$,

where $\Omega$ is given by (9).

Application to a Bilevel Optimization Problem with a Feasibility Set Constraint, Inclusion Constraint, and Equilibrium Constraint
Let $H_1$ and $H_2$ be two real Hilbert spaces, $A : H_1 \to H_2$ be a bounded linear transformation, and $h : H_1 \to \mathbb{R}$ be proper, convex, and continuously differentiable with a gradient $\nabla h$ that is $\sigma$-strongly monotone and $L_h$-Lipschitz continuous. Now, consider the bilevel optimization problem with a feasibility set constraint

$$\min h(x) \quad \text{subject to } Ax \in \bigcap_{j=1}^{M} Q_j, \tag{44}$$

where each $Q_j$ is a closed convex subset of $H_2$ for j ∈ {1, . . . , M}. Setting $U_i = I$ for all i ∈ {1, . . . , N} and replacing $\mathrm{prox}_{\lambda g_j}$ by the projection mapping $P_{Q_j}$ in Algorithm 1, we obtain strong convergence to an approximate solution of the bilevel problem (44). Consider the bilevel optimization problem with an inclusion constraint

$$\min h(x) \quad \text{subject to } 0 \in G_j(Ax), \ j \in \{1, \ldots, M\}, \tag{45}$$

where $G_j : H_2 \to 2^{H_2}$ is a maximal monotone mapping for j ∈ {1, . . . , M}. Setting $U_i = I$ for all i ∈ {1, . . . , N}, replacing the proximal mappings $\mathrm{prox}_{\lambda g_j}$ in Algorithm 1 by the resolvent operators $J_\lambda^{G_j} = (I + \lambda G_j)^{-1}$ (for $\lambda > 0$), and following the method of proof of our theorems, we obtain a strong convergence result for approximating the solution of the bilevel problem (45). Consider the bilevel optimization problem with an equilibrium constraint

$$\min h(x) \quad \text{subject to } Ax \in \bigcap_{j=1}^{M} EP(g_j), \tag{46}$$

where $g_j : H_2 \times H_2 \to \mathbb{R}$ is a bifunction and each $g_j$ satisfies condition CO on $H_2$. We obtain strong convergence results for (46) by setting $U_i = I$ for all i ∈ {1, . . . , N} and replacing the proximal mappings by the resolvent operators $T_r^{g_j}$ in Algorithm 1 (see (11) and its properties in Lemma 4 (i)–(iv)).
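To make the substitutions above concrete, the following minimal sketch shows two of the drop-in replacements for the proximal step in Algorithm 1: a metric projection $P_{Q_j}$ for the feasibility constraint (44), and a resolvent $(I + \lambda G_j)^{-1}$ for the inclusion constraint (45). The box $Q = [-1, 1]^2$ and the linear operator $G(z) = Mz$ are our illustrative choices; the equilibrium resolvent $T_r^{g_j}$ generally requires solving an inner auxiliary problem and is omitted.

```python
import numpy as np

# Drop-in replacements for prox_{lam*g_j} in Algorithm 1 (illustrative operators).

# (44): projection onto Q = [-1, 1]^2
proj_Q = lambda z: np.clip(z, -1.0, 1.0)

# (45): resolvent of the maximal monotone operator G(z) = Mz,
#        J_lam^G(z) = (I + lam*M)^{-1} z
M = np.array([[2.0, 0.0], [0.0, 3.0]])
lam = 1.0
resolvent_G = lambda z: np.linalg.solve(np.eye(2) + lam * M, z)

z = np.array([2.0, -4.0])
print(proj_Q(z))        # [ 1. -1.]
print(resolvent_G(z))   # [ 0.667 -1. ]  (i.e., [2/3, -1])
```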

Numerical Example
Consider the bilevel optimization problem (7) for $H_1 = \mathbb{R}^p$, $H_2 = \mathbb{R}^q$, with the linear transformation $A : \mathbb{R}^p \to \mathbb{R}^q$ given by $A(x) = G_{q \times p}\, x$, where $G_{q \times p}$ is a $q \times p$ matrix, and where D and B are invertible symmetric positive semidefinite $p \times p$ and $q \times q$ matrices, respectively, $\omega_i \le 1$ for all i ∈ {1, . . . , N}, $z = (z_1, \ldots, z_q) \in \mathbb{R}^q$, $\|\cdot\|_p$ is the Euclidean norm in $\mathbb{R}^p$, $\|\cdot\|_q$ is the Euclidean norm in $\mathbb{R}^q$, and $\Phi(z_t) = \max\{|z_t| - 1, 0\}$ for t = 1, 2, . . . , q.
Here, $h(x) = f(x) + \frac{1}{2}\|x\|_p^2$, where $f(x) = \frac{1}{2} x^T D x$, and hence the gradient $\nabla f$ is $\|D\|$-Lipschitz. Thus, the gradient $\nabla h$ is 1-strongly monotone and $(\|D\| + 1)$-Lipschitz. We choose $\gamma = \frac{1}{(\|D\| + 1)^2}$. Now, for $\lambda = 1$, the proximal operators of $g_1$, $g_2$, and $g_3$ are given in closed form. We consider $p = q$, $\omega_i = \frac{1}{i+1}$ for i ∈ {1, . . . , N}, and $G_{q \times p} = I_{p \times p}$, where $I_{p \times p}$ is the $p \times p$ identity matrix. The parameters are chosen as $\beta_n = \frac{1}{n+1}$, $\varepsilon_n = \frac{1}{(n+1)^2}$, $\rho_n = 1$, and $\theta_n = \bar{\theta}_n$. For the purpose of testing our algorithm, we took the following data (a minimal setup sketch in Python is given after the discussion below):
• D and B are randomly generated invertible symmetric positive semidefinite matrices.
• $x_0$ and $x_1$ are randomly generated starting points.

• The stopping criterion is based on the error $\mathrm{err}(n) = \|x_n - \bar{x}\|$.

Tables 1 and 2 and Figure 1 illustrate the numerical results of our algorithm for this example under the parameters and data given above and for $\theta = 0.5$. The number of iterations (Iter(n)), the CPU time in seconds (CPU(s)), and the error $\mathrm{err}(n) = \|x_n - \bar{x}\|$, where $\bar{x}$ is the solution of the bilevel optimization problem ($\bar{x} = 0$ in this example), are reported in Table 1. We now compare our algorithm for different choices of $\theta_n$, i.e., the non-inertial case ($\theta_n = 0$) and the inertial case ($\theta_n \ne 0$). For the non-inertial case, we simply take $\theta = 0$, and, for the inertial case, we take a small $\theta \in (0, 1)$ so that $\theta_n = \bar{\theta}_n = \theta$. Numerical comparisons of the inertial version ($\theta_n \ne 0$) of our proposed algorithm with its non-inertial version ($\theta_n = 0$) are presented in Table 3. Tables 1 and 2 show that the CPU time and the number of iterations of the algorithm increase linearly with the size or complexity of the problem (the dimensions p and q, the number of mappings N, and the number of functions M). From Table 3, we can see that our algorithm performs better with the choice $\theta_n \ne 0$. This implies that the inertial version of our algorithm has better convergence behavior.
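For concreteness, here is a minimal sketch of the upper-level data of this example, assuming the reconstructed $h(x) = \frac{1}{2}x^T D x + \frac{1}{2}\|x\|^2$; the dimension, the random generator, and the construction of D are our illustrative choices, not the exact data of the experiments.

```python
import numpy as np

# Sketch of the upper-level data: h(x) = 0.5*x^T D x + 0.5*||x||^2, so that
# grad h(x) = (D + I)x is 1-strongly monotone and (||D|| + 1)-Lipschitz.
rng = np.random.default_rng(1)
p = 50
S = rng.standard_normal((p, p))
D = S @ S.T + np.eye(p)              # random symmetric positive definite D
L_h = np.linalg.norm(D, 2) + 1.0     # Lipschitz constant of grad h
gamma = 1.0 / L_h**2                 # gamma = 1/(||D||+1)^2, in (0, 2*sigma/L_h^2)
grad_h = lambda x: D @ x + x

# Phi(z_t) = max{|z_t| - 1, 0}, as defined in the example
Phi = lambda z: np.maximum(np.abs(z) - 1.0, 0.0)

print(L_h, gamma)
```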

Conclusions
In this paper, we have considered the problem of minimizing a convex function over the solution set of the split feasibility problem of fixed point problems of demimetric mappings and constrained minimization problems of nonsmooth convex functions. We have shown that this problem can be solved by proximal and gradient methods, where the gradient method is used for the upper-level problem and the proximal method is used for the lower-level problem. Most standard bilevel problems are particular cases of our framework.
Author Contributions: S.E.Y., P.K., and A.G.G. contributed equally to this research paper, particularly to the conceptualization, methodology, validation, formal analysis, resources, and the writing and preparation of the original draft of the manuscript; P.K. also played a major role in supervision and funding acquisition. Moreover, A.G.G. wrote the code and ran the algorithm in MATLAB. All authors have read and agreed to the published version of the manuscript.