1. Introduction
In [1], Richard Bellman presented the concept of dynamic programming, demonstrating its utility in addressing complex decision-making problems by breaking them into simpler coupled sub-problems that are solved iteratively over time. Dynamic programming techniques have played a pivotal role in optimization-based feedback control and have since been expanded to encompass a wide range of control problems, including impulsive control, as noted in [2,3,4] and the related works.
A notable set of such iterative techniques are referred to as reinforcement learning (RL) algorithms. In their work, Bertsekas and Tsitsiklis [5] presented a comprehensive collection of RL algorithms, organized within the frameworks of value iteration (VI) and policy iteration (PI) methods.
In [6], Bertsekas and Ioffe explored the temporal differences policy iteration approach within the neuro-dynamic programming context, showing that the TD($\lambda$) method can be incorporated into a PI framework known as $\lambda$-PIR. Subsequently, in [7], Bertsekas investigated the connection between $\lambda$-policy iteration and proximal algorithms, which are especially useful for tackling convex optimization challenges.
In [8], Yachao et al. expanded the $\lambda$-PIR framework from finite policy scenarios to contractive models with infinite policy spaces, utilizing abstract dynamic programming techniques. They defined precise compact operators essential for the operation of the algorithm and determined the conditions under which the $\lambda$-PIR method achieves almost sure convergence for problems involving infinite-dimensional policy spaces.
In [9], Belhenniche et al. explored the characteristics of reinforcement learning techniques designed for feedback control within the context of fixed-point theory. By making relatively broad assumptions, they derived sufficient conditions that ensure almost sure convergence in scenarios involving infinite-dimensional policy spaces.
Fixed-point theory provides a robust framework with wide-ranging applications across fields such as topology, nonlinear analysis, optimal control, and machine learning. A fundamental result in this domain is the Banach contraction principle, established by Banach [10], which states that, if $(X, d)$ is a complete metric space and $T : X \to X$ is a mapping that satisfies $d(Tx, Ty) \le \alpha\, d(x, y)$ for some $\alpha \in [0, 1)$ and all $x, y \in X$, then $T$ has a unique fixed point $x^{*} \in X$. Furthermore, for any starting point $x_{0} \in X$, the sequence $\{x_{n}\}$ generated by the iteration $x_{n+1} = T x_{n}$ converges to $x^{*}$. Numerous variations and generalizations of the Banach contraction principle have been thoroughly explored in various mathematical settings. Notably, Kannan [11] proposed a completely different type of contraction that also ensures the existence of a unique fixed point for the associated operator.
Kannan contractive mappings differ from Banach contractions in that the contraction condition depends on the distances from each point to its image rather than only on the distance between the points themselves. This can be useful when dealing with operators in Markov Decision Processes (MDPs), where strict Banach contraction properties may not hold, particularly in the context of $\lambda$-policy iteration.
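To illustrate the distinction numerically, the short sketch below checks both conditions for a hypothetical discontinuous scalar map (chosen purely for illustration and not taken from the cited references): the Banach ratio $d(Tx,Ty)/d(x,y)$ blows up across the jump, while the Kannan ratio $d(Tx,Ty)/(d(x,Tx)+d(y,Ty))$ stays below $1/2$ on the sampled grid.

```python
import numpy as np

def T(x):
    # Hypothetical piecewise map on [0, 2] with a jump at x = 1:
    # not a Banach contraction, but (numerically) a Kannan mapping.
    return x / 4.0 if x < 1.0 else x / 5.0 + 0.1

xs = np.linspace(0.0, 2.0, 401)
banach_ratio, kannan_ratio = 0.0, 0.0
for x in xs:
    for y in xs:
        if x == y:
            continue
        lhs = abs(T(x) - T(y))
        banach_ratio = max(banach_ratio, lhs / abs(x - y))
        kannan_ratio = max(kannan_ratio, lhs / (abs(x - T(x)) + abs(y - T(y))))

print(f"sup d(Tx,Ty)/d(x,y)            = {banach_ratio:.2f}")  # large near the jump: no Banach constant < 1
print(f"sup d(Tx,Ty)/(d(x,Tx)+d(y,Ty)) = {kannan_ratio:.2f}")  # below 1/2: Kannan-type condition holds on the grid
```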
In [12], Vetro introduced an extension of Hardy–Rogers-type contractions, proving fixed-point theorems for self-mappings defined on metric spaces and ordered metric spaces. These results were also applied to multistage decision processes, demonstrating their applicability in optimization and decision-making frameworks. The framework established in [12] highlights the importance of generalized contractions in iterative decision-making processes, particularly in scenarios where classical contraction conditions do not apply directly.
Building on the foundations of fixed-point theory, this work extends and generalizes the results in [8]. Specifically, we demonstrate that the properties of the iterative process proposed in [8] remain valid for weakly contractive mappings, which represent a significantly broader class than the models studied in previous research.
A mapping $T : X \to X$ is defined as a Kannan mapping if there exists $\beta \in [0, \tfrac{1}{2})$ such that
$$d(Tx, Ty) \le \beta \left[ d(x, Tx) + d(y, Ty) \right]$$
for all $x, y \in X$.
Later, Chatterjea [13] proposed a related contractive condition, expressed as follows:
$$d(Tx, Ty) \le \gamma \left[ d(x, Ty) + d(y, Tx) \right],$$
where $\gamma \in [0, \tfrac{1}{2})$ and $x, y \in X$.
Expanding on these contraction principles, Zamfirescu [14] introduced a broader class of contractive mappings that unify and generalize the conditions given by Banach, Kannan, and Chatterjea. This is expressed as follows: let $(X, d)$ be a complete metric space, $a$, $b$, $c$ be real numbers with $0 \le a < 1$, $0 \le b < \tfrac{1}{2}$, $0 \le c < \tfrac{1}{2}$, and $T : X \to X$ be a function such that, for each pair of distinct points $x, y \in X$, at least one of the following conditions is satisfied:
- (1) $d(Tx, Ty) \le a\, d(x, y)$;
- (2) $d(Tx, Ty) \le b \left[ d(x, Tx) + d(y, Ty) \right]$;
- (3) $d(Tx, Ty) \le c \left[ d(x, Ty) + d(y, Tx) \right]$.
Then, $T$ has a unique fixed point. These generalized contraction principles form the basis for the results presented in this study.
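As a small practical aid (not part of the original theorem), the helper below reports which of conditions (1)–(3) hold for a given mapping, metric, and pair of points; the map, metric, and constants in the usage line are hypothetical choices for illustration.

```python
def zamfirescu_conditions(T, d, x, y, a, b, c):
    """Return the subset of Zamfirescu's conditions (1)-(3) satisfied at the pair (x, y)."""
    lhs = d(T(x), T(y))
    satisfied = []
    if lhs <= a * d(x, y):
        satisfied.append(1)                                  # Banach-type condition
    if lhs <= b * (d(x, T(x)) + d(y, T(y))):
        satisfied.append(2)                                  # Kannan-type condition
    if lhs <= c * (d(x, T(y)) + d(y, T(x))):
        satisfied.append(3)                                  # Chatterjea-type condition
    return satisfied

# Usage with a hypothetical piecewise scalar map and the usual metric d(x, y) = |x - y|.
d = lambda u, v: abs(u - v)
T = lambda x: 0.5 * x if x < 1.0 else 0.4 * x
print(zamfirescu_conditions(T, d, 0.9, 1.1, a=0.8, b=0.45, c=0.45))  # e.g. [1, 2, 3]
```

Checking pairs this way mirrors how the theorem is used later: it suffices that at least one condition holds for each pair, not that a single condition holds globally.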
Building on these theoretical foundations, our work advances reinforcement learning algorithms by developing convergence guarantees for weakly contractive mappings via Zamfirescu’s fixed-point theorem, significantly broadening the applicability beyond classical Banach contraction settings. We further demonstrate norm convergence in infinite-dimensional policy spaces under substantially relaxed assumptions, overcoming longstanding limitations in analyzing practical reinforcement learning systems. This framework successfully handles several critical challenges, including non-contractive operators, continuous-state and -action spaces inducing infinite policy dimensions, and discontinuous operators frequently encountered in real-world application cases where conventional fixed-point methods typically fail.
Dynamic programming operators, like the Bellman operator in Markov Decision Processes (MDPs), often do not meet the Banach contraction conditions due to the inherent structure of the problem, such as the presence of high discount factors or incomplete policy evaluations. In these situations, Zamfirescu's fixed-point theorem offers a more flexible and powerful framework, enabling convergence to be established even when the standard contraction condition fails, provided one of the alternative conditions holds. This alternative approach is particularly relevant for handling complex problems like those encountered in $\lambda$-policy iteration and other reinforcement learning contexts, where weak contraction properties may hold.
Zamfirescu's fixed-point theorem is particularly suitable for reinforcement learning because it deals with weakly contractive mappings, which are more general than standard Banach contractions. In reinforcement learning, especially in scenarios with high discount factors or non-standard reward structures, the operators involved may not satisfy the strong contraction conditions of Banach's theorem. However, they might still satisfy the weaker conditions provided by Zamfirescu's theorem, ensuring convergence of iterative algorithms.
In this paper, we employ fixed-point theory techniques to explore policy iteration and value iteration algorithms for mappings classified as weak contractions. By leveraging Zamfirescu’s theorem, we ensure the existence and uniqueness of solutions as well as the convergence of the involved operators. This approach is particularly significant for its applicability to discontinuous operators, offering a broader framework than traditional Banach or Kannan contractions.
Building on this foundation, we extend the convergence analysis of reinforcement learning algorithms, specifically value iteration (VI) and policy iteration (PI), to a wider class of operators satisfying Zamfirescu's weak contraction conditions. Prior works, such as [8,9], rely on stricter contraction assumptions or compact operator properties, limiting their applicability. In contrast, our use of Zamfirescu's theorem [14] accommodates discontinuous operators and non-uniform contraction constants, as demonstrated in our robotic navigation example (Section 5). This flexibility allows convergence guarantees in dynamic programming problems with high discount factors or irregular reward structures, common in feedback control. Moreover, we provide explicit sufficient conditions for norm convergence in infinite-dimensional policy spaces, enhancing the practicality of VI and PI in complex reinforcement learning tasks, such as real-time adaptive control.
Recent advances in policy iteration methods have evolved along two primary paths. The first involves viscosity approaches for regularized control problems [15], while the second focuses on constrained optimization frameworks [16]. Unlike these works, which assume Lipschitz continuity or explicit control constraints, our approach ensures convergence under milder conditions using fixed-point methods. This makes it especially well suited for reinforcement learning applications with sparse rewards or discontinuous dynamics.
The structure of this paper is as follows. In Section 2, we define the iterative feedback control problem under investigation, outlining the core definitions and the necessary assumptions that the data must meet. In Section 3, we introduce several fixed-point results that form the foundation of our contributions. The value iteration and policy iteration framework is developed in Section 4, within the setting of Banach, Kannan, and Chatterjea operators and from the perspective of Zamfirescu's fixed-point theorem. In Section 5, we provide an application of our findings to a real-world scenario, demonstrating its practical relevance. Lastly, we briefly discuss the conclusions and potential directions for future research in Section 6.
2. Preliminaries
We examine a collection of states denoted by $X$, a collection of controls denoted by $U$, and, for each state $x \in X$, a non-empty subset of controls $U(x) \subseteq U$. The set $M$ is defined as the collection of all functions $\mu : X \to U$ satisfying $\mu(x) \in U(x)$ for every $x \in X$; each such function will be referred to as a policy in the subsequent analysis.
Let $R(X)$ denote the space of real-valued functions $V : X \to \mathbb{R}$, where all outputs are finite. Its extension, $E(X)$, represents the space of extended real-valued functions $V : X \to \bar{\mathbb{R}}$, where $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$ incorporates infinite values. This distinction separates conventional function spaces ($R(X)$) from their generalized counterparts ($E(X)$), which are essential for handling unbounded behavior in optimization, measure theory, and related fields.
We analyze an operator $H : X \times U \times E(X) \to \bar{\mathbb{R}}$ and, for each policy $\mu \in M$, we introduce a mapping $F_{\mu} : E(X) \to E(X)$ given by
$$(F_{\mu} V)(x) = H(x, \mu(x), V), \qquad x \in X.$$
Next, we define a mapping $F : E(X) \to E(X)$ as
$$(F V)(x) = \inf_{u \in U(x)} H(x, u, V), \qquad x \in X.$$
Based on the definitions above, we can derive
$$(F V)(x) = \inf_{\mu \in M} (F_{\mu} V)(x), \qquad x \in X.$$
For a given positive function $\xi : X \to (0, \infty)$, we define $B_{\xi}(X)$ as the set of functions $V$ for which $\sup_{x \in X} |V(x)|/\xi(x) < \infty$, with the norm $\|\cdot\|_{\xi}$ on $B_{\xi}(X)$ specified as
$$\|V\|_{\xi} = \sup_{x \in X} \frac{|V(x)|}{\xi(x)}.$$
We assume that $\xi$ is positive, bounded, and bounded away from zero, so that $\|V\|_{\xi}$ remains finite for every $V \in B_{\xi}(X)$, even when $X$ is unbounded. The weighting function $\xi$ normalizes $V$ spatially via the ratio $|V(x)|/\xi(x)$ and ensures $\|V\|_{\xi} < \infty$ under this assumption, making $B_{\xi}(X)$ a proper normed space even for unbounded $X$ or growing $V$.
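For concreteness, the weighted sup-norm can be approximated on a finite grid as in the sketch below; the weight $\xi$ and the value function $V$ used here are assumptions made purely for illustration.

```python
import numpy as np

def weighted_sup_norm(V, xi, xs):
    """Approximate ||V||_xi = sup_x |V(x)| / xi(x) on a finite grid xs."""
    return max(abs(V(x)) / xi(x) for x in xs)

xs = np.linspace(0.0, 50.0, 5001)              # grid standing in for the state space X
xi = lambda x: 1.0 + 0.5 * np.exp(-x)          # assumed weight: positive, bounded, bounded away from zero
V = lambda x: np.sin(x) + 1.0 / (1.0 + x)      # an assumed bounded value function

print(f"||V||_xi ≈ {weighted_sup_norm(V, xi, xs):.4f}")
```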
The following lemma naturally follows from these definitions:
Lemma 1 ([8]).
The space $B_{\xi}(X)$ is complete under the topology defined by the norm $\|\cdot\|_{\xi}$. Additionally, it is evident that $B_{\xi}(X)$ is both closed and convex. Therefore, for a sequence $\{V_n\} \subset B_{\xi}(X)$ and a function $V$, if $\{V_n\}$ converges to $V$ in the sense that $\|V_n - V\|_{\xi} \to 0$, it follows that $V_n(x) \to V(x)$ for every $x \in X$.
We now present the following standard assumptions.
Assumption 1 ([8] (Well-posedness)).
For every $V \in B_{\xi}(X)$ and every policy $\mu \in M$, it holds that $F V \in B_{\xi}(X)$ and $F_{\mu} V \in B_{\xi}(X)$. This assumption ensures that the value functions remain bounded under the given norm, which is typical in discounted MDPs commonly used in reinforcement learning to model decision-making processes.
Definition 1 ([11] (Kannan's mapping)).
A self-map $F$ of $B_{\xi}(X)$ is called a Kannan mapping if there exists a constant $\beta \in [0, \tfrac{1}{2})$ such that
$$\|FV - FW\|_{\xi} \le \beta \left( \|V - FV\|_{\xi} + \|W - FW\|_{\xi} \right) \quad \text{for all } V, W \in B_{\xi}(X).$$
One immediate consequence of Definition 1 is that, if each policy evaluation operator $F_{\mu}$ is a Kannan contraction with modulus $\beta$, then the Bellman optimality operator $F$, defined pointwise by $(FV)(x) = \inf_{\mu \in M} (F_{\mu} V)(x)$, is also a Kannan contraction with the same modulus $\beta$. To show this, one bounds $\|FV - FW\|_{\xi}$ by applying the Kannan inequality to each $F_{\mu}$ and taking the infimum over $\mu \in M$; by symmetry, the reverse inequality also holds, and combining the two one-sided bounds yields the Kannan contraction inequality for $F$, thereby confirming that $F$ is a Kannan contraction. This is relevant to RL as it allows convergence analysis in policy evaluation steps, where operators may not be strongly contractive but still exhibit weaker contractive properties suitable for infinite state–action spaces.
This kind of mapping is very important in metric fixed-point theory. It is well known that Banach's contraction mappings are continuous, while Kannan's mappings are not necessarily continuous. Kannan's theorem is important because Subrahmanyam [17] proved that Kannan's theorem characterizes metric completeness. That is, a metric space $X$ is complete if and only if every mapping satisfying (2) on $X$ with constant $\beta \in [0, \tfrac{1}{2})$ has a fixed point. Banach's contraction does not have this property.
Example 1 ([11]).
Let $F$ be a self-map defined piecewise, taking one expression on part of its domain and a different expression elsewhere. Here, $F$ is discontinuous at the switching point; consequently, the Banach contraction condition is not satisfied. However, it is easily seen that Kannan's condition is satisfied for a suitable choice of the constant.
Example 2. Let $(X, d)$ be a metric space and $T : X \to X$ be defined piecewise in a similar manner. Then, $T$ is not continuous on $X$, yet $T$ satisfies contractive condition (2) for an appropriate constant.
Definition 2 ([13] (Chatterjea's mapping)).
A self-map $F$ of $B_{\xi}(X)$ is called a Chatterjea mapping if there exists $\gamma \in [0, \tfrac{1}{2})$ such that
$$\|FV - FW\|_{\xi} \le \gamma \left( \|V - FW\|_{\xi} + \|W - FV\|_{\xi} \right).$$
This inequality holds for all $V, W \in B_{\xi}(X)$.
Notation Summary
The key mathematical notations used throughout this paper are summarized in Table 1.
3. Auxiliary Results
In this section, we present a number of results that will be instrumental in the proof of the main result of this article.
Theorem 1 ([11,14] (Existence)).
Suppose that the operators $F$ and $F_{\mu}$, $\mu \in M$, are Kannan mappings. Then, there exist fixed points $V^{*}$ for $F$ and $V_{\mu}$ for $F_{\mu}$, respectively. Based on the result of Theorem 1, we can establish the following:
Lemma 2 ([11]).
The following properties are satisfied:
- (i) Starting with any $V_0 \in B_{\xi}(X)$, the sequence $\{V_n\}$ generated by the iteration $V_{n+1} = F V_n$ converges to $V^{*}$ in the norm $\|\cdot\|_{\xi}$.
- (ii) Starting with any $V_0 \in B_{\xi}(X)$, the sequence $\{V_n\}$ generated by the iteration $V_{n+1} = F_{\mu} V_n$ converges to $V_{\mu}$ in the norm $\|\cdot\|_{\xi}$.
The results of Lemma 2 highlight two fundamental iterative processes in dynamic programming: value iteration (VI) and policy iteration (PI). Value iteration is an approach in which the value function is updated iteratively using the Bellman optimality operator $F$ until it converges to the optimal value function $V^{*}$. This approach does not directly maintain a policy at each iteration but instead progressively improves the value function, which can then be used to derive the optimal policy. In contrast, policy iteration operates through two alternating phases: policy evaluation and policy enhancement. The evaluation phase, as indicated in the lemma by the iteration $V_{n+1} = F_{\mu} V_n$, focuses on calculating the value function $V_{\mu}$ for a given policy $\mu$. Following this, the enhancement phase updates the policy to minimize the expected cost, and this cycle continues until the optimal policy is achieved. Unlike value iteration, which directly refines the value function, policy iteration employs systematic updates via policy evaluation and enhancement, often resulting in quicker convergence in real-world scenarios.
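To ground the two schemes, the sketch below instantiates the policy evaluation operator $F_{\mu}$ and the Bellman optimality operator $F$ for a small, hypothetical finite MDP (two states, two actions, discount 0.9); the transition and cost data are invented for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[u] is the transition matrix under action u,
# g[x, u] is the stage cost, and alpha is the discount factor.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.7, 0.3]])}
g = np.array([[1.0, 2.0],
              [0.5, 0.3]])
alpha = 0.9

def F_mu(V, mu):
    """Policy evaluation operator: (F_mu V)(x) = g(x, mu(x)) + alpha * E[V(x') | x, mu(x)]."""
    return np.array([g[x, mu[x]] + alpha * P[mu[x]][x] @ V for x in range(2)])

def F(V):
    """Bellman optimality operator: (F V)(x) = min_u { g(x, u) + alpha * E[V(x') | x, u] }."""
    return np.array([min(g[x, u] + alpha * P[u][x] @ V for u in (0, 1)) for x in range(2)])

V0 = np.zeros(2)
print("one value iteration sweep   :", F(V0))
print("one evaluation sweep, mu=(0,1):", F_mu(V0, mu=[0, 1]))
```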
As noted above, the findings of Lemma 2 provide a foundation for value iteration (VI). However, to guarantee the success of policy iteration (PI), further assumptions must be introduced.
Assumption 2 ([8] (Monotonicity)).
For any $V, W \in B_{\xi}(X)$, if $V \le W$, then it follows that
$$F V \le F W \quad \text{and} \quad F_{\mu} V \le F_{\mu} W \ \text{ for every } \mu \in M,$$
where the inequality $\le$ is understood in a pointwise manner throughout $X$.
Assumption 3 ([8] (Attainability)).
For every $V \in B_{\xi}(X)$, there exists a policy $\mu \in M$ such that $F_{\mu} V = F V$.
Clarification of Contraction Assumptions
To ensure convergence of value iteration (VI) and policy iteration (PI), we precisely define the contraction properties assumed for the Bellman optimality operator $F$ and the policy evaluation operators $F_{\mu}$, consistently applied across Definition 1, Theorem 1, and all convergence results.
4. Main Results
In this section, we employ the abstract dynamic programming (DP) framework developed in [18], integrating both the policy iteration (PI) and value iteration (VI) methods to extend finite-policy problems to three contractive models with infinite policy spaces. These extensions are analyzed through the theoretical lens of Zamfirescu's fixed-point theorem. Infinite policy spaces naturally emerge in two key scenarios: (1) systems with infinite state spaces or (2) finite state spaces with uncountable control action sets. To rigorously address these cases, we investigate the following generalization of contractive mappings:
Theorem 2 ([14]).
Let $(B_{\xi}(X), \|\cdot\|_{\xi})$ be a Banach space, $a$, $b$, $c$ real numbers with $0 \le a < 1$, $0 \le b < \tfrac{1}{2}$, $0 \le c < \tfrac{1}{2}$, and $F$ a self-map of $B_{\xi}(X)$ (and, analogously, each $F_{\mu}$, $\mu \in M$) such that, for each couple of different functions $V, W \in B_{\xi}(X)$, at least one of the following conditions is satisfied:
- (C1) $\|FV - FW\|_{\xi} \le a\, \|V - W\|_{\xi}$;
- (C2) $\|FV - FW\|_{\xi} \le b \left( \|V - FV\|_{\xi} + \|W - FW\|_{\xi} \right)$;
- (C3) $\|FV - FW\|_{\xi} \le c \left( \|V - FW\|_{\xi} + \|W - FV\|_{\xi} \right)$.
Then, there exist fixed points $V^{*}$ for $F$ and $V_{\mu}$ for $F_{\mu}$, respectively.
Proof. We apply the same methodology as Zamfirescu [14], but within the framework of dynamic programming notation. Consider the number
$$\delta = \max\left\{ a, \frac{b}{1-b}, \frac{c}{1-c} \right\}.$$
Obviously, $0 \le \delta < 1$. Now, let us choose an arbitrary $V_0 \in B_{\xi}(X)$ and fix an integer $n \ge 0$. Take $V = V_n$ and $W = F V_n = V_{n+1}$. Suppose $V_n \ne V_{n+1}$; otherwise, $V_n$ is a fixed point of $F$. If for these two points condition (C1) is satisfied, then
$$\|V_{n+1} - V_{n+2}\|_{\xi} = \|F V_n - F V_{n+1}\|_{\xi} \le a \|V_n - V_{n+1}\|_{\xi} \le \delta \|V_n - V_{n+1}\|_{\xi}.$$
If for $V_n, V_{n+1}$ condition (C2) is verified, then
$$\|F V_n - F V_{n+1}\|_{\xi} \le b \left( \|V_n - F V_n\|_{\xi} + \|V_{n+1} - F V_{n+1}\|_{\xi} \right) = b \left( \|V_n - V_{n+1}\|_{\xi} + \|V_{n+1} - V_{n+2}\|_{\xi} \right),$$
which implies
$$\|V_{n+1} - V_{n+2}\|_{\xi} \le \frac{b}{1-b} \|V_n - V_{n+1}\|_{\xi} \le \delta \|V_n - V_{n+1}\|_{\xi}.$$
In case condition (C3) is satisfied,
$$\|F V_n - F V_{n+1}\|_{\xi} \le c \left( \|V_n - F V_{n+1}\|_{\xi} + \|V_{n+1} - F V_n\|_{\xi} \right) = c \|V_n - V_{n+2}\|_{\xi} \le c \left( \|V_n - V_{n+1}\|_{\xi} + \|V_{n+1} - V_{n+2}\|_{\xi} \right),$$
implying the contraction
$$\|V_{n+1} - V_{n+2}\|_{\xi} \le \frac{c}{1-c} \|V_n - V_{n+1}\|_{\xi} \le \delta \|V_n - V_{n+1}\|_{\xi}.$$
This inequality, true for every $n$, clearly implies that $\{V_n\}$ is a Cauchy sequence and therefore converges to some point $V^{*} \in B_{\xi}(X)$. Since $F$ is not necessarily continuous, let us assume that $F V^{*} \ne V^{*}$. Then, we can write the following estimate, using the fact that conditions (C1)–(C3) also imply $\|F V - F W\|_{\xi} \le \delta \|V - W\|_{\xi} + 2\delta \|V - F V\|_{\xi}$ for every pair $V, W$:
$$\|V^{*} - F V^{*}\|_{\xi} \le \|V^{*} - V_{n+1}\|_{\xi} + \|F V_n - F V^{*}\|_{\xi} \le \|V^{*} - V_{n+1}\|_{\xi} + \delta \|V_n - V^{*}\|_{\xi} + 2\delta \|V_n - V_{n+1}\|_{\xi}.$$
As $\|V^{*} - V_{n+1}\|_{\xi}$, $\|V_n - V^{*}\|_{\xi}$, and $\|V_n - V_{n+1}\|_{\xi}$ all tend to 0 as $n \to \infty$, it follows that $\|V^{*} - F V^{*}\|_{\xi} = 0$, hence a contradiction. Then, we have
$$F V^{*} = V^{*}.$$
Now, we show that this fixed point $V^{*}$ is unique. Suppose this is not true; then, we can set $F W = W$ for some point $W$ different from $V^{*}$. Then,
$$\|V^{*} - W\|_{\xi} = \|F V^{*} - F W\|_{\xi} > a \|V^{*} - W\|_{\xi}, \qquad \|V^{*} - W\|_{\xi} > b \left( \|V^{*} - F V^{*}\|_{\xi} + \|W - F W\|_{\xi} \right) = 0, \qquad \|V^{*} - W\|_{\xi} > c \left( \|V^{*} - F W\|_{\xi} + \|W - F V^{*}\|_{\xi} \right) = 2c \|V^{*} - W\|_{\xi},$$
ensuring that none of the three conditions of Theorem 2 are met at the points $V^{*}$ and $W$, which contradicts the hypothesis; hence, $W = V^{*}$. □
The Zamfirescu theorem introduces three different types of contractions: Banach, Kannan, and Chatterjea contractions. The Banach-type contraction (condition (C1)) ensures that distances shrink by a fixed factor, guaranteeing a unique fixed point. The Kannan-type contraction (condition (C2)), instead of comparing two points directly, considers distances from each point to its image under the function, an essential property in dynamic programming (DP) models where strict contraction may not hold yet iterative updates still lead to convergence. The Chatterjea-type contraction (condition (C3)) introduces a symmetric relationship between distances, providing additional flexibility in systems where direct pointwise contraction is too restrictive.
Banach contractions impose a uniform reduction in distance, making them more restrictive and often inapplicable in RL settings with discontinuous operators or high discount factors. In contrast, Kannan contractions relax this requirement by focusing on distances to mapped points, offering a broader scope that aligns with the iterative nature of RL algorithms like value iteration, where convergence can occur even without continuity. Zamfirescu's theorem unifies these conditions, providing a robust framework in which at least one condition holds for each pair of points, enhancing applicability over relying on Banach or Kannan contractions alone.
The advantage of this formulation is its adaptability: if one contraction condition fails, another may still hold, ensuring the fixed-point property. This makes the theorem particularly valuable in DP models, especially when dealing with infinite policy spaces that arise from either continuous-state spaces or uncountable action sets, where establishing a single global contraction is often impractical.
Corollary 1 (Banach [10]).
Let $(X, d)$ be a complete metric space, $a \in [0, 1)$, and $T : X \to X$ a function such that, for each couple of different points $x, y$ in $X$, condition (C1) of Theorem 2 is verified. Then, $T$ has a unique fixed point.
Corollary 2 (Kannan [11]).
Let $(X, d)$ be a complete metric space, $b \in [0, \tfrac{1}{2})$, and $T : X \to X$ a function such that, for each couple of different points $x, y$ in $X$, condition (C2) of Theorem 2 is verified. Then, $T$ has a unique fixed point.
Corollary 3 (Chatterjea [13]).
Let $(X, d)$ be a complete metric space, $c \in [0, \tfrac{1}{2})$, and $T : X \to X$ a function such that, for each couple of different points $x, y$ in $X$, condition (C3) of Theorem 2 is verified. Then, $T$ has a unique fixed point.
The following proposition presents key implications of Zamfirescu's Theorem 2.
Proposition 1. Let Theorem 2 hold; then,
- (a) $\|V - V^{*}\|_{\xi} \le \dfrac{1}{1-\delta}\, \|V - F V\|_{\xi}$ for every $V \in B_{\xi}(X)$;
- (b) For any $V \in B_{\xi}(X)$ and $\mu \in M$: $\|V - V_{\mu}\|_{\xi} \le \dfrac{1}{1-\delta}\, \|V - F_{\mu} V\|_{\xi}$.
Proof. Let us consider $V \in B_{\xi}(X)$ and the sequence $V_0 = V$, $V_{k+1} = F V_k$, as we have, from the contraction estimate in the proof of Theorem 2,
$$\|V_{k} - V_{k+1}\|_{\xi} \le \delta^{k} \|V - F V\|_{\xi}.$$
Then, we use the triangle inequality, where $n$ represents the iteration index in the sequence $\{V_n\}$, which is generated by applying the operator $F$ iteratively. Proceeding as follows:
$$\|V - V_{n}\|_{\xi} \le \sum_{k=0}^{n-1} \|V_{k} - V_{k+1}\|_{\xi} \le \sum_{k=0}^{n-1} \delta^{k} \|V - F V\|_{\xi} \le \frac{1}{1-\delta} \|V - F V\|_{\xi}.$$
Taking the limit as $n \to \infty$ and using Lemma 2, we obtain
$$\|V - V^{*}\|_{\xi} \le \frac{1}{1-\delta} \|V - F V\|_{\xi},$$
which is (a). The same argument applied to $F_{\mu}$ and its fixed point $V_{\mu}$ gives (b). □
Brief Justification:
The bounds in Proposition 1 follow from three key observations:
- 1 Zamfirescu's Contraction: The operator $F$ satisfies
$$\|F V - F W\|_{\xi} \le \delta \|V - W\|_{\xi} + 2\delta \|V - F V\|_{\xi},$$
derived from conditions (C1)–(C3) in Theorem 2.
- 2 Iterative Shrinkage: For the sequence $V_{n+1} = F V_n$, we have
$$\|V_{n+1} - V_{n+2}\|_{\xi} \le \delta \|V_{n} - V_{n+1}\|_{\xi},$$
showing that each iteration reduces the distance by the factor $\delta$.
- 3 Geometric Convergence: The fixed-point error bound emerges from summing the geometric series
$$\sum_{k=0}^{\infty} \delta^{k} = \frac{1}{1-\delta}.$$
The identical logic applies to $F_{\mu}$ due to uniform contraction across $\mu \in M$. The factor $\frac{1}{1-\delta}$ quantifies how contraction strength governs approximation error.
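For reference, combining the shrinkage inequality with the triangle inequality gives the familiar a priori error bound (a restatement of the geometric-series step above, with $\delta = \max\{a, \tfrac{b}{1-b}, \tfrac{c}{1-c}\}$):
$$\|V_{n} - V^{*}\|_{\xi} \;\le\; \sum_{k=n}^{\infty} \|V_{k} - V_{k+1}\|_{\xi} \;\le\; \sum_{k=n}^{\infty} \delta^{k}\, \|V_{1} - V_{0}\|_{\xi} \;=\; \frac{\delta^{n}}{1-\delta}\, \|V_{1} - V_{0}\|_{\xi},$$
which is the form of the stopping-criterion bound used in the algorithmic framework of Section 4.1.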
Remark 1. Based on Proposition 1, if we set $V = V^{*}$, it can be shown that, for any $\epsilon > 0$, there exists a policy $\mu \in M$ such that the norm $\|V_{\mu} - V^{*}\|_{\xi} \le \frac{\epsilon}{1-\delta}$. This can be achieved by selecting $\mu(x)$ to minimize $H(x, u, V^{*})$ over $u \in U(x)$ with an error not exceeding $\epsilon\, \xi(x)$ for all $x \in X$. Specifically, we obtain
$$\|F_{\mu} V^{*} - F V^{*}\|_{\xi} \le \epsilon, \quad \text{that is,} \quad \|F_{\mu} V^{*} - V^{*}\|_{\xi} \le \epsilon.$$
Consequently, this implies
$$\|V_{\mu} - V^{*}\|_{\xi} \le \frac{1}{1-\delta} \|V^{*} - F_{\mu} V^{*}\|_{\xi} \le \frac{\epsilon}{1-\delta}.$$
The significance of monotonicity and Zamfirescu's theorem is illustrated by proving that $V^{*}$, the fixed point of $F$, represents the infimum over all $\mu \in M$ of $V_{\mu}$, the unique fixed point of $F_{\mu}$.
Proposition 2. Let the monotonicity and Zamfirescu's theorem conditions hold. Then, for all $x \in X$, we have
$$V^{*}(x) = \inf_{\mu \in M} V_{\mu}(x).$$
Proof. We will establish the two inequalities that define the infimum.
First inequality: $V^{*} \le \inf_{\mu \in M} V_{\mu}$.
For all $\mu \in M$, we have $F V_{\mu} \le F_{\mu} V_{\mu}$. Since $F_{\mu} V_{\mu} = V_{\mu}$ by the definition of the fixed point, this simplifies to
$$F V_{\mu} \le V_{\mu}.$$
Applying the operator $F$ repeatedly to both sides and using the monotonicity assumption, we obtain
$$F^{n} V_{\mu} \le V_{\mu} \quad \text{for every } n.$$
Taking the limit as $n \to \infty$ and assuming the sequence $\{F^{n} V_{\mu}\}$ converges, it follows that
$$V^{*} \le V_{\mu}.$$
Thus, we have $V^{*} \le \inf_{\mu \in M} V_{\mu}$.
Second inequality: $\inf_{\mu \in M} V_{\mu} \le V^{*}$.
Using Remark 1, for any $\epsilon > 0$, there exists $\mu_{\epsilon} \in M$ such that
$$\|V_{\mu_{\epsilon}} - V^{*}\|_{\xi} \le \frac{\epsilon}{1-\delta}.$$
Thus, for each $x \in X$, we have
$$\inf_{\mu \in M} V_{\mu}(x) \le V_{\mu_{\epsilon}}(x) \le V^{*}(x) + \frac{\epsilon}{1-\delta}\, \xi(x).$$
Taking the limit as $\epsilon \to 0$, we obtain
$$\inf_{\mu \in M} V_{\mu}(x) \le V^{*}(x).$$
Therefore, $\inf_{\mu \in M} V_{\mu} \le V^{*}$.
Combining these two inequalities, we obtain
$$V^{*}(x) = \inf_{\mu \in M} V_{\mu}(x) \quad \text{for all } x \in X. \; \square$$
These results show that the fixed point of F acts as a lower bound for policy-dependent fixed points , ensuring the optimal value function can be approximated within a controlled error. The upper bounds from Zamfirescu’s theorem highlight convergence and stability, key to feedback control in dynamic programming and optimization.
4.1. Algorithmic Framework
Based on Theorem 2, we describe the implementation of value iteration (VI) and policy iteration (PI) for weakly contractive operators:
Value Iteration Algorithm:
- 1 Initialize $V_0 \in B_{\xi}(X)$ (typically $V_0(x) = 0$ for all $x \in X$).
- 2 Set convergence threshold $\epsilon > 0$ and maximum number of iterations $N_{\max}$.
- 3 For $k = 0, 1, \dots, N_{\max} - 1$:
Compute $V_{k+1}(x) = (F V_k)(x) = \inf_{u \in U(x)} H(x, u, V_k)$ for all $x \in X$.
If $\|V_{k+1} - V_k\|_{\xi} < \epsilon$, return $V_{k+1}$ as the approximate fixed point $V^{*}$.
- 4 Return $V_{N_{\max}}$ (with warning if unconverged).
Policy Iteration Algorithm:
- 1 Initialize policy $\mu_0 \in M$ arbitrarily.
- 2 Set convergence threshold $\epsilon > 0$ and maximum number of iterations $N_{\max}$.
- 3 For $k = 0, 1, \dots, N_{\max} - 1$:
- (a) Policy Evaluation: compute $V_{\mu_k}$, the fixed point of $F_{\mu_k}$ (e.g., by iterating $V \leftarrow F_{\mu_k} V$ until convergence).
- (b) Policy Improvement: choose $\mu_{k+1} \in M$ such that $F_{\mu_{k+1}} V_{\mu_k} = F V_{\mu_k}$ (Assumption 3).
- (c) If $\|F V_{\mu_k} - V_{\mu_k}\|_{\xi} < \epsilon$ or $\mu_{k+1} = \mu_k$, return $(\mu_{k+1}, V_{\mu_k})$ as the optimal solution.
- 4 Return $(\mu_{N_{\max}}, V_{\mu_{N_{\max}-1}})$ (with warning if unconverged).
Key features of these algorithms:
- The stopping criterion derives from Proposition 1.
- Policy evaluation leverages the weak contraction properties (C1)–(C3) from Theorem 2.
- Convergence is guaranteed even when standard contraction factors approach 1.
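A minimal, self-contained implementation of both procedures for the same hypothetical two-state MDP used earlier is sketched below; the data, the threshold $\epsilon = 10^{-8}$, and the iteration caps are arbitrary illustrative choices, and the unweighted max-norm plays the role of $\|\cdot\|_{\xi}$ with $\xi \equiv 1$.

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the earlier sketch.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.7, 0.3]])}
g = np.array([[1.0, 2.0], [0.5, 0.3]])
alpha, n_states, actions = 0.9, 2, (0, 1)

def F(V):
    return np.array([min(g[x, u] + alpha * P[u][x] @ V for u in actions) for x in range(n_states)])

def F_mu(V, mu):
    return np.array([g[x, mu[x]] + alpha * P[mu[x]][x] @ V for x in range(n_states)])

def value_iteration(eps=1e-8, n_max=1000):
    V = np.zeros(n_states)                         # step 1: V_0 = 0
    for _ in range(n_max):                         # step 3: V_{k+1} = F V_k
        V_next = F(V)
        if np.max(np.abs(V_next - V)) < eps:       # stopping criterion (Proposition 1)
            return V_next
        V = V_next
    return V                                       # step 4: unconverged fallback

def policy_iteration(eps=1e-8, n_max=100):
    mu, V = [0] * n_states, np.zeros(n_states)     # step 1: arbitrary initial policy
    for _ in range(n_max):
        for _ in range(10_000):                    # (a) policy evaluation: iterate F_mu
            V_next = F_mu(V, mu)
            done = np.max(np.abs(V_next - V)) < eps
            V = V_next
            if done:
                break
        mu_next = [int(np.argmin([g[x, u] + alpha * P[u][x] @ V for u in actions]))
                   for x in range(n_states)]       # (b) policy improvement: greedy w.r.t. V
        if mu_next == mu:                          # (c) stop when the policy is stable
            return mu, V
        mu = mu_next
    return mu, V                                   # step 4: unconverged fallback

print("VI fixed point     :", value_iteration())
print("PI policy and value:", policy_iteration())
```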
4.2. Examples
Consider a Banach space equipped with a suitable norm, and define the map $F$ on it piecewise, with different expressions on two sub-regions of its domain. For all pairs of elements, the map satisfies a contractive inequality of the form (C1)–(C3) with appropriate constants. Therefore, at least one of the three conditions in Zamfirescu's theorem holds, ensuring a unique fixed point.
A fixed point of $F$ satisfies $F V = V$, leading to a case analysis over the two pieces of $F$: on the first piece, the fixed-point equation leads to a contradiction with the defining condition of that piece, which rules the case out; on the second piece, solving the fixed-point equation yields a unique candidate, and since this candidate lies in the region where that piece applies, it is the unique fixed point.
Given that the fixed point $V^{*}$ represents the optimal cost-to-go function, the optimal feedback control follows by minimizing the stage cost plus the discounted cost-to-go of the successor state. For a cost function $g(x, u)$ and dynamics $x^{+} = f(x, u)$, the optimal control law is
$$u^{*}(x) \in \arg\min_{u \in U(x)} \left\{ g(x, u) + \alpha\, V^{*}(f(x, u)) \right\},$$
where $\alpha \in (0, 1)$ is the discount factor. With $V^{*}$ available in closed form, explicit control computation is achievable, demonstrating the theorem's application in control theory. Zamfirescu's theorem ensures convergence, highlighting its significance in solving infinite policy DP models.
5. Application
Consider a robotic navigation task in a bounded environment, where the goal is to optimize a warehouse robot's path to a docking station while minimizing energy consumption. Let $X = [0, 1]$ represent the normalized position space, let $U$ denote the control input space (e.g., speed), and let $B_{\xi}(X)$ be the Banach space of value functions with norm $\|\cdot\|_{\xi}$. The operator $H$ models the cost-to-go as
$$H(x, u, V) = g(x, u) + \alpha\, V(f(x, u)),$$
where $g(x, u)$ is the energy-related stage cost, $\alpha \in (0, 1)$ is the discount factor, and $f(x, u)$ describes the position dynamics. Define the policy evaluation operator $(F_{\mu} V)(x) = H(x, \mu(x), V)$ and the Bellman optimality operator $(F V)(x) = \inf_{u \in U(x)} H(x, u, V)$.
To compare Zamfirescu's and Banach's approaches, consider a piecewise operator for the value update, defined by different expressions on two sub-regions of the position space. This operator satisfies Zamfirescu's conditions with suitable constants $a$, $b$, and $c$. For Banach's approach, we test a modified operator with a single global contraction factor. The fixed point of the piecewise operator, the corresponding optimal control, and the resulting cost can be computed explicitly.
To validate Zamfirescu's flexibility versus Banach's limitations, we analyze convergence on a discretized version of the problem. For a representative pair of value functions evaluated near the switching point, each of the three estimates can be computed:
- Banach-type: the ratio of $\|F V - F W\|_{\xi}$ to $\|V - W\|_{\xi}$;
- Kannan-type: the ratio of $\|F V - F W\|_{\xi}$ to $\|V - F V\|_{\xi} + \|W - F W\|_{\xi}$;
- Chatterjea-type: the ratio of $\|F V - F W\|_{\xi}$ to $\|V - F W\|_{\xi} + \|W - F V\|_{\xi}$.
Zamfirescu's framework switches conditions dynamically, ensuring convergence. For Banach's approach, the operator's discontinuity at the switching point causes failure: for a pair of points on opposite sides of the discontinuity, the single-rate estimate no longer applies, and boundary effects amplify the effective contraction factor. As shown in Figure 1, the operator exhibits piecewise behavior with a stable fixed point under Zamfirescu's framework, while Banach's method becomes unstable near the discontinuity. This instability is further quantified in Table 2, where Banach's single-rate contraction yields an error after 10 iterations that is significantly higher than the errors obtained with Zamfirescu's adaptive conditions.
These results highlight the usefulness of Zamfirescu's approach. The Banach approach fails due to the discontinuity at the switching point (see Appendix A), which amplifies the effective contraction factor in the discretized setting, while Zamfirescu's condition-switching ensures robust convergence for robotic navigation.
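To show how such a comparison can be reproduced in a few lines, the sketch below discretizes the position space, runs value iteration with a deliberately discontinuous stage cost, and reports, for successive iterates, the empirical Banach-, Kannan-, and Chatterjea-type moduli. All parameters (grid size, discount $\alpha = 0.9$, cost, and dynamics) are assumptions for illustration and not the exact model of this section.

```python
import numpy as np

# Discretized 1-D navigation toy: position grid on [0, 1], dock at x = 1.
N, alpha = 101, 0.9
xs = np.linspace(0.0, 1.0, N)
controls = np.linspace(-0.1, 0.1, 5)                  # admissible speed increments

def stage_cost(x, u):
    # Energy-like cost with a deliberate discontinuity at x = 0.5 (an assumed zone penalty).
    return (1.0 - x) + u ** 2 + (0.5 if x < 0.5 else 0.0)

def bellman(V):
    V_next = np.empty(N)
    for i, x in enumerate(xs):
        costs = []
        for u in controls:
            j = int(np.argmin(np.abs(xs - np.clip(x + u, 0.0, 1.0))))   # nearest-grid successor state
            costs.append(stage_cost(x, u) + alpha * V[j])
        V_next[i] = min(costs)
    return V_next

sup = lambda W: np.max(np.abs(W))                     # unweighted sup-norm (xi = 1 on the grid)
V_prev = np.zeros(N)
V_curr = bellman(V_prev)
for k in range(1, 11):
    V_next = bellman(V_curr)                          # with V = V_prev, W = V_curr: FV = V_curr, FW = V_next
    lhs = sup(V_next - V_curr)
    banach = lhs / max(sup(V_curr - V_prev), 1e-12)
    kannan = lhs / max(sup(V_prev - V_curr) + sup(V_curr - V_next), 1e-12)
    chatterjea = lhs / max(sup(V_prev - V_next), 1e-12)   # ||V - FW|| + ||W - FV||, second term is zero here
    print(f"sweep {k:2d}: Banach {banach:.3f} | Kannan {kannan:.3f} | Chatterjea {chatterjea:.3f}")
    V_prev, V_curr = V_curr, V_next
```

Whichever ratio is smallest at a given sweep indicates which Zamfirescu condition is doing the work there, which is the condition-switching behaviour referred to above.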