Article

Convergence Analysis of Reinforcement Learning Algorithms Using Generalized Weak Contraction Mappings

by
Abdelkader Belhenniche
,
Roman Chertovskih
* and
Rui Gonçalves
Research Center for Systems and Technologies SYSTEC-ARISE, Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(5), 750; https://doi.org/10.3390/sym17050750
Submission received: 11 April 2025 / Revised: 6 May 2025 / Accepted: 9 May 2025 / Published: 13 May 2025

Abstract

We investigate the convergence properties of policy iteration and value iteration algorithms in reinforcement learning by leveraging fixed-point theory, with a focus on mappings that exhibit weak contractive behavior. Unlike traditional studies that rely on strong contraction properties, such as those defined by the Banach contraction principle, we consider a more general class of mappings that includes weak contractions. Employing Zamfirescu’s fixed-point theorem, we establish sufficient conditions for norm convergence in infinite-dimensional policy spaces under broad assumptions. Our approach extends the applicability of these algorithms to feedback control problems in reinforcement learning, where standard contraction conditions may not hold. Through illustrative examples, we demonstrate that this framework encompasses a wider range of operators, offering new insights into the robustness and flexibility of iterative methods in dynamic programming.
MSC:
47H10; 93C40; 49L20

1. Introduction

In [1], Richard Bellman presented the concept of dynamic programming, demonstrating its utility in addressing complex decision-making problems by breaking them into simpler coupled sub-problems that are solved iteratively over time. Dynamic programming techniques have played a pivotal role in optimization-based feedback control and have since been expanded to encompass a wide range of control problems, including impulsive control, as noted in [2,3,4] and the related works.
A notable set of such iterative techniques are referred to as reinforcement learning (RL) algorithms. In their work, Bertsekas and Tsitsiklis [5] presented a comprehensive collection of RL algorithms, organized within the frameworks of value iteration (VI) and policy iteration (PI) methods.
In [6], Bertsekas and Ioffe explored the temporal differences policy iteration (TD(λ)) approach within the neuro-dynamic programming context, showing that the TD(λ) method can be incorporated into a PI framework known as λ-PIR. Subsequently, in [7], Bertsekas investigated the connection between TD(λ) and proximal algorithms, which are especially useful for tackling convex optimization challenges.
In [8], Li et al. expanded the λ-PIR framework from finite policy scenarios to contractive models with infinite policy spaces, utilizing abstract dynamic programming techniques. They defined precise compact operators essential for the operation of the algorithm and determined the conditions under which the λ-PIR method achieves almost sure convergence for problems involving infinite-dimensional policy spaces.
In [9], Belhenniche et al. explored the characteristics of reinforcement learning techniques designed for feedback control within the context of fixed-point theory. By making relatively broad assumptions, they derived sufficient conditions that ensure almost sure convergence in scenarios involving infinite-dimensional policy spaces.
Fixed-point theory provides a robust framework with wide-ranging applications across fields such as topology, nonlinear analysis, optimal control, and machine learning. A fundamental result in this domain is the Banach contraction principle, established by Banach [10], which states that, if (X, d) is a complete metric space and T : X → X is a mapping that satisfies
d(Tx, Ty) ≤ γ d(x, y),
for some 0 ≤ γ < 1 and all x, y ∈ X, then T has a unique fixed point x*. Furthermore, for any starting point x_0 ∈ X, the sequence {x_n} generated by the iteration x_{n+1} = Tx_n converges to x*. Numerous variations and generalizations of the Banach contraction principle have been thoroughly explored in various mathematical settings. Notably, Kannan [11] proposed a completely different type of contraction that also ensures the existence of a unique fixed point for the associated operator.
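For intuition, a minimal numerical sketch of this Picard iteration (in Python, with an illustrative contraction T(x) = 0.5x + 1 that is not taken from the references) shows the convergence to the unique fixed point:

# Minimal sketch of Banach's fixed-point (Picard) iteration for an illustrative
# contraction T(x) = 0.5*x + 1 with modulus gamma = 0.5 and fixed point x* = 2.
def T(x):
    return 0.5 * x + 1.0

x = 10.0                 # arbitrary starting point x_0
for n in range(30):
    x = T(x)             # x_{n+1} = T x_n
print(x)                 # approaches the unique fixed point x* = 2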
Kannan contractive mappings are less restrictive than Banach contractions as the contraction condition depends on the distances to the images (not just between points). This can be useful when dealing with operators in Markov Decision Processes (MDPs), where strict Banach contraction properties may not hold, particularly in the context of λ -policy iteration.
In [12], Vetro introduced an extension of Hardy–Rogers-type contractions, proving fixed-point theorems for self-mappings defined on metric spaces and ordered metric spaces. These results were also applied to multistage decision processes, demonstrating their applicability in optimization and decision-making frameworks. The framework established in [12] highlights the importance of generalized contractions in iterative decision-making processes, particularly in scenarios where classical contraction conditions do not apply directly.
Building on the foundations of fixed-point theory, this work extends and generalizes the results in [8]. Specifically, we demonstrate that the properties of the iterative process proposed in [8] remain valid for weakly contractive mappings, which represent a significantly broader class of systems than those previously considered, including the models studied in previous research.
A mapping T : X → X is defined as a Kannan mapping if there exists 0 ≤ γ < 1/2 such that
d(Tx, Ty) ≤ γ (d(x, Tx) + d(y, Ty)),
for all x, y ∈ X.
Later, Chatterjea [13] proposed a related contractive condition, expressed as follows:
d(T(x), T(y)) ≤ γ (d(x, Ty) + d(y, Tx)),
where 0 ≤ γ < 1/2 and x, y ∈ X.
Expanding on these contraction principles, Zamfirescu [14] introduced a broader class of contractive mappings that unify and generalize the conditions given by Banach, Kannan, and Chatterjea. This is expressed as follows: let X be a complete metric space, α, β, γ be real numbers with α < 1, β < 1/2, γ < 1/2, and T : X → X be a function such that, for each pair of distinct points x, y ∈ X, at least one of the following conditions is satisfied:
(1)
‖Tx − Ty‖ ≤ α‖x − y‖,
(2)
‖Tx − Ty‖ ≤ β(‖x − Tx‖ + ‖y − Ty‖),
(3)
‖Tx − Ty‖ ≤ γ(‖x − Ty‖ + ‖y − Tx‖).
Then, T has a unique fixed point. These generalized contraction principles form the basis for the results presented in this study.
Building on these theoretical foundations, our work advances reinforcement learning algorithms by developing convergence guarantees for weakly contractive mappings via Zamfirescu’s fixed-point theorem, significantly broadening the applicability beyond classical Banach contraction settings. We further demonstrate norm convergence in infinite-dimensional policy spaces under substantially relaxed assumptions, overcoming longstanding limitations in analyzing practical reinforcement learning systems. This framework successfully handles several critical challenges, including non-contractive operators, continuous-state and -action spaces inducing infinite policy dimensions, and discontinuous operators frequently encountered in real-world application cases where conventional fixed-point methods typically fail.
Dynamic programming operators, like the Bellman operator in Markov Decision Processes (MDPs), often do not meet the Banach contraction conditions due to the inherent structure of the problem, such as the presence of high discount factors or incomplete policy evaluations. In these situations, Zamfirescu’s fixed-point theorem offers a more flexible and powerful framework, enabling the establishment of convergence where one of the three contractions may fail. This alternative approach is particularly relevant for handling complex problems like those encountered in λ-policy iteration and other reinforcement learning contexts, where weak contraction properties may hold.
Zamfirescu’s fixed-point theorem is particularly suitable for reinforcement learning because it deals with weakly contractive mappings, which are more general than standard Banach contractions. In reinforcement learning, especially in scenarios with high discount factors or non-standard reward structures, the operators involved may not satisfy the strong contraction conditions of Banach’s theorem. However, they might still satisfy the weaker conditions provided by Zamfirescu’s theorem, ensuring convergence of iterative algorithms.
In this paper, we employ fixed-point theory techniques to explore policy iteration and value iteration algorithms for mappings classified as weak contractions. By leveraging Zamfirescu’s theorem, we ensure the existence and uniqueness of solutions as well as the convergence of the involved operators. This approach is particularly significant for its applicability to discontinuous operators, offering a broader framework than traditional Banach or Kannan contractions.
Building on this foundation, we extend the convergence analysis of reinforcement learning algorithms—specifically value iteration (VI) and policy iteration (PI)—to a wider class of operators, satisfying Zamfirescu’s weak contraction conditions. Prior works, such as [8,9], rely on stricter contraction assumptions or compact operator properties, limiting their applicability. In contrast, our use of Zamfirescu’s theorem [14] accommodates discontinuous operators and non-uniform contraction constants, as demonstrated in our robotic navigation example (Section 5). This flexibility allows convergence guarantees in dynamic programming problems with high discount factors or irregular reward structures, common in feedback control. Moreover, we provide explicit sufficient conditions for norm convergence in infinite-dimensional policy spaces, enhancing the practicality of VI and PI in complex reinforcement learning tasks, such as real-time adaptive control.
The recent advances in policy iteration methods have evolved along two primary paths. The first involves viscosity approaches for regularized control problems [15], while the second focuses on constrained optimization frameworks [16]. Unlike these works, which assume Lipschitz continuity or explicit control constraints, our approach ensures convergence under milder conditions using fixed-point methods. This makes it especially well suited for reinforcement learning applications with sparse rewards or discontinuous dynamics.
The structure of this paper is as follows. In the upcoming section, we define the iterative feedback control problem under investigation, outlining the core definitions and the necessary assumptions that the data must meet. In the following section, Section 3, we introduce several fixed-point results that form the foundation of our contributions, which will also be presented in this section. The value iteration and policy iteration framework is developed in Section 4, within the framework of Banach, Kannan, and Chatterjea operators, and under the perspective of Zamfirescu’s fixed-point theorem. In Section 5, we provide an application of our findings in real-world scenarios, demonstrating their practical relevance. Lastly, we briefly discuss the conclusions and potential directions for future research in Section 6.

2. Preliminaries

We examine a collection of states denoted by X, a collection of controls denoted by U, and, for each state x ∈ X, a non-empty subset of controls U(x) ⊆ U. The set M is defined as the collection of all functions μ : X → U satisfying μ(x) ∈ U(x) for every x ∈ X, which we will refer to as policies in the subsequent analysis.
Let υ(X) denote the space of real-valued functions V : X → R, where all outputs are finite. Its extension, ῡ(X), represents the space of extended real-valued functions V : X → R̄, where R̄ = R ∪ {−∞, +∞} incorporates infinite values. This distinction separates conventional function spaces (υ(X)) from their generalized counterparts (ῡ(X)), which are essential for handling unbounded behavior in optimization, measure theory, and related fields.
We analyze an operator defined as follows:
H : X × U × υ(X) → R,
and, for each policy μ ∈ M, we introduce a mapping F_μ : υ(X) → υ(X) given by
F_μV(x) := H(x, μ(x), V), for all x ∈ X.
Next, we define a mapping F : υ(X) → ῡ(X) as
FV(x) := inf_{μ∈M} {F_μV(x)}, for all x ∈ X.
Based on the definitions above, we can derive
FV(x) = inf_{μ∈M} {H(x, μ(x), V)} = inf_{u∈U} {H(x, u, V)}.
For a given positive function ν : X → R, we define B(X) as the set of functions V for which ‖V‖ < ∞, with the norm ‖·‖ on B(X) specified as
‖V‖ = sup_{x∈X} |V(x)| / ν(x).
We assume that ν(x) is positive, bounded, and satisfies inf_{x∈X} ν(x) > 0 to guarantee that ‖V‖ remains finite, even when X is unbounded. The weighting function ν(x) normalizes |V(x)| spatially via |V(x)|/ν(x) and ensures ‖V‖ < ∞ under inf_{x∈X} ν(x) > 0, making B(X) a proper normed space even for unbounded X or growing V(x).
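As a concrete illustration (a sketch only; the grid, value function, and weight ν(x) = 1 + x below are hypothetical choices, not taken from the paper), the weighted supremum norm can be evaluated on a sampled state space:

import numpy as np

# Sketch: weighted supremum norm ||V|| = sup_x |V(x)| / nu(x) on a sampled
# state space X = [0, 1]; the grid, V, and nu are illustrative choices.
xs = np.linspace(0.0, 1.0, 101)
V = xs**2                       # an example value function
nu = 1.0 + xs                   # positive, bounded weight with inf nu = 1 > 0
weighted_norm = np.max(np.abs(V) / nu)
print(weighted_norm)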
The following lemma naturally follows from these definitions:
Lemma 1
([8]). The space B(X) is complete under the topology defined by the norm ‖·‖.
Additionally, it is evident that B(X) is both closed and convex. Therefore, for a sequence {V_k}_{k=1}^∞ ⊂ B(X) and a function V ∈ B(X), if V_k converges to V such that lim_{k→∞} ‖V_k − V‖ = 0, it follows that lim_{k→∞} V_k(x) = V(x) for every x ∈ X.
We now present the following standard assumptions.
Assumption 1
([8] (Well-posedness)). For every V ∈ B(X) and every policy μ ∈ M, it holds that F_μV ∈ B(X) and FV ∈ B(X). This assumption ensures that the value functions remain bounded under the given norm, which is typical in discounted MDPs commonly used in reinforcement learning to model decision-making processes.
Definition 1
([11] (Kannan’s mapping)). A self-map F_μ of B(X) is called a Kannan mapping if there exists a constant γ ∈ [0, 1/2) such that
‖F_μV − F_μV′‖ ≤ γ(‖V − F_μV‖ + ‖V′ − F_μV′‖), ∀ V, V′ ∈ B(X).
One immediate consequence of Definition 1 is that, if each policy evaluation operator F_μ is a Kannan contraction with modulus γ, then the Bellman optimality operator F, defined as
FV(x) := inf_{μ∈M} (F_μV)(x),
is also a Kannan contraction with the same modulus γ. To show this, observe that
(F_μV)(x)/ν(x) ≤ (F_μV′)(x)/ν(x) + γ(‖V − F_μV‖ + ‖V′ − F_μV′‖), ∀ x ∈ X, μ ∈ M.
Taking the infimum over μ ∈ M yields
FV(x)/ν(x) ≤ FV′(x)/ν(x) + γ(‖V − F_μV‖ + ‖V′ − F_μV′‖).
Since FV = inf_μ F_μV, it follows that ‖V − F_μV‖ ≤ ‖V − FV‖; hence,
FV(x)/ν(x) ≤ FV′(x)/ν(x) + γ(‖V − FV‖ + ‖V′ − FV′‖), ∀ x ∈ X.
By symmetry, the reverse inequality also holds as follows:
FV′(x)/ν(x) ≤ FV(x)/ν(x) + γ(‖V − FV‖ + ‖V′ − FV′‖).
Combining both yields the Kannan contraction inequality for F:
‖FV − FV′‖ ≤ γ(‖V − FV‖ + ‖V′ − FV′‖),
thereby confirming that F is a Kannan contraction. This is relevant to RL as it allows convergence analysis in policy evaluation steps, where operators may not be strongly contractive but still exhibit weaker contractive properties suitable for infinite state–action spaces.
This kind of mapping is very important in metric fixed-point theory. It is well known that Banach’s contraction mappings are continuous, while Kannan’s mappings are not necessarily continuous. Kannan’s theorem is important because Subrahmanyam [17] proved that Kannan’s theorem characterizes metric completeness. That is, a metric space X is complete if and only if every mapping satisfying (2) on X with 0 ≤ γ < 1/2 has a fixed point. Banach’s contraction does not have this property.
Example 1
([11]). Let E = {V ∈ B(X) : 0 ≤ V(x) ≤ 1} and FV(x) = V(x)/4 if V(x) ∈ [0, 1/2), FV(x) = V(x)/5 if V(x) ∈ [1/2, 1]. Here, F is discontinuous at V(x) = 1/2; consequently, the Banach contraction condition is not satisfied. However, it is easily seen that Kannan’s condition is satisfied by taking α = 4/9.
Example 2.
Let X = R be a metric space and T : X → X be defined by
T(x) = 0 if x ≤ 2, and T(x) = −1/2 if x > 2.
Then, T is not continuous on X, yet T satisfies the contractive condition (2) with γ = 1/5.
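A quick numerical check of Example 2 (a sketch; the sampling grid is an arbitrary choice) confirms that the Kannan inequality with γ = 1/5 holds even though T is discontinuous at x = 2:

import numpy as np

# Sketch: verify d(Tx, Ty) <= gamma*(d(x, Tx) + d(y, Ty)) with gamma = 1/5
# for the discontinuous map of Example 2 (T x = 0 for x <= 2, -1/2 for x > 2).
def T(x):
    return 0.0 if x <= 2.0 else -0.5

gamma = 0.2
grid = np.linspace(-3.0, 5.0, 161)
ok = all(
    abs(T(x) - T(y)) <= gamma * (abs(x - T(x)) + abs(y - T(y))) + 1e-12
    for x in grid for y in grid
)
print(ok)   # True: the Kannan condition holds on this sample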
Definition 2
([13] (Chatterjea’s mapping)). A self-map F_μ of B(X) is called a Chatterjea mapping if there exists γ ∈ [0, 1/2) such that
‖F_μV − F_μV′‖ ≤ γ(‖V − F_μV′‖ + ‖V′ − F_μV‖).
This inequality holds for all V, V′ ∈ B(X).

Notation Summary

The key mathematical notations used throughout this paper are summarized in Table 1.

3. Auxiliary Results

In this section, a number of results that will be instrumental in the proof of the main result of this article are presented.
Theorem 1
([11,14] (Existence)). Suppose that the operators F, F_μ : B(X) → B(X) are Kannan mappings. Then, there exist fixed points V* for F and V_μ for F_μ, respectively.
Based on the result of Theorem 1, we can establish the following:
Lemma 2
([11]). The following properties are satisfied:
(i) 
Starting with any V_0 ∈ B(X), the sequence {V_k} generated by the iteration V_{k+1} = F_μV_k converges to V_μ in the norm.
(ii)
Starting with any V_0 ∈ B(X), the sequence {V_k} generated by the iteration V_{k+1} = FV_k converges to V* in the norm.
The results of Lemma 2 highlight two fundamental iterative processes in dynamic programming: value iteration (VI) and policy iteration (PI). Value iteration is an approach in which the value function is updated iteratively using the Bellman optimality operator F until it converges to the optimal value function V * . This approach does not directly maintain a policy at each iteration but instead progressively improves the value function, which can then be used to derive the optimal policy. In contrast, policy iteration operates through two alternating phases: policy evaluation and policy enhancement. The evaluation phase, as indicated in the lemma by V k + 1 = F μ V k , focuses on calculating the value function V μ for a given policy μ . Following this, the enhancement phase updates the policy to minimize the expected cost, and this cycle continues until the optimal policy is achieved. Unlike value iteration, which directly refines the value function, policy iteration employs systematic updates via policy evaluation and enhancement, often resulting in quicker convergence in real-world scenarios.
As noted above, the findings of Lemma 2 provide a foundation for value iteration (VI). However, to guarantee the success of policy iteration (PI), further assumptions must be introduced.
Assumption 2
([8] (Monotonicity)). For any V, V′ ∈ B(X), if V ≤ V′, then it follows that
H(x, u, V) ≤ H(x, u, V′), ∀ x ∈ X, u ∈ U(x),
where the inequality ≤ is understood in a pointwise manner throughout X.
Assumption 3
([8] (Attainability)). For every V ∈ B(X), there exists a policy μ ∈ M such that F_μV = FV.

Clarification of Contraction Assumptions

To ensure convergence of value iteration (VI) and policy iteration (PI), we precisely define the contraction properties assumed for the Bellman optimality operator F and the policy evaluation operator F μ , consistently applied across Definition 1, Theorem 1, and all convergence results.

4. Main Results

In this section, we employ the abstract dynamic programming (DP) framework developed in [18], integrating both the policy iteration (PI) and value iteration (VI) methods to extend finite-policy problems to three contractive models with infinite policy spaces. These extensions are analyzed through the theoretical lens of Zamfirescu’s fixed-point theorem. Infinite policy spaces naturally emerge in two key scenarios: (1) systems with infinite state spaces or (2) finite state spaces with uncountable control action sets. To rigorously address these cases, we investigate the following generalization of contractive mappings:
Theorem 2 
([14]). Let B(X) be a Banach space, α, β, γ real numbers with α < 1, β < 1/2, γ < 1/2, and F, F_μ : B(X) → B(X) functions such that, for each pair of distinct functions V, V′ ∈ B(X), at least one of the following conditions is satisfied:
(C1)
‖F_μV − F_μV′‖ ≤ α‖V − V′‖,
(C2)
‖F_μV − F_μV′‖ ≤ β(‖V − F_μV‖ + ‖V′ − F_μV′‖),
(C3)
‖F_μV − F_μV′‖ ≤ γ(‖V − F_μV′‖ + ‖V′ − F_μV‖).
Then, there exist fixed points V * for F and V μ for F μ , respectively.
Proof. 
We apply the same methodology as Zamfirescu [14], but within the framework of dynamic programming notation. Consider the number
δ = max{α, β/(1−β), γ/(1−γ)}.
Obviously, δ < 1. Now, choose an arbitrary V ∈ B(X) and fix an integer n ≥ 0. Consider the consecutive iterates F_μ^n V and F_μ^{n+1} V, and suppose F_μ^n V ≠ F_μ^{n+1} V; otherwise, F_μ^n V is a fixed point of F_μ. If for this pair condition (C1) is satisfied, then
‖F_μ^{n+1}V − F_μ^{n+2}V‖ ≤ α‖F_μ^n V − F_μ^{n+1}V‖ ≤ δ‖F_μ^n V − F_μ^{n+1}V‖.
If for this pair condition (C2) is verified, then
‖F_μ^{n+1}V − F_μ^{n+2}V‖ ≤ β(‖F_μ^n V − F_μ^{n+1}V‖ + ‖F_μ^{n+1}V − F_μ^{n+2}V‖),
which implies
‖F_μ^{n+1}V − F_μ^{n+2}V‖ ≤ (β/(1−β))‖F_μ^n V − F_μ^{n+1}V‖ ≤ δ‖F_μ^n V − F_μ^{n+1}V‖.
In case condition (C3) is satisfied,
‖F_μ^{n+1}V − F_μ^{n+2}V‖ ≤ γ(‖F_μ^n V − F_μ^{n+1}V‖ + ‖F_μ^{n+1}V − F_μ^{n+2}V‖),
implying the contraction
‖F_μ^{n+1}V − F_μ^{n+2}V‖ ≤ δ‖F_μ^n V − F_μ^{n+1}V‖.
This inequality, valid for every n, implies that {F_μ^n V}_{n=0}^∞ is a Cauchy sequence and therefore converges to some V* ∈ B(X). Write V_n := F_μ^n V. Since F_μ is not necessarily continuous, assume that ‖V* − F_μV*‖ = r > 0. Then, we can write the following estimate:
‖V* − F_μV*‖ ≤ ‖V* − V_{n+1}‖ + ‖V_{n+1} − F_μV*‖ = ‖V* − V_{n+1}‖ + ‖F_μV_n − F_μV*‖ ≤ ‖V* − V_{n+1}‖ + δ‖V_n − V*‖.
As both ‖V* − V_{n+1}‖ and ‖V_n − V*‖ tend to 0 as n → +∞, it follows that r ≤ 0, a contradiction. Hence, we have V* = F_μV*.
Now, we show that this fixed point V* is unique. Suppose this is not true; then, F_μW = W for some point W ∈ B(X) different from V*. Then,
‖F_μV* − F_μW‖ = ‖V* − W‖,
‖F_μV* − F_μW‖ > ‖V* − F_μV*‖ + ‖W − F_μW‖,
‖F_μV* − F_μW‖ = (1/2)(‖V* − F_μW‖ + ‖W − F_μV*‖),
ensuring that none of the three conditions of Theorem 2 are met at the points V * and W. □
The Zamfirescu theorem introduces three different types of contractions: Banach, Kannan, and Chatterjea contractions. The Banach-type contraction (α-condition) ensures that distances shrink by a fixed factor, guaranteeing a unique fixed point. The Kannan-type contraction (β-condition), instead of comparing two points directly, considers distances from each point to its image under the function, an essential property in dynamic programming (DP) models where strict contraction may not hold yet iterative updates still lead to convergence. The Chatterjea-type contraction (γ-condition) introduces a symmetric relationship between distances, providing additional flexibility in systems where direct pointwise contraction is too restrictive.
In short, Banach contractions impose a uniform reduction in distance, making them more restrictive and often inapplicable in RL settings with discontinuous operators or high discount factors. In contrast, Kannan contractions relax this requirement by focusing on distances to mapped points, offering a broader scope that aligns with the iterative nature of RL algorithms like value iteration, where convergence can occur even without continuity. Zamfirescu’s theorem unifies these conditions, providing a robust framework in which at least one condition holds, enhancing applicability over relying on Banach or Kannan contractions alone.
The advantage of this formulation is its adaptability: if one contraction condition fails, another may still hold, ensuring the fixed-point property. This makes the theorem particularly valuable in DP models, especially when dealing with infinite policy spaces that arise from either continuous-state spaces or uncountable action sets, where establishing a single global contraction is often impractical.
Corollary 1
(Banach [10]). Let B(X) be a complete metric space, 0 ≤ α < 1, and F_μ : B(X) → B(X) a function such that, for each pair of distinct points in B(X), condition (C1) of Theorem 2 is verified. Then, F_μ has a unique fixed point.
Corollary 2
(Kannan [11]). Let B(X) be a complete metric space, 0 ≤ β < 1/2, and F_μ : B(X) → B(X) a function such that, for each pair of distinct points in B(X), condition (C2) of Theorem 2 is verified. Then, F_μ has a unique fixed point.
Corollary 3
(Chatterjea [13]). Let B(X) be a complete metric space, 0 ≤ γ < 1/2, and F_μ : B(X) → B(X) a function such that, for each pair of distinct points in B(X), condition (C3) of Theorem 2 is verified. Then, F_μ has a unique fixed point.
The following proposition presents key implications of Zamfirescu’s Theorem 2.
Proposition 1.
Let Theorem 2 hold; then,
(a) 
For any V ∈ B(X):
‖V* − V‖ ≤ (1/(1−δ)) ‖FV − V‖.
(b) 
For any V ∈ B(X) and μ ∈ M:
‖V_μ − V‖ ≤ (1/(1−δ)) ‖F_μV − V‖.
Proof. 
Let us consider the sequence V_n = FV_{n−1} with V_0 = V. We have
‖FV_n − FV_{n−1}‖ ≤ δ‖FV_{n−1} − FV_{n−2}‖.
Then,
‖FV_n − FV_{n−1}‖ ≤ δ^{n−1}‖FV_1 − FV_0‖.
Now, we use the triangle inequality, where n represents the iteration index in the sequence {V_n}_{n=1}^∞ generated by applying the operator F iteratively. Proceeding as follows:
‖V_n − V‖ ≤ ∑_{k=1}^{n} ‖V_k − V_{k−1}‖ ≤ ∑_{k=1}^{n} δ^{k−1}‖FV − V‖.
Taking the limit as n → +∞ and using Lemma 2, we obtain
‖V* − V‖ ≤ (1/(1−δ))‖FV − V‖.
Brief Justification:
The bounds in Proposition 1 follow from three key observations:
1. Zamfirescu’s Contraction: The operator F satisfies
‖FV − FW‖ ≤ δ‖V − W‖, where δ = max{α, β/(1−β), γ/(1−γ)} < 1,
derived from conditions (C1)–(C3) in Theorem 2.
2. Iterative Shrinkage: For the sequence V_n = F^n V, we have
‖V_{n+1} − V_n‖ ≤ δ‖V_n − V_{n−1}‖ ≤ δ^n‖FV − V‖,
showing that each iteration reduces the distance by a factor of δ.
3. Geometric Convergence: The fixed-point error bound emerges from summing the geometric series:
‖V* − V‖ ≤ ∑_{k=1}^{∞} ‖V_k − V_{k−1}‖ ≤ (1/(1−δ))‖FV − V‖.
The identical logic applies to F_μ due to the uniform contraction across M. The factor 1/(1−δ) quantifies how the contraction strength governs the approximation error.
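A small numerical sketch (with an illustrative scalar contraction F(v) = 0.5v + 1, δ = 0.5, and fixed point V* = 2, none of which come from the paper) confirms the a priori bound of Proposition 1:

# Sketch: check ||V* - V|| <= (1/(1 - delta)) * ||F V - V|| for an illustrative
# scalar contraction F(v) = 0.5*v + 1 with delta = 0.5 and fixed point V* = 2.
delta = 0.5
F = lambda v: 0.5 * v + 1.0
v_star = 2.0
for v in [-5.0, 0.0, 1.3, 4.7]:
    lhs = abs(v_star - v)
    rhs = abs(F(v) - v) / (1.0 - delta)
    print(v, lhs <= rhs + 1e-12)    # True for every test point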
Remark 1.
Based on Proposition 1, if we set V = V*, it can be shown that, for any ϵ > 0, there exists a policy μ_ϵ ∈ M such that ‖V_{μ_ϵ} − V*‖ ≤ ϵ. This can be achieved by selecting μ_ϵ(x) to minimize H(x, u, V*) over U(x) with an error not exceeding (1−δ)ϵν(x) for all x ∈ X. Specifically, we obtain
‖V_{μ_ϵ} − V*‖ ≤ (1/(1−δ))‖F_{μ_ϵ}V* − V*‖ = (1/(1−δ))‖F_{μ_ϵ}V* − FV*‖ ≤ ϵ.
This bound holds because, by construction,
|F_{μ_ϵ}V*(x) − FV*(x)| ≤ (1−δ)ϵν(x).
The significance of monotonicity and Zamfirescu’s theorem is illustrated by proving that V*, the fixed point of F, represents the infimum over all μ ∈ M of V_μ, the unique fixed point of F_μ.
Proposition 2.
Let the monotonicity and Zamfirescu’s theorem conditions hold. Then, for all x ∈ X, we have
V*(x) = inf_{μ∈M} V_μ(x).
Proof. 
We will establish the two inequalities that define the infimum.
First inequality: V*(x) ≤ inf_{μ∈M} V_μ(x).
For all μ ∈ M, we have FV* ≤ F_μV*. Since FV* = V* by the definition of the fixed point, this simplifies to
V* ≤ F_μV*.
Applying the operator F_μ repeatedly to both sides and using the monotonicity assumption, we obtain
V* ≤ F_μ^k V* for all k > 0.
Taking the limit as k → ∞ and assuming the sequence F_μ^k V* converges, it follows that
V* ≤ V_μ for all μ ∈ M.
Thus, we have V*(x) ≤ inf_{μ∈M} V_μ(x).
Second inequality: V*(x) ≥ inf_{μ∈M} V_μ(x).
Using Remark 1, for any ϵ > 0, there exists μ_ϵ ∈ M such that
V_{μ_ϵ} ≤ V* + ϵ.
Thus, for each ϵ > 0, we have
inf_{μ∈M} V_μ(x) ≤ V_{μ_ϵ}(x) ≤ V*(x) + ϵ.
Taking the limit as ϵ → 0, we obtain
inf_{μ∈M} V_μ(x) ≤ V*(x).
Therefore, V*(x) ≥ inf_{μ∈M} V_μ(x).
Combining these two inequalities, we obtain
V*(x) = inf_{μ∈M} V_μ(x) for all x ∈ X. □
These results show that the fixed point of F acts as a lower bound for policy-dependent fixed points V μ , ensuring the optimal value function can be approximated within a controlled error. The upper bounds from Zamfirescu’s theorem highlight convergence and stability, key to feedback control in dynamic programming and optimization.
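As a small illustration of Proposition 2 (a sketch with two hypothetical scalar policy operators, not taken from the paper), iterating F = inf over the policy operators recovers the smallest policy-wise fixed point:

# Sketch: F_mu1(v) = 0.5*v + 1 has fixed point 2, F_mu2(v) = 0.5*v + 2 has
# fixed point 4; the Bellman operator F(v) = min over policies converges to
# V* = 2 = inf_mu V_mu, illustrating Proposition 2.
F_mu1 = lambda v: 0.5 * v + 1.0
F_mu2 = lambda v: 0.5 * v + 2.0
F = lambda v: min(F_mu1(v), F_mu2(v))

v = 0.0
for _ in range(60):
    v = F(v)
print(v)    # ~2.0, the smaller of the two policy-wise fixed points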

4.1. Algorithmic Framework

Based on Theorem 2, we describe the implementation of value iteration (VI) and policy iteration (PI) for weakly contractive operators:
Value Iteration Algorithm:
1. Initialize V_0 ∈ B(X) (typically V_0(x) = 0 for all x ∈ X).
2. Set a convergence threshold ϵ > 0 and a maximum number of iterations K_max.
3. For k = 0, 1, …, K_max − 1:
  • Compute V_{k+1}(x) = FV_k(x) = inf_{u∈U(x)} H(x, u, V_k) for all x ∈ X.
  • If ‖V_{k+1} − V_k‖ < ϵ(1 − δ)/δ, return V_{k+1} as the approximate fixed point V*.
4. Return V_{K_max} (with a warning if unconverged).
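A possible numerical sketch of this value iteration loop (in Python, on a discretized version of the navigation problem of Section 5 with c(x, u) = x² + u², f(x, u) = x + u(1 − x), and γ = 0.5; the grid sizes and linear interpolation are illustrative choices, not part of the paper's analysis) is given below.

import numpy as np

# Sketch: value iteration on a discretized state/control space, with the
# stopping rule ||V_{k+1} - V_k|| < eps*(1 - delta)/delta from step 3.
gamma, delta, eps, K_max = 0.5, 0.5, 1e-6, 1000
xs = np.linspace(0.0, 1.0, 101)        # discretized state space X
us = np.linspace(0.0, 1.0, 101)        # discretized control set U
V = np.zeros_like(xs)                  # V_0(x) = 0

def bellman_update(V):
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        nxt = x + us * (1.0 - x)                          # f(x, u) for all u
        q = x**2 + us**2 + gamma * np.interp(nxt, xs, V)  # H(x, u, V)
        V_new[i] = q.min()                                # inf over U(x)
    return V_new

for k in range(K_max):
    V_new = bellman_update(V)
    if np.max(np.abs(V_new - V)) < eps * (1 - delta) / delta:
        V = V_new
        break                                             # approximate V*
    V = V_new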
Policy Iteration Algorithm:
1. Initialize a policy μ_0 ∈ M arbitrarily.
2. Set a convergence threshold ϵ > 0 and a maximum number of iterations K_max.
3. For k = 0, 1, …, K_max − 1:
  (a) Policy Evaluation:
    • Initialize V_0^{μ_k} ∈ B(X).
    • Repeat: compute V_{n+1}^{μ_k}(x) = F_{μ_k}V_n^{μ_k}(x) = H(x, μ_k(x), V_n^{μ_k}) until ‖V_n^{μ_k} − V_{n−1}^{μ_k}‖ < ϵ(1 − δ)/δ.
  (b) Policy Improvement:
    • Update μ_{k+1}(x) = arg min_{u∈U(x)} H(x, u, V_n^{μ_k}) for all x ∈ X.
  (c) If μ_{k+1} = μ_k, return (V_n^{μ_k}, μ_k) as the optimal solution.
4. Return (V^{μ_{K_max}}, μ_{K_max}) (with a warning if unconverged).
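Similarly, a sketch of the policy iteration loop on the same illustrative discretization (again an assumption-laden toy setup, not the paper's implementation) follows.

import numpy as np

# Sketch: policy iteration with approximate policy evaluation (iterating F_mu)
# and greedy policy improvement, on the same toy discretization as above.
gamma, delta, eps, K_max = 0.5, 0.5, 1e-6, 100
xs = np.linspace(0.0, 1.0, 101)
us = np.linspace(0.0, 1.0, 101)
mu = np.zeros_like(xs)                       # initial policy mu_0(x) = 0

def evaluate(mu):
    V = np.zeros_like(xs)                    # policy evaluation: V <- F_mu V
    while True:
        nxt = xs + mu * (1.0 - xs)
        V_new = xs**2 + mu**2 + gamma * np.interp(nxt, xs, V)
        if np.max(np.abs(V_new - V)) < eps * (1 - delta) / delta:
            return V_new
        V = V_new

for k in range(K_max):
    V_mu = evaluate(mu)
    mu_new = np.empty_like(mu)               # policy improvement step
    for i, x in enumerate(xs):
        nxt = x + us * (1.0 - x)
        q = x**2 + us**2 + gamma * np.interp(nxt, xs, V_mu)
        mu_new[i] = us[np.argmin(q)]         # argmin_u H(x, u, V_mu)
    if np.allclose(mu_new, mu):
        break                                # mu_{k+1} = mu_k: optimal
    mu = mu_new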
Key features of these algorithms:
  • The stopping criterion ϵ(1 − δ)/δ derives from Proposition 1.
  • Policy evaluation leverages weak contraction properties (C1)–(C3) from Theorem 2.
  • Convergence is guaranteed even when standard contraction factors approach 1.

4.2. Examples

Consider the Banach space defined as
B(X) = {V ∈ B(X) : 0 < V(x) < 2},
with the norm ‖V‖ = sup_{x∈X} |V(x)|. Define the function F_μ : B(X) → B(X) as
F_μV = V/4 if V ∈ S_1 = {V ∈ B(X) : 0 ≤ V(x) ≤ 1}, and F_μV = 1/2 − V/4 if V ∈ S_2 = {V ∈ B(X) : 1 < V(x) ≤ 2}.
For all V, V′ ∈ B(X), the function satisfies the contractive inequality
‖F_μV − F_μV′‖ ≤ δ max{‖V − V′‖, (‖V − F_μV‖ + ‖V′ − F_μV′‖)/2, (‖V − F_μV′‖ + ‖V′ − F_μV‖)/2},
with δ = 1/2. Therefore, at least one of the three conditions in Zamfirescu’s theorem holds, ensuring a unique fixed point.
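The max-type inequality above can be checked numerically at sampled scalar values of V and V′ in (0, 2); the short sketch below (the sampling grid is an arbitrary choice) reports whether it holds with δ = 1/2.

import numpy as np

# Sketch: pointwise check of the max-type inequality with delta = 1/2 for the
# piecewise operator F_mu V = V/4 on S_1 and 1/2 - V/4 on S_2.
def F(v):
    return v / 4.0 if v <= 1.0 else 0.5 - v / 4.0

delta = 0.5
vals = np.linspace(0.01, 1.99, 200)
ok = all(
    abs(F(v) - F(w)) <= delta * max(
        abs(v - w),
        (abs(v - F(v)) + abs(w - F(w))) / 2.0,
        (abs(v - F(w)) + abs(w - F(v))) / 2.0,
    ) + 1e-12
    for v in vals for w in vals
)
print(ok)   # True on this sample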
A fixed point of F_μ satisfies V* = F_μV*, leading to the following cases. If V* ∈ S_1,
V* = V*/4 ⟹ V* − V*/4 = 0 ⟹ V*(1 − 1/4) = 0 ⟹ V* = 0.
This contradicts 0 < V* < 2, which implies V* ∉ S_1. If V* ∈ S_2,
V* = 1/2 − V*/4.
Solving for V*,
V* + V*/4 = 1/2 ⟹ (5/4)V* = 1/2 ⟹ V* = 2/5.
Since V* = 2/5 lies in (0, 2), it is the unique fixed point.
Given that V* represents the optimal cost-to-go function, the optimal feedback control follows by minimizing
u*(x) = arg min_u {C(x, u) + δ V*(f(x, u))}.
For a cost function C(x, u) = x² + u² and dynamics f(x, u) = ax + bu, the optimal control law is the linear state feedback
u*(x) = −Kx, K = b^{−1}λa.
With V* = 2/5, an explicit control computation is achievable, demonstrating the theorem’s application in control theory. Zamfirescu’s theorem ensures convergence, highlighting its significance in solving infinite-policy DP models.

5. Application

Consider a robotic navigation task in a bounded environment, where the goal is to optimize a warehouse robot’s path to a docking station while minimizing energy consumption. Let X = [0, 1] represent the normalized position space, U = [0, 1] denote the control input space (e.g., speed), and B(X) = {V ∈ B(X) : 0 < V(x) < 2} be the Banach space of value functions with norm ‖V‖ = sup_{x∈X} |V(x)|. The operator H(x, u, V) models the cost-to-go as
H ( x , u , V ) = c ( x , u ) + γ V ( f ( x , u ) ) ,
where c(x, u) = x² + u², γ = 0.5 is the discount factor, and f(x, u) = x + u(1 − x). Define the policy evaluation operator F_μV(x) = x² + μ(x)² + 0.5 V(x + μ(x)(1 − x)) and the Bellman optimality operator FV(x) = inf_u {x² + u² + 0.5 V(x + u(1 − x))}.
To compare Zamfirescu’s and Banach’s approaches, consider a piecewise operator for μ(x) = 0.5:
F_μV(x) = x² + 0.25 + 0.5·(V(x)/4) if V(x) ∈ S_1 = {V ∈ B(X) : 0 ≤ V(x) ≤ 1}, and F_μV(x) = x² + 0.25 + 0.5·(1/2 − V(x)/4) if V(x) ∈ S_2 = {V ∈ B(X) : 1 < V(x) ≤ 2}.
This operator satisfies Zamfirescu’s conditions:
‖F_μV − F_μV′‖ ≤ δ max{‖V − V′‖, (‖V − F_μV‖ + ‖V′ − F_μV′‖)/2, (‖V − F_μV′‖ + ‖V′ − F_μV‖)/2},
with δ = 0.5. For Banach’s approach, we test a modified operator required to satisfy
‖F_μV − F_μV′‖ ≤ 0.9‖V − V′‖.
The fixed point at x = 0.5 in S_1 is
V*(0.5) = 0.5 + 0.125 V*(0.5) ⟹ 0.875 V*(0.5) = 0.5 ⟹ V*(0.5) ≈ 0.571.
The optimal control is
u*(x) = arg min_u {x² + u² + 0.5·0.571·(0.5x + 0.5u)},
yielding u*(x) = −Kx with K ≈ 0.667.
To validate Zamfirescu’s flexibility versus Banach’s limitations, we analyze convergence with V_0(x) = 1. For F_μV(x) = x² + 0.25 + 0.5·(V(x)/4) in S_1 at x = 0.5, Zamfirescu uses
  • Banach-type: ‖F_μV − F_μV′‖ ≤ 0.125‖V − V′‖.
  • Kannan-type: ‖F_μV − F_μV′‖ ≤ 0.2(‖V − F_μV‖ + ‖V′ − F_μV′‖).
  • Chatterjea-type: ‖F_μV − F_μV′‖ ≤ 0.2(‖V − F_μV′‖ + ‖V′ − F_μV‖).
Zamfirescu switches conditions dynamically, ensuring convergence. For Banach, the operator’s discontinuity at V = 1 causes failure. Consider V(0.5) = 1 and V′(0.5) = 1 + ϵ:
F_μV(0.5) = 0.625, F_μV′(0.5) = 0.625 − 0.125ϵ,
|F_μV(0.5) − F_μV′(0.5)| / |V(0.5) − V′(0.5)| = 0.125ϵ/ϵ = 0.125.
However, boundary effects amplify the effective contraction factor. As shown in Figure 1, the operator F_μ exhibits piecewise behavior with a stable fixed point at V* ≈ 0.571 under Zamfirescu’s framework, while Banach’s method becomes unstable near V = 1. This instability is further quantified in Table 2, where Banach’s single-rate contraction (α = 0.9) yields an error of 0.031 after 10 iterations, significantly higher than Zamfirescu’s adaptive methods (errors ≤ 0.00016).
These results highlight the usefulness of Zamfirescu’s approach. The Banach approach fails due to the discontinuity at V = 1 (see Appendix A), which amplifies the contraction factor in the discretized setting, while Zamfirescu’s condition-switching ensures robust convergence for robotic navigation.
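For completeness, a minimal sketch (not the script used to produce Figure 1 or Table 2) of iterating the piecewise operator at x = 0.5 from V_0(0.5) = 1 shows the convergence to the fixed point V*(0.5) ≈ 0.571 computed above.

# Sketch: iterate the piecewise operator of Section 5 at x = 0.5 starting from
# V_0(0.5) = 1; the iterates approach the fixed point V*(0.5) = 0.5/0.875 ~ 0.571.
def F_mu(v, x=0.5):
    if v <= 1.0:
        return x**2 + 0.25 + 0.5 * (v / 4.0)
    return x**2 + 0.25 + 0.5 * (0.5 - v / 4.0)

v = 1.0
for k in range(10):
    v = F_mu(v)
print(v)    # ~0.5714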

6. Conclusions

In this article, we extended policy iteration and value iteration algorithms to mappings that are merely weak contractions, broadening their applicability beyond traditional strong contractions such as Banach mappings. By leveraging Zamfirescu’s fixed-point theorem, we established sufficient conditions for the existence, uniqueness, and convergence of solutions in infinite-dimensional policy spaces. Our findings highlight the flexibility of fixed-point methods in reinforcement learning and dynamic programming, particularly for handling discontinuous operators. Our ongoing research aims to explore further generalizations, such as λ-policy iteration with randomization, stochastic extensions incorporating probabilistic transitions in Markov Decision Processes, and control engineering applications such as real-time adaptive control in robotics or traffic management systems, to enhance the practical impact of these theoretical advancements.

Author Contributions

Conceptualization, A.B. and R.C.; methodology, A.B.; validation, R.C.; formal analysis, A.B., R.C. and R.G.; investigation, A.B. and R.C.; writing—original draft preparation, A.B.; writing—review and editing, R.C. and R.G.; visualization, A.B. and R.C.; funding acquisition, R.C. All authors have read and agreed to the published version of the manuscript.

Funding

A.B. acknowledges the financial support of the Foundation for Science and Technology (FCT, Portugal) in the framework of grant 2021.07608.BD. The authors also acknowledge the financial support of the FCT in the framework of ARISE (DOI 10.54499/LA/P/0112/2020) and R&D Unit SYSTEC (base UIDB/00147/2020 and programmatic UIDP/00147/2020 funds).

Data Availability Statement

The datasets analyzed in this paper are not readily available due to technical limitations.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Analysis of Banach Contraction Failure in Robotic Navigation Example

To clarify why the standard Banach contraction principle fails for the policy evaluation operator F_μ in the robotic navigation example, we provide a rigorous analysis of its behavior across the Banach space B(X) = {V ∈ B(X) : 0 < V(x) < 2} with the supremum norm ‖V‖ = sup_{x∈X} |V(x)|. The operator F_μ is defined piecewise as
F_μV(x) = x² + 0.25 + 0.5·(V(x)/4) if V(x) ∈ S_1 = {V ∈ B(X) : 0 ≤ V(x) ≤ 1}, and F_μV(x) = x² + 0.25 + 0.5·(1/2 − V(x)/4) if V(x) ∈ S_2 = {V ∈ B(X) : 1 < V(x) ≤ 2}.
A Banach contraction requires the existence of a constant α ∈ [0, 1) such that
‖F_μV − F_μV′‖ ≤ α‖V − V′‖ for all V, V′ ∈ B(X).
In what follows we demonstrate why this condition fails due to the operator’s discontinuity at V ( x ) = 1 .

Appendix A.1. Local Contractivity at the Boundary

We analyze the operator F_μ at x = 0.5 for V(0.5) = 1 ∈ S_1 and V′(0.5) = 1 + ϵ ∈ S_2:
F_μV(0.5) = 0.5² + 0.25 + 0.5·(1/4) = 0.625,
F_μV′(0.5) = 0.5² + 0.25 + 0.5·(1/2 − (1 + ϵ)/4) = 0.625 − 0.125ϵ,
|F_μV(0.5) − F_μV′(0.5)| / |V(0.5) − V′(0.5)| = 0.125ϵ/ϵ = 0.125.
This suggests local contractivity with α = 0.125 near the boundary V ( x ) = 1 at x = 0.5 as the operator contracts the distance between V and V locally.

Appendix A.2. Failure of Uniform Contractivity

Despite local contractivity, the operator fails to satisfy the Banach contraction condition uniformly across B(X) due to its discontinuity at V(x) = 1. To illustrate, consider a discretized state space X = {x_1, …, x_N} (e.g., N = 100) and functions V, V′ ∈ B(X) such that V(x_i) = 1 − ϵ ∈ S_1 and V′(x_i) = 1 + ϵ ∈ S_2 for some x_i, and V(x_j) = V′(x_j) for j ≠ i. The norm is ‖V − V′‖ = sup_{x∈X} |V(x) − V′(x)| = |(1 − ϵ) − (1 + ϵ)| = 2ϵ. Compute
F_μV(x_i) = x_i² + 0.25 + 0.5·(1 − ϵ)/4 = x_i² + 0.375 − 0.125ϵ,
F_μV′(x_i) = x_i² + 0.25 + 0.5·(1/2 − (1 + ϵ)/4) = x_i² + 0.375 − 0.125(1 + ϵ) = x_i² + 0.25 − 0.125ϵ,
|F_μV(x_i) − F_μV′(x_i)| = |(x_i² + 0.375 − 0.125ϵ) − (x_i² + 0.25 − 0.125ϵ)| = 0.125,
‖F_μV − F_μV′‖ = sup_{x∈X} |F_μV(x) − F_μV′(x)| = 0.125,
‖F_μV − F_μV′‖ / ‖V − V′‖ = 0.125/(2ϵ) → ∞ as ϵ → 0.
This shows that, near the discontinuity at V ( x ) = 1 , the contraction factor becomes arbitrarily large, violating the requirement for a uniform α < 1 . The discontinuity causes the operator to amplify differences across the boundary, preventing a global contraction constant.

References

  1. Bellman, R. The theory of dynamic programming. Bull. Am. Math. Soc. 1954, 60, 503–515. [Google Scholar] [CrossRef]
  2. Arutyunov, A.; Jaćimović, V.; Pereira, F. Second order necessary conditions for optimal impulsive control problems. J. Dyn. Control Syst. 2003, 9, 131–153. [Google Scholar] [CrossRef]
  3. Fraga, S.L.; Pereira, F.L. Hamilton-Jacobi-Bellman equation and feedback synthesis for impulsive control. IEEE Trans. Autom. Control 2011, 57, 244–249. [Google Scholar] [CrossRef]
  4. Chertovskih, R.; Ribeiro, V.M.; Gonçalves, R.; Aguiar, A.P. Sixty Years of the Maximum Principle in Optimal Control: Historical Roots and Content Classification. Symmetry 2024, 16, 1398. [Google Scholar] [CrossRef]
  5. Bertsekas, D.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996. [Google Scholar]
  6. Bertsekas, D.P.; Ioffe, S. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming; Laboratory for Information and Decision Systems Report LIDS-P-2349; MIT: Cambridge, MA, USA, 1996; Volume 14. [Google Scholar]
  7. Bertsekas, D.P. Proximal algorithms and temporal difference methods for solving fixed point problems. Comput. Optim. Appl. 2018, 70, 709–736. [Google Scholar] [CrossRef]
  8. Li, Y.; Johansson, K.H.; Mårtensson, J. Lambda-policy iteration with randomization for contractive models with infinite policies: Well-posedness and convergence. In Proceedings of the Learning for Dynamics and Control, Online, 10–11 June 2020; pp. 540–549. [Google Scholar]
  9. Belhenniche, A.; Benahmed, S.; Pereira, F.L. Extension of λ-PIR for weakly contractive operators via fixed point theory. Fixed Point Theory 2021, 22, 511–526. [Google Scholar] [CrossRef]
  10. Banach, S. On operations in abstract sets and their application to integral equations. Fundam. Math. 1922, 3, 133–181. [Google Scholar] [CrossRef]
  11. Kannan, R. Some results on fixed points—II. Am. Math. Mon. 1969, 76, 405–408. [Google Scholar]
  12. Vetro, F. F-Contractions of Hardy–Rogers Type and Application to Multistage Decision. Nonlinear Anal. Model. Control 2016, 21, 531–546. [Google Scholar] [CrossRef]
  13. Chatterjea, S.K. Fixed point theorems for a sequence of mappings with contractive iterates. Publ. l’Institut Mathématique 1972, 14, 15–18. [Google Scholar]
  14. Zamfirescu, T. Fix point theorems in metric spaces. Arch. Math. 1972, 23, 292–298. [Google Scholar] [CrossRef]
  15. Tang, W.; Tran, H.V.; Zhang, Y.P. Policy Iteration for the Deterministic Control Problems: A Viscosity Approach. SIAM J. Control Optim. 2025, 63, 375–401. [Google Scholar] [CrossRef]
  16. Kundu, S.; Kunisch, K. Policy iteration for Hamilton–Jacobi–Bellman equations with control constraints. Comput. Optim. Appl. 2024, 87, 785–809. [Google Scholar] [CrossRef]
  17. Subrahmanyam, P.V. Completeness and fixed-points. Monatshefte Math. 1975, 80, 325–330. [Google Scholar] [CrossRef]
  18. Bertsekas, D. Abstract Dynamic Programming; Athena Scientific: Belmont, MA, USA, 2022. [Google Scholar]
Figure 1. F_μ at x = 0.5, showing piecewise behavior (blue for V ≤ 1; red for V > 1). Zamfirescu stabilizes at V* ≈ 0.571 (black dot), unlike Banach’s limitation.
Table 1. Summary of mathematical notations.
Symbol | Description
X | State space
U | Control/action space
U(x) | Admissible controls for state x ∈ X
M | Policy space (set of admissible policies)
μ | Policy (μ : X → U)
B(X) | Banach space of bounded value functions
υ(X) | Space of value functions (X → R)
ῡ(X) | Extended value functions (X → R̄)
H(x, u, V) | Cost-to-go function
F_μ | Policy evaluation operator
F | Bellman optimality operator
V*(x) | Optimal value function
V_μ(x) | Value function for policy μ
ν(x) | Weighting function for norm definition
‖·‖ | Weighted supremum norm
γ, δ | Contraction/modulus parameters
α, β | Zamfirescu contraction coefficients
S_1, S_2 | Subsets of B(X) (Section 4.2)
Table 2. Convergence errors after 10 iterations at x = 0.5 (V* ≈ 0.571).
Approach | Contraction Type | Error |V_10 − V*|
Zamfirescu | Banach-type | ≈ 0.00014
Zamfirescu | Kannan-type | ≈ 0.00016
Zamfirescu | Chatterjea-type | ≈ 0.00015
Banach | Single (α = 0.9) | ≈ 0.031

