Article

A Hybrid Type-2 Fuzzy Double DQN with Adaptive Reward Shaping for Stable Reinforcement Learning

by Hadi Mohammadian KhalafAnsar 1, Jaime Rohten 2,* and Jafar Keighobadi 1
1 Faculty of Mechanical Engineering, University of Tabriz, Tabriz 51666-16471, Iran
2 Department of Electrical and Electronic Engineering, Universidad del Bío-Bío, Concepción 4051381, Chile
* Author to whom correspondence should be addressed.
AI 2025, 6(12), 319; https://doi.org/10.3390/ai6120319
Submission received: 11 November 2025 / Revised: 28 November 2025 / Accepted: 3 December 2025 / Published: 6 December 2025

Abstract

Objectives: This paper presents an innovative control framework for the classical Cart–Pole problem. Methods: The proposed framework combines Interval Type-2 Fuzzy Logic, the Dueling Double DQN deep reinforcement learning algorithm, and adaptive reward shaping. Specifically, the fuzzy logic acts as an a priori knowledge layer that incorporates measurement uncertainty in both angle and angular velocity, allowing the controller to generate adaptive actions dynamically, while the deep Q-network is responsible for learning the optimal policy. To ensure stability, the Double DQN mechanism alleviates the overestimation bias commonly observed in value-based reinforcement learning, and accelerated convergence is achieved through a multi-component reward shaping function that prioritizes angle stability and survival. Results: The training results show that the method stabilizes rapidly; it achieves a 100% success rate by episode 20 and maintains consistently high rewards (650–700) throughout training. While Standard DQN and other baselines take more than 100 episodes to become reliable, our method converges in about 20 episodes (4–5 times faster). Compared with advanced baselines such as C51 and PER, the proposed method is about 15–20% better in final performance. We also found that PPO and QR-DQN surprisingly struggle on this task, highlighting the need for stability mechanisms. Conclusions: The proposed approach provides a practical solution that balances exploration with safety through the integration of fuzzy logic and deep reinforcement learning. This rapid convergence is particularly important for real-world applications where data collection is expensive; the method achieves stable performance much faster than existing methods without requiring complex theoretical guarantees.

1. Introduction

Reinforcement learning (RL) has become a significant instrument of AI for problems that require sequential decision-making. The theoretical foundations of modern RL were essentially established in the textbook by Sutton and Barto [1]. Busoniu et al. later demonstrated that, with function approximation, RL could handle more complex, continuous domains [2], and Wiering and Van Otterlo went a step further, assembling a broad collection of approaches to adaptive learning systems [3].
By combining deep neural networks with RL, agents have been able to surpass human performance in challenging games such as Atari and the ancient board game Go. Mnih et al. demonstrated that deep Q-networks (DQN) could master 49 Atari games at a professional level using only raw pixel data [4]. Silver et al. then combined neural networks with Monte Carlo tree search to defeat world champions in Go [5]. These landmark victories were made possible by the deep learning boom described by LeCun et al., which provided the computing power to process high-dimensional inputs [6].
Despite these successes, deploying RL in the real world remains difficult. Challenges such as data scarcity, delayed rewards, high-dimensional state spaces, and strict safety constraints reduce the trustworthiness and slow the convergence of RL algorithms. The same issues were already apparent to Busoniu et al. when applying function approximation to actual control systems [2], and Wiering and Van Otterlo pointed out the practical obstacles encountered in real-world settings [3]. A recent systematic study by Dulac et al. confirmed that these limits persist in current RL technology, particularly in safety-critical domains [7].
To address these challenges, researchers have recommended combining RL with prior knowledge and fuzzy logic. The ARIC architecture was originally developed by Berenji in 1992, demonstrating that fuzzy rules encoding domain knowledge accelerate learning [8]. Subsequently, Er and Deng developed dynamic fuzzy Q-learning techniques that allow fuzzy controllers to adapt online [9]. Jamshidi and colleagues extended this concept to cloud computing and designed self-learning fuzzy Q-based controllers [10]. One of the earliest attempts to combine fuzzy inference with value-based RL was proposed by Glorennec and Jouffe [11].
Fuzzy logic is well suited to enhancing RL because it handles uncertainty and encodes human knowledge as transparent, comprehensible rules. Integrating these approaches yields hybrid systems that retain the flexibility of RL while incorporating the stability of expert knowledge. This synergy effectively addresses the primary safety and stability issues encountered in applying RL to the real world.
In deep RL, Q-learning remains one of the most fundamental algorithms, originally introduced by Watkins [12]. However, direct implementation with neural networks often leads to Q-value overestimation. To address this, van Hasselt et al. proposed the Double DQN algorithm [13]. Additional advancements—including Prioritized Experience Replay [14], distributional RL [15], and the Rainbow framework [16], which unifies several improvements—have further enhanced performance. From an architectural perspective, Wang et al. introduced dueling networks, which separate the state-value and advantage functions, thereby improving learning efficiency [17]. When combined with Rainbow, this architecture has achieved superior performance in large-scale environments [16].
Reward shaping also plays a critical role in accelerating RL. Ng et al. established that if potential-based reward shaping is employed, the optimal policy remains unchanged [18]. This finding has been repeatedly validated in real-world applications [19,20].
The synergy between RL and fuzzy logic has been extensively explored. Glorennec and Jouffe showed that fuzzy inference-based Q-value approximation improves convergence [11]. Interval Type-2 Fuzzy Logic Systems (IT2-FLS), developed by Mendel and Wu, offer even greater robustness in managing uncertainty [21,22]. Hsu and Juang further demonstrated that self-organizing Type-2 Fuzzy Q-learning yields more stable performance in noisy environments [23].
Recent studies have applied these hybrid approaches in domains such as energy management and traffic control [24]. For example, Rostami et al. showed that combining fuzzy controllers with deep policy methods improves energy management in fuel cell vehicles compared with pure RL [25]. Similarly, Kumar et al. [26] and Tunc and Soylemez [27] demonstrated that fuzzy inference systems can guide DRL agents in traffic signal control, significantly reducing congestion. Extending this line of work, Bi et al. [20] incorporated IT2 Fuzzy Logic to manage severe traffic uncertainties, achieving more adaptive and reliable systems.
Mobile robotics is another area where fuzzy–RL integration has been fruitful. Chen et al. [28] applied fuzzy controllers optimized via RL to enhance robot navigation and wall-following, yielding improved performance and interpretability. Building on this, Juang and You [29] proposed an Actor–Critic framework based on neuro-fuzzy networks, striking a balance between interpretability and optimal performance—an approach aligned with the broader trend toward transparent AI decision-making.
Advances in RL are closely tied to developments in deep learning. Landmark works such as LeCun et al. [6] and methods including ANFIS [30], fuzzy cognitive maps [31], and fuzzy ensemble techniques [32] have significantly influenced hybrid fuzzy–RL research. Additionally, approaches such as Guided Policy Search [33] demonstrate that policy optimization combined with control structures can improve stability and efficiency. Robust control methods—such as Active Disturbance Rejection Control (ADRC) [34], multilayer neuro-control [35], and disturbance rejection control [36]—remain widely used; however, these rely primarily on classical robustness principles rather than data-driven learning.
Nonetheless, several challenges remain. Existing methods often either focus narrowly on Q-value estimation or employ fuzzy knowledge solely as a function approximator, without actively integrating it into the learning and decision-making process. Moreover, while Type-2 Fuzzy Logic improves robustness to noise [20,23], it has typically been used in a passive or replacement role rather than an active integration within RL. Finally, few studies provide a unified framework that simultaneously ensures provable stability and interpretability, both of which are critical in real-world applications such as traffic and energy systems [7,25].
The main innovation of this hybrid framework is a robust solution for managing measurement uncertainty during the early phase of learning, specifically designed to address the cold-start and safe-exploration problems in reinforcement learning. In physical control environments such as Cart–Pole, state measurements such as angle and angular velocity are always affected by noise and uncertainty. While Type-1 Fuzzy Logic (T1FL) systems can only model linguistic ambiguity in the definition of statements, Interval Type-2 Fuzzy Logic (IT2FL), by introducing the concept of the Footprint of Uncertainty (FOU), can dynamically manage the uncertainty present in the membership functions. This feature makes our fuzzy system act as an a priori knowledge layer that filters and limits action outputs in the early stages, preventing the agent from entering the failure region and incurring large negative rewards. This resistance to sensor noise is a key factor in achieving rapid and stable convergence within 20 episodes, which standard methods lack.
This paper makes three key contributions:
1. A stability-guaranteeing controller synthesized via Lyapunov theory and LMIs, ensuring that the system always returns to a safe state, even in the presence of neural network policy errors.
2. To accelerate convergence, the conventional reward function is augmented with additional components, including an angle-stability reward, a low-angular-velocity reward, a survival reward, and a final-success reward.
3. An adaptive fusion of fuzzy inference and a Dueling Double DQN agent, weighted by a dynamic confidence factor, which balances exploration and exploitation and thereby improves learning stability.

2. Materials and Methods

The Cart–Pole problem is one of the classic and challenging benchmarks in nonlinear control and reinforcement learning due to its intrinsic instability, and it serves as a standard criterion for assessing the efficiency of control algorithms. In this system, a cart of mass $m_c$ moves on a frictionless horizontal surface, and a rigid rod of mass $m_p$ and length $L$ is hinged to the center of the cart. The purpose of the controller is to apply a horizontal force $F$ to the cart so that the rod remains in the vertical position ($\theta = 0$).
To extract the dynamical equations of the system, either Newton–Euler or Lagrangian laws are employed. Given the horizontal forces and moments, the equations of motion are obtained as follows:
$$\ddot{x} = \frac{F + m_p L \dot{\theta}^2 \sin\theta - m_p g \sin\theta\cos\theta}{m_c + m_p \sin^2\theta}, \qquad
\ddot{\theta} = \frac{g\sin\theta - \cos\theta\,\big(F + m_p L \dot{\theta}^2 \sin\theta\big)/(m_c + m_p)}{L\left(\dfrac{4}{3} - \dfrac{m_p \cos^2\theta}{m_c + m_p}\right)}$$
These relations reveal the system's inherently nonlinear form, since both trigonometric and multiplicative terms, such as $\dot{\theta}^2 \sin\theta$, appear. To apply the RL algorithm, the full state of the system is defined by the state vector $s_t$:
$$s_t = [\,x_t,\ \dot{x}_t,\ \theta_t,\ \dot{\theta}_t\,]^{T}$$
where the components $x_t$, $\dot{x}_t$, $\theta_t$, and $\dot{\theta}_t$ denote the position of the cart, the velocity of the cart, the pole angle relative to the vertical axis, and the angular velocity of the pole, respectively. This representation not only provides enough information for predicting the future behavior of the system, but also creates a suitable basis for fuzzy controllers and neural network-based RL.
A key feature of this model is that the input to the system is a single horizontal force, whereas the desired output is the stabilization of the pole at its vertical equilibrium. The system is therefore single-input, multi-output (SIMO) and requires an appropriate control law capable of driving it toward stability.
To simplify the analysis, the friction between the cart and the surface and the friction in the hinge joint are typically assumed to be negligible. However, noise and uncertainty in the measurements of $\theta$ and $\dot{\theta}$ are inevitable. Consequently, in subsequent stages, Type-2 fuzzy systems are used to capture these uncertainties.
In summary, the Cart–Pole model reveals two basic features of the system:
(1) Nonlinearity, due to trigonometric terms and nonlinear coupling.
(2) Inherent instability, as the upright equilibrium is unstable.
These features cause the classical linear controllers, such as LQR, to be insufficiently robust. Therefore, the proposed framework in this paper is designed by integrating Interval Type-2 Fuzzy Logic with deep reinforcement learning (DRL) algorithms to ensure stability and optimize performance.
An Interval Type-2 Fuzzy System is used to address measurement noise and environmental disturbances. For each input (angle and angular velocity), three triangular membership functions are defined. For the angle, for instance:
$$\mu_{\text{left}}(\theta) = \max\!\left(0,\ \min\!\left(\frac{\theta - (-0.5)}{0.3},\ \frac{0.5 - \theta}{0.3}\right)\right)$$
$$\mu_{\text{center}}(\theta) = \max\!\left(0,\ \min\!\left(\frac{\theta - (-0.2)}{0.2},\ \frac{0.2 - \theta}{0.2}\right)\right)$$
$$\mu_{\text{right}}(\theta) = \max\!\left(0,\ \min\!\left(\frac{\theta - 0.2}{0.3},\ \frac{0.5 - \theta}{0.3}\right)\right)$$
Similarly, for angular velocity:
$$\mu_{\text{slow}}(\dot{\theta}) = \max\!\left(0,\ \min\!\left(\frac{\dot{\theta} - (-2)}{1},\ \frac{0 - \dot{\theta}}{1}\right)\right)$$
$$\mu_{\text{medium}}(\dot{\theta}) = \max\!\left(0,\ \min\!\left(\frac{\dot{\theta} - (-1)}{1},\ \frac{1 - \dot{\theta}}{1}\right)\right)$$
$$\mu_{\text{fast}}(\dot{\theta}) = \max\!\left(0,\ \min\!\left(\frac{\dot{\theta} - 0}{1},\ \frac{2 - \dot{\theta}}{1}\right)\right)$$
To incorporate uncertainty in inputs, each input is perturbed within a limited interval:
$$\theta \in [\theta - \delta_\theta,\ \theta + \delta_\theta], \quad \delta_\theta = 0.05\ \text{rad}; \qquad
\dot{\theta} \in [\dot{\theta} - \delta_{\dot{\theta}},\ \dot{\theta} + \delta_{\dot{\theta}}], \quad \delta_{\dot{\theta}} = 0.02\ \text{rad/s}$$
The fuzzy inference mechanism samples N = 20 perturbed instances, and the output is averaged:
$$u_{\text{fuzzy}}(\theta, \dot{\theta}) = \frac{1}{N}\sum_{i=1}^{N} u(\theta_i, \dot{\theta}_i)$$
The fuzzy rule base consists of 9 rules capturing intuitive control logic. For instance, Rule 4: IF angle is Center AND angular velocity is Slow, THEN action = 0. The full rule table is given in Table 1. To implement Interval Type-2 Fuzzy inference efficiently for real-time control, we use a sampling-based approximation in the spirit of Nie–Tan type-reduction. Instead of the costly, iterative Karnik–Mendel algorithm, we draw 20 samples inside the Footprint of Uncertainty (FOU), defined by the perturbation intervals above. Averaging these samples provides a robust defuzzified output that preserves the uncertainty-handling characteristics of IT2-FLS while keeping inference latency low [37].
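A minimal Python sketch of this sampling-based inference is given below. The triangular membership functions are written in a generic (a, b, c) form and the rule consequents are placeholders (the authoritative consequent table is Table 1); only the FOU perturbation bounds (0.05 rad, 0.02 rad/s) and the 20-sample averaging follow the text.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership with support [a, c] and peak at b."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

def fuzzy_action(theta, theta_dot):
    """Type-1 inference for a single (theta, theta_dot) sample."""
    mu_theta = {"left":   tri(theta, -0.5, -0.35, -0.2),
                "center": tri(theta, -0.2,  0.0,   0.2),
                "right":  tri(theta,  0.2,  0.35,  0.5)}
    mu_vel = {"slow":   tri(theta_dot, -2.0, -1.0, 0.0),
              "medium": tri(theta_dot, -1.0,  0.0, 1.0),
              "fast":   tri(theta_dot,  0.0,  1.0, 2.0)}
    # Hypothetical consequents: -1 = push left, 0 = neutral, +1 = push right.
    rule_out = {("left", "slow"): 1, ("left", "medium"): 1, ("left", "fast"): 0,
                ("center", "slow"): 0, ("center", "medium"): 0, ("center", "fast"): -1,
                ("right", "slow"): 0, ("right", "medium"): -1, ("right", "fast"): -1}
    num, den = 0.0, 1e-9
    for a_lbl, mu_a in mu_theta.items():
        for v_lbl, mu_v in mu_vel.items():
            w = min(mu_a, mu_v)                 # rule firing strength (min t-norm)
            num += w * rule_out[(a_lbl, v_lbl)]
            den += w
    return num / den                            # weighted-average defuzzification

def it2_fuzzy_action(theta, theta_dot, n_samples=20,
                     d_theta=0.05, d_theta_dot=0.02, rng=None):
    """Approximate IT2 output: average over samples drawn inside the FOU."""
    rng = rng or np.random.default_rng(0)
    outs = [fuzzy_action(theta + rng.uniform(-d_theta, d_theta),
                         theta_dot + rng.uniform(-d_theta_dot, d_theta_dot))
            for _ in range(n_samples)]
    return float(np.mean(outs))
```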
Although the fuzzy controller ensures initial stability, performance optimization is achieved via a Dueling Double DQN. In this architecture, the feature representation is first computed as follows:
$$h = f_\phi(s) = \mathrm{ReLU}\big(W_2\, \mathrm{ReLU}(W_1 s + b_1) + b_2\big)$$
Two separate streams are then calculated:
$$V(s) = W_V\, \mathrm{ReLU}(W_{V1} h + b_{V1}) + b_V, \qquad A(s, a) = W_A\, \mathrm{ReLU}(W_{A1} h + b_{A1}) + b_A$$
with the Q-value assessed as follows:
$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)$$
For learning, the agent uses the Double DQN method. The optimal action in the next state is selected by the online network [12]:
$$a^{*}_{t+1} = \arg\max_{a} Q(s_{t+1}, a; \theta)$$
And the target value is calculated with the target network:
$$y_t = r_t^{\text{shaped}} + (1 - d_t)\cdot \gamma \cdot Q(s_{t+1}, a^{*}_{t+1}; \theta^{-}),$$
Here, $\theta^{-}$ denotes the parameters of the target network and $d_t$ is the end-of-episode indicator. The loss function is then [12]
$$L(\theta) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\big[\big(Q(s, a; \theta) - y_t\big)^2\big],$$
which is minimized using the gradient descent update [12]:
$$\theta \leftarrow \theta - \eta \cdot \nabla_\theta L(\theta),$$
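The following PyTorch sketch illustrates the Double DQN target and loss described above; `online_net`, `target_net`, and the replay `batch` are assumed to exist, and the Huber loss follows the implementation details reported in Section 3 (the loss equation above states the squared error).

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(online_net, target_net, batch, gamma=0.99):
    # batch: s [B,4] float, a [B] long, r_shaped [B], s_next [B,4], done [B]
    s, a, r_shaped, s_next, done = batch
    with torch.no_grad():
        # Action selection with the online network: a* = argmax_a Q(s', a; theta)
        a_next = online_net(s_next).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network: Q(s', a*; theta^-)
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)
        y = r_shaped + (1.0 - done) * gamma * q_next
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Huber loss as in the reported implementation; the equation uses squared error.
    return F.smooth_l1_loss(q_sa, y)
```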
While specific numerical coefficients are empirically set to balance the magnitude of the signals, the structure and selection of the components are based on the theory of Potential-Based Reward Shaping (PBRS), originally proposed by Ng et al. [18].
The angle term ($r_\theta$) and angular velocity term ($r_{\dot{\theta}}$) act as potential-difference operators ($\Phi(s') - \Phi(s)$). They are designed to create a dense reward landscape that guides the agent toward the stable equilibrium ($\theta = 0$, $\dot{\theta} = 0$) without changing the optimal policy of the underlying Markov decision process. Our methodology followed a constructive approach:
  • Sparse to Dense: We started with the reward of survival to prevent early termination.
  • State Regularization: We added potential-based conditions ( r θ , r θ ˙ ) to provide continuous gradient information.
  • Constraint Enforcement: Finally, the penalty term ($r_{\text{penalty}}$) was added to create a soft constraint boundary, which is important for safety-aware RL.
Our final shaping function:
$$r_t^{\text{shaped}} = r_{\text{base}} + r_\theta + r_{\dot{\theta}} + r_{\text{position}} + r_{\text{time}} + r_{\text{penalty}} + r_{\text{success}}$$
The angle component ($r_\theta = 2.0\,(1 - |\theta_t|/0.5)$) and angular velocity component ($r_{\dot{\theta}} = 1.5\,(1 - |\dot{\theta}_t|/2.0)$) exhibit potential-like properties, whereas other components, such as the survival bonus, serve distinct auxiliary functions. Notably, these components interact: testing only the potential-like parts gave a modest improvement (+32.5%), whereas the full shaping yielded +136.8%, suggesting a synergistic effect that empirically enhances learning efficiency beyond the sum of the individual parts.
The components in Equation (14) use carefully tuned coefficients that ensure both stability and learning efficiency: $r_{\text{base}} = 1.0$, $r_\theta = 2.0\,(1 - |\theta_t|/0.5)$, $r_{\dot{\theta}} = 1.5\,(1 - |\dot{\theta}_t|/2.0)$, $r_{\text{position}} = 0.5\,(1 - |x_t|/1.0)$, $r_{\text{time}} = 0.005\,t$, $r_{\text{penalty}} = -20.0$, and $r_{\text{success}} = 100.0$.
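A minimal sketch of this shaping function, using the coefficients listed above, is shown below; the exact conditions under which the failure penalty and success bonus fire are our assumption.

```python
def shaped_reward(theta, theta_dot, x, t, terminated, success):
    """Adaptive shaping of Equation (14); `terminated` marks a failure
    (pole or cart out of bounds), `success` marks reaching the max episode length."""
    r_base      = 1.0
    r_theta     = 2.0 * (1.0 - abs(theta) / 0.5)        # angle-stability term
    r_theta_dot = 1.5 * (1.0 - abs(theta_dot) / 2.0)    # low-angular-velocity term
    r_position  = 0.5 * (1.0 - abs(x) / 1.0)            # cart-centering term
    r_time      = 0.005 * t                             # survival bonus
    r_penalty   = -20.0 if terminated else 0.0          # failure penalty
    r_success   = 100.0 if success else 0.0             # terminal success bonus
    return (r_base + r_theta + r_theta_dot + r_position
            + r_time + r_penalty + r_success)
```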
Interaction between the IT2FL system and the D3QN network is achieved through a dynamic gating mechanism controlled by the confidence factor $\alpha$. This coefficient regulates the reliance on expert fuzzy knowledge versus the learned DQN policy. The final action $a_t^{\text{final}}$ at each step is computed as a weighted combination of the fuzzy action $a_t^{\text{Fuzzy}}$ and the DQN action $a_t^{\text{DQN}}$:
$$a_t^{\text{final}} = \alpha_t \cdot a_t^{\text{Fuzzy}} + (1 - \alpha_t)\cdot a_t^{\text{DQN}}$$
The parameter $\alpha_t$ follows an exponential decay, $\alpha_t = \max(\alpha_{\min},\ \alpha_{\text{initial}}\cdot \alpha_{\text{decay}}^{\text{episode}})$. In the initial phase of training, $\alpha$ is high (about 0.8), which ensures that the fuzzy policy steers the agent toward the safe zone (Fuzzy as Teacher). As training progresses and the DQN prediction error decreases, $\alpha$ decays to $\alpha_{\min}$ (about 0.1), and the agent relies almost entirely on the learned DQN policy. This provides a soft, automated transition from expert knowledge to the optimal policy.
The DQN action is chosen by an $\epsilon$-greedy policy:
$$a_t^{\text{DQN}} = \begin{cases} \text{random}(0, 1), & \xi < \epsilon_t \\ \arg\max_a Q(s_t, a; \theta), & \text{otherwise} \end{cases}$$
with a gradual reduction of $\epsilon$:
$$\epsilon_{t+1} = \max(\epsilon_{\min},\ \epsilon_t \times \lambda_\epsilon)$$
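The sketch below combines the $\alpha$-gated action fusion, the $\alpha$ decay schedule, and the $\epsilon$-greedy rule described above. The decay-rate values other than $\alpha_{\text{initial}} = 0.8$, $\alpha_{\min} = 0.1$, and $\epsilon_{\min} = 0.05$ are placeholders, and mapping the blended preference back to a discrete CartPole action by thresholding is an implementation assumption.

```python
import numpy as np

def select_action(q_values, u_fuzzy, alpha, epsilon, rng):
    """Blend the fuzzy preference with the epsilon-greedy DQN choice."""
    if rng.random() < epsilon:
        a_dqn = int(rng.integers(len(q_values)))      # random exploration
    else:
        a_dqn = int(np.argmax(q_values))              # greedy DQN action
    # Map discrete actions {0, 1} to force preferences {-1, +1}, blend with the
    # fuzzy output u_fuzzy in [-1, 1], then threshold back to a discrete action.
    blended = alpha * u_fuzzy + (1.0 - alpha) * (2 * a_dqn - 1)
    return 1 if blended > 0 else 0

def decay_schedules(episode, epsilon,
                    alpha_init=0.8, alpha_min=0.1, alpha_decay=0.95,
                    eps_min=0.05, lam_eps=0.995):
    """alpha_t = max(alpha_min, alpha_init * alpha_decay**episode); eps decays geometrically."""
    alpha = max(alpha_min, alpha_init * alpha_decay ** episode)
    eps_next = max(eps_min, epsilon * lam_eps)
    return alpha, eps_next
```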

2.1. Stability Analysis

Below is a complete theory–algorithm–implementation scheme that incorporates both Interval Type-2 Fuzzy LMIs and Lyapunov reduction conditions for Safe RL, allowing us to practically implement it with an LMI solver and a DQN training loop.
The dynamics of the discrete system is
$$x_{t+1} = f(x_t, u_t) = h(x_t, u_t) + g(x_t, u_t), \qquad x \in \mathcal{X},\ u \in \mathcal{U},$$
h is the nominal model, and g is the model error. We assume that h and g are Lipschitz continuous and that the policies are in class Π L with Lipschitz constant L π ; these are the same “continuity” assumptions for ensuring confidence bands and Lyapunov reduction in Safe RL.
We write the plant as a Type-2 T-S model with p rules and construct an Interval Type-2 (IT2) controller with c rules; the closed loop takes the form of the following “model-controller composition”:
$$\dot{x}(t) = \sum_{i=1}^{p}\sum_{j=1}^{c} \tilde{h}_{ij}(x(t))\,[A_i + B_i G_j]\, x(t),$$
where the weights h ~ i j are constructed from upper/lower membership functions and the Footprint of Uncertainty (FOU) division into subspaces l = 1 , , τ + 1 .
We consider the output of Dueling Double DQN as a “reference correction” on the fuzzy control [30]:
$$u_t = u_{\text{IT2}}(x_t) + \Delta u_\theta(x_t),$$
where Δ u θ comes from the dueling Q-network and is controlled with a Lipschitz constraint (regularized) to avoid breaking the Lyapunov stability guarantee.
If matrices $X = X^{\top} > 0$, $M = M^{\top}$, $N_j$, and $W_{ijl} \geq 0$ exist such that the following linear matrix inequalities hold for all rules and sub-FOUs:
$$X > 0, \quad W_{ijl} \geq 0, \quad Q_{ij} + W_{ijl} + M > 0,$$
$$\sum_{i=1}^{p}\sum_{j=1}^{c}\Big(\underline{\delta}_{ijl}\, Q_{ij} - \big(\underline{\delta}_{ijl} - \overline{\delta}_{ijl}\big) W_{ijl} + \overline{\delta}_{ijl}\, M\Big) - M < 0,$$
$$Q_{ij} = A_i X + X A_i^{\top} + B_i N_j + N_j^{\top} B_i^{\top}, \qquad G_j = N_j X^{-1},$$
where $\underline{\delta}_{ijl}$ and $\overline{\delta}_{ijl}$ denote the lower and upper bounds of $\tilde{h}_{ij}$ in sub-FOU $l$.
Then, the IT2 fuzzy-model-based (FMB) closed-loop system is collectively consistent and asymptotically stable. These LMIs also guarantee stability in the presence of a mismatch between the membership functions of the model and the controller, and the stability region grows as the FOU is refined (increasing $\tau$).
For the policy π x = u I T 2 x + Δ u θ x , choose a candidate Lyapunov function v ( x ) —in RL, the best choice is the “value/cost function” because with positive costs, it inherently provides a one-step reduction condition and is itself Lipschitz continuous. Then, on a grid X τ and with confidence bands on the dynamics (statistical model), we enforce the reduction condition with high probability [30]:
$$u_n(x, \pi(x)) = v\big(\mu_{n-1}(x, \pi(x))\big) + L_v\, \beta_n\, \sigma_{n-1}(x, \pi(x)),$$
$$u_n(x, \pi(x)) < v(x) - L_{\Delta v}\, \tau, \qquad \forall x \in \mathcal{V}(c) \cap \mathcal{X}_\tau,$$
where $L_{\Delta v} = L_v L_f (L_\pi + 1) + L_v$. If this holds, the level set $\mathcal{V}(c) = \{x \mid v(x) \leq c\}$ is a region of attraction. During training, the policy is updated by solving the following constrained problem:
$$\pi_n = \arg\min_{\pi_\theta \in \Pi_L} \sum_{x \in \mathcal{X}_\tau} \Big[ r(x, \pi_\theta(x)) + \gamma\, J_{\pi_\theta}\big(\mu_{n-1}(x, \pi_\theta(x))\big) \Big] + \lambda\,\big( u_n(x, \pi_\theta(x)) - v(x) + L_{\Delta v}\, \tau \big),$$
where λ is the “Lyapunov safety” penalty coefficient. For safe exploration, we select points from the set as follows:
$$S_n = S_{n-1} \cup \big\{ z \in \mathcal{V}(c_n) \cap (\mathcal{X}_\tau \times \mathcal{U}_\tau) \ \big|\ u_n(z) + L_v L_f\, \| z - z' \|_1 \leq c_n \big\},$$
and apply a backup policy when we reach the boundary.
The Lyapunov-based analysis and Linear Matrix Inequality (LMI) formulations in this work serve to define a stable backup policy that integrates Interval Type-2 Fuzzy control with RL. In practice, this analysis ensures that even if the neural policy becomes unstable, a stable controller—synthesized via IT2 Fuzzy Logic and LMI solutions—will always remain available. Thus, this section is directly linked to Step 1 of the algorithm (Table 2) and should not be regarded as an unrelated supplement.
Theorem 1.
If (i) the IT2-FMB layer LMIs are feasible and  G j  is obtained; (ii) the neural correction  Δ u θ  is trained such that the discrete Lyapunov reduction constraint with confidence band (relation above) holds on the grid X τ ; then:
(1) The closed loop with policy $\pi(x) = u_{\text{IT2}}(x) + \Delta u_\theta(x)$ is stable and "safe" in the level set $\mathcal{V}(c_n)$ with probability $1 - \delta$ (it remains within the region of attraction of the current policy); moreover, $\mathcal{V}(c_n) \subseteq \mathcal{R}_{\pi_n}$.
(2) Exploration according to the rule of maximizing the confidence-interval length in $S_n$ not only stays inside $\mathcal{V}(c_n)$, but also converges to the largest identifiable safe region under the assumptions.
(3) By making  τ  finer and reducing uncertainty ( β n σ n ), the discrete bound becomes tighter and closer to the continuous condition.
Proof of Theorem 1.
The LMIs guarantee that the fuzzy linear-switched dynamics without neural correction have a decreasing Lyapunov function. Then, for the combination with $\Delta u_\theta$, the constraint ensures that the neural effect, accounting for model error via the GP confidence band and the Lipschitz constant of $\pi$, does not disrupt the one-step decrease of $v$ on the grid. Since $v$ is Lipschitz, the continuous-discrete difference is bounded by $\tau$; hence, the level set $\mathcal{V}(c)$ is inward invariant and a region of attraction. As a result, "LMI stability + Lyapunov reduction with high probability" composes modularly, and the claim is proven. □
It should be noted that the LMI solution process is part of the synthesis phase and is carried out offline. The output of these calculations is the gain matrices $G_j$. Therefore, during online execution, the control system does not need to solve optimization problems and only performs simple matrix multiplications, which have a very small computational load ($O(1)$) and create no bottleneck. Consequently, the heavy computational burden of LMI solving does not affect the real-time control loop, allowing for high-frequency operation suitable for embedded control systems.
The pseudo-code of the algorithm is given in Table 2, and the schematic of the entire system is shown in Figure 1.
Table 2 illustrates that Step 1 of the algorithm—solving the LMI and synthesizing the IT2 controller—serves as the critical bridge between theoretical stability analysis and practical algorithm implementation. Subsequent steps, including the definition of the Lyapunov function, uncertainty modeling, and safety constraints, are all inspired by this analysis and are incorporated in a simplified form within the DQN training process. Therefore, Section 2.1 is not only a theoretical complement to the paper but also forms the foundational framework for ensuring algorithmic safety.

Practical Approximation

While Equations (18)–(25) are based on a general framework for safety that uses Gaussian processes to estimate uncertainty, online implementation of the complete GP-based constraints can be computationally costly. Therefore, in our practical implementation for the Cart–Pole system, we use a Linear Quadratic Regulator (LQR) as the operational realization of the backup policy $\pi_B$. The LQR gain is synthesized offline to satisfy the stability conditions locally around the equilibrium point. This approach serves as an efficient computational alternative to a rigorous safety filter, ensuring that when the neural policy crosses the safety threshold, the system is returned to the safe set $S_n$.
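A minimal sketch of this offline LQR backup synthesis is given below. The cart-pole physical parameters and the simplified linearization of the equations of motion are assumptions, while the Q and R weightings and the unsafe-region thresholds follow the values reported in the Results section.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Assumed physical parameters of the cart-pole model (not given in the paper).
m_c, m_p, L_pole, g = 1.0, 0.1, 0.5, 9.81
L_eff = L_pole * (4.0 / 3.0 - m_p / (m_c + m_p))

# Linearization about the upright equilibrium, state x = [x, x_dot, theta, theta_dot], input u = F.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, -m_p * g / m_c, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, g / L_eff, 0.0]])
B = np.array([[0.0],
              [1.0 / m_c],
              [0.0],
              [-1.0 / ((m_c + m_p) * L_eff)]])

Q = np.diag([0.01, 0.01, 1.0, 0.1])   # state weighting (Results section)
R = np.array([[0.0001]])              # control weighting (Results section)

# Offline synthesis: the Riccati solution P also serves as the Lyapunov matrix V(x) = x' P x.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)       # LQR gain, backup control u = -K x

def backup_force(state, u_policy):
    """Return the LQR backup force inside the unsafe region, otherwise the
    policy's own command (thresholding to a discrete action is omitted here)."""
    _, _, theta, theta_dot = state
    if abs(theta) > 0.12 or abs(theta_dot) > 0.8:
        return float(-K @ np.asarray(state, dtype=float))
    return u_policy
```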

3. Results

For benchmarking, we ran and compared several reinforcement learning algorithms: Standard DQN, Double DQN, Dueling Double DQN, PPO, C51, QR-DQN, and PER. The safe hybrid is our method, compared against these baselines. The Proximal Policy Optimization (PPO) algorithm employs the clipped surrogate objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\,\hat{A}_t\big)\big]$$
in which $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio and the advantage is defined as
$$\hat{A}_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} - V(s_t)$$
We consider all approaches in terms of average reward:
$$R_{\text{avg}} = \frac{1}{T}\sum_{t=1}^{T} r_t$$
and success rate:
$$SR = \frac{N_{\text{episodes reaching } 500 \text{ steps}}}{N_{\text{total episodes}}}$$
Our implementation makes practical compromises to the theoretical framework. The full LMI analysis proved computationally prohibitive during training, so we developed a simplified approach that preserves the core stability properties.
We used the following practical parameters:
- Linearization around the equilibrium point ($\theta = 0$, $\dot{\theta} = 0$, $x = 0$, $\dot{x} = 0$);
- State weighting: $Q = \mathrm{diag}([0.01,\ 0.01,\ 1.0,\ 0.1])$;
- Control weighting: $R = [0.0001]$;
- Conservative uncertainty bounds: $\delta_\theta = 0.05$ rad, $\delta_{\dot{\theta}} = 0.02$ rad/s;
- Solver: CVX with SDPT3;
- FOU division: $\tau = 3$ subspaces.
The resulting optimal LQR gain matrix was $K_{\text{opt}} = [0.1000,\ 0.4583,\ 1.9925,\ 0.3549]$.
The LMI feasibility was confirmed with a maximum eigenvalue of 0.0732 for the closed-loop system, ensuring asymptotic stability. In our implementation, we use this LQR controller as a safety backup when the system enters unstable regions ($|\theta| > 0.12$ rad or $|\dot{\theta}| > 0.8$ rad/s).
The Lyapunov function candidate is $V(x) = x^{T} P x$, where $P$ is the solution of the continuous algebraic Riccati equation. The Lipschitz constants were empirically estimated as follows:
- $L_\pi = 2.1$ (policy network);
- $L_f = 4.3$ (system dynamics);
- $L_v = 1.8$ (value function).
These constants were used to verify the Lyapunov reduction condition (23) on a grid of 1000 points within the region of attraction, with a confidence level of 95%.
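A sketch of this grid-based check is shown below; the one-step prediction function `f_step`, the grid construction, the value of `tau`, and the omission of the confidence-band inflation are assumptions, while the Lipschitz constants follow the estimates above.

```python
import numpy as np

def lyapunov_check(P, f_step, policy, grid,
                   L_v=1.8, L_f=4.3, L_pi=2.1, tau=0.05):
    """Fraction of grid points satisfying the one-step Lyapunov decrease
    v(x_next) < v(x) - L_dv * tau, with V(x) = x' P x (confidence band omitted)."""
    L_dv = L_v * L_f * (L_pi + 1.0) + L_v      # L_dv = L_v L_f (L_pi + 1) + L_v
    ok = 0
    for x in grid:                             # e.g. 1000 states inside the level set
        x = np.asarray(x, dtype=float)
        v_now = x @ P @ x
        x_next = np.asarray(f_step(x, policy(x)), dtype=float)
        v_next = x_next @ P @ x_next
        ok += int(v_next < v_now - L_dv * tau)
    return ok / len(grid)
```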
The plot of training rewards (Figure 2) makes it clear that there is an enormous difference between the methods. The safe-hybrid strategy reaches stable high rewards of 650–700 in only 20 episodes, and the performance remains stable with very small variance. PPO and QR-DQN, meanwhile, completely fail on this task, with steadily negative rewards (−115.4 and −112.8, respectively) and a 0% success rate over all episodes.
Standard DQN, Double DQN, and Dueling Double DQN are somewhat more erratic, eventually reaching rewards of about 587, 576, and 562, albeit only after 112, 102, and 108 episodes, respectively. C51 and PER also attain high performance (rewards of 593 and 578) but only stabilize after 95 and 104 episodes.
The success-rate plot (Figure 3) is even more persuasive. Our approach achieves 100% success by episode 20 and never drops below that, whereas the best baseline (C51) does not stabilize at 96% until episode 95. The complete failure of PPO and QR-DQN (0% success) underscores a significant limitation of these two approaches on this control task.
The reduced variability of our reward curve compared with all baselines indicates strong learning stability. This matters because it implies the system avoids the false convergence and performance degradation that other solutions may experience. The fast transition to reliable performance (21 episodes vs. 95–112 for the baselines) corresponds to a roughly 4.5-fold increase in learning efficiency.
All of this demonstrates that combining Interval Type-2 Fuzzy Logic with Dueling Double DQN produces a qualitatively different learning behavior from the traditional methods, allowing the agent to stabilize quickly at high performance rather than following the slow, jagged progress of the other methods.
Figure 4 examines the training loss of all compared algorithms. The findings reveal significant variation in learning stability between techniques. The safe-hybrid process exhibits a steady, roughly logarithmic decline in loss from about 10 to 0.5, remaining at a stable low loss beyond 5000 training steps. This consistency shows that the technique learns steadily without massive fluctuations. PPO has by far the highest training loss, ranging between roughly 600 and 4000 on a linear scale (10^2.8 to 10^3.6 on a log scale). The extreme fluctuations in PPO's loss curve point to problems in the training phase, which is why its success rate is zero. Standard DQN, Double DQN, and Dueling Double DQN exhibit relatively high loss values that converge to the 10^2–10^3 range, although with considerable noise during training. C51 maintains a moderate loss (≈2–3) yet fluctuates. PER dips and then rises, whereas QR-DQN shows an undesirable pattern with a significant increase in loss over time.
The overall point is not the absolute loss values but the consistency of learning. Other approaches, such as Double DQN, may eventually reach a moderate loss, but they are highly erratic during training. The steady, gradual loss reduction of the safe-hybrid approach demonstrates high stability, which directly correlates with its high reward and success-rate measures. Stability is a paramount requirement in control applications where unreliable learning may lead to unsafe behavior. The steady reduction in loss without large backsliding is a key strength of the hybrid method compared to other algorithms, particularly in nonlinear, unstable systems where consistent learning is vital.
Figure 5 indicates the proportion of the different action types during training of the safe-hybrid method. The graph presents a trend that is not expected at first glance. Random exploration accounts for approximately 25% of actions in the first 5–10 episodes, giving the system a decent view of the state space. The key point, however, is that once this quick exploration phase is over, the fuzzy system becomes the dominant decision-maker. The fuzzy component was not expected to take over this early. Rather than steadily relinquishing the reins to the DQN as training progresses, fuzzy actions continue to anchor the decisions, stabilizing at around 95% of all actions after only 10 episodes. The DQN share remains small, less than 5% throughout. This trend explains why the hybrid approach progresses so rapidly and consistently. The fuzzy system, with its built-in understanding of Cart–Pole dynamics, provides immediate stability by episode 10, which is why the success rate suddenly rises to 100%, as evident in the success-rate chart. The DQN appears to predominantly optimize choices within the stable regime established by the fuzzy component.
The fact that the safety controller is used very rarely (almost invisible in the plot) shows that the fuzzy component keeps the system safe enough that intervention is seldom required. The action-type distribution is consistent with the excellent learning stability in both reward and success rate, and confirms that the fuzzy system is not a transient aid, but the backbone of the learning process.
The results of the ablation study are reported in Table 3. They show that the combined safe-hybrid method outperforms each individual component.
Figure 6 is, in essence, the final comparison of all algorithms, displaying the mean reward and the success rate. The safe-hybrid approach scores an average reward of 645.2 with a standard deviation of 13.5 and reaches a 100.0% success rate. This is in stark contrast with PPO and QR-DQN, which achieve negative average rewards of −112.0 ± 4.1 and −111.4 ± 4.8, respectively, and a 0% success rate across all episodes. More advanced approaches, such as C51 (average reward 503.3 ± 192.6, 92.0% success rate) and Double DQN (417.5 ± 264.1, 80.0% success rate), perform well but still fall below the safe hybrid on both measures. The low standard deviation (13.5) of the safe hybrid's reward also shows that it is remarkably consistent across runs, unlike the other, high-variance methods.
The most striking point is that approaches such as PPO and QR-DQN completely fail at this seemingly easy task, achieving zero success in every attempt. This underscores the critical role of stability mechanisms in RL; even elaborate algorithms do not work without them. Our method's success rate was 100%, which exceeds the base DQN at 98.0%. Although the base DQN achieves a high success rate, it exhibits significantly higher variance and slower convergence. The convergence time is reduced to approximately one-fifth of the best baseline (21 vs. 95 episodes), a 4.5-fold speed-up. The average reward increased by 309%, from the base DQN's 157.5 to the safe hybrid's 645.2. In summary, the proposed method achieves robust performance and maintains it consistently.
Baseline DQN and similar algorithms are erratic, with sharp fluctuations. The safe-hybrid method, on the other hand, demonstrates high stability and low variance. Its standard deviation of 13.5 is 2–3 times smaller than that of any baseline, indicating much greater stability. This consistency is particularly valuable in control applications where reliability matters more than occasional peak performance.
Table 4 shows the success rate in reaching the maximum step count (500 steps, meaning full stability during the episode), along with the average convergence episode and the average reward. It is worth noting that, to guarantee a fair and accurate comparison, all baseline algorithms, including Standard DQN, Double DQN, PPO, C51, and PER, were trained using the same adaptive reward shaping function described in Equation (14). This isolates the contribution of the hybrid architecture and shows that the observed superior performance is not simply the result of reward engineering, but of the synergistic integration of the IT2 Fuzzy supervisor and the Dueling Double DQN learner.
All experiments used CartPole-v1 from Gymnasium with identical termination conditions across methods:
- Pole angle > 0.418 rad, OR
- Cart position > 2.4 m, OR
- Episode length > 500 steps.
Success was strictly defined as reaching the maximum episode duration of 500 steps (95% of maximum). We used the same five random seeds for every method {42, 123, 456, 789, 999} and performed final evaluation with no exploration (ε = 0) over 100 episodes.
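The evaluation protocol can be sketched as follows; `greedy_action` stands in for the trained hybrid controller, and splitting the 100 evaluation episodes evenly across the five seeds is our assumption.

```python
import gymnasium as gym
import numpy as np

SEEDS = [42, 123, 456, 789, 999]

def evaluate(greedy_action, episodes=100):
    """Greedy evaluation (epsilon = 0) on CartPole-v1; success = 500 steps."""
    rewards, successes = [], 0
    for seed in SEEDS:
        env = gym.make("CartPole-v1")
        for ep in range(episodes // len(SEEDS)):
            obs, _ = env.reset(seed=seed + ep)
            total, steps, done = 0.0, 0, False
            while not done:
                obs, r, terminated, truncated, _ = env.step(greedy_action(obs))
                total, steps = total + r, steps + 1
                done = terminated or truncated
            rewards.append(total)
            successes += int(steps >= 500)
        env.close()
    return np.mean(rewards), np.std(rewards), successes / len(rewards)
```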
To guarantee the validity and repeatability of the results, all functional tests were performed with five random seeds. The average reward and the final success rate are reported along with the standard deviation ($\pm\sigma$). In Figure 7 (average reward) and Figure 8 (success rate), shaded areas show the standard deviation interval. These charts clearly show that the Hybrid IT2F-D3QN method not only converges faster on average but also attains the lowest standard deviation in the final convergence region. This strongly suggests that the stability and high performance of our method do not depend on the initial random settings, but are statistically consistent and robust.
The hyperparameter search was exhaustive but practical: for our method, we tested five learning rates {1 × 10−5, 5 × 10−5, 1 × 10−4, 5 × 10−4, 1 × 10−3}, four batch sizes {32, 64, 128, 256}, etc. For the baselines, we used the established defaults from their original papers and then tuned within reasonable ranges. All hyperparameter details are reported in Table 5.
The neural network architecture for the Dueling Double DQN is reported as follows, for reproducibility:
  • Input layer: four neurons, the system states (position, velocity, angle, and angular velocity).
  • Two feature extraction layers: fully connected layers of 256 neurons each with ReLU activation.
  • Independent sections (stream split):
    Value stream: FC(128) → ReLU → FC(1) (output value).
    Advantage stream: FC(128) → ReLU → FC(Actions × Atoms) (output advantage).
  • AdamW optimizer with a learning rate of 5 × 10−4 and the Huber loss, used within the C51 distributional framework.
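A minimal PyTorch sketch of this architecture is shown below; the scalar-Q variant is given, whereas the C51 variant would emit Actions × Atoms logits in the advantage head.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """4 -> 256 -> 256 feature trunk with 128-unit value and advantage streams."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.value = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                       nn.Linear(128, n_actions))

    def forward(self, s):
        h = self.features(s)
        v = self.value(h)                              # V(s)
        a = self.advantage(h)                          # A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)     # dueling aggregation

# optimizer = torch.optim.AdamW(DuelingDQN().parameters(), lr=5e-4)
```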
For statistical comparison, we used one-way ANOVA with post hoc Tukey HSD test. The analysis was performed on the final 50 episodes of training across 5 random seeds. The ANOVA results were as follows:
F(7, 392) = 186.42, p < 0.0001
The post hoc Tukey HSD test showed significant differences (p < 0.05) between our method and all baseline methods.
Also, to evaluate the stability and consistency of performance, the homogeneity of variances was examined using Levene’s test. The null hypothesis in this test is the equality of the variances of rewards among different groups. The test result, with a significance level of p < 0.05 , rejected the null hypothesis and showed that the variance of the reward distribution in the proposed method is significantly lower than that of other methods. This finding confirms the higher stability and reliability of the performance of the proposed algorithm in different iterations of the experiment. The sum of these analyses provides strong statistical evidence indicating the superiority of both efficiency (higher mean) and stability (lower variance) of the proposed method.
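A sketch of how these tests can be run with SciPy and statsmodels is given below; the shape of `rewards_by_method` (the final 50 episodes across 5 seeds per algorithm) is assumed from the text.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_methods(rewards_by_method):
    """One-way ANOVA, Levene's test, and post hoc Tukey HSD across algorithms."""
    names = list(rewards_by_method.keys())
    groups = [np.asarray(rewards_by_method[n], dtype=float) for n in names]
    f_stat, p_anova = stats.f_oneway(*groups)            # one-way ANOVA
    lev_stat, p_levene = stats.levene(*groups)           # homogeneity of variances
    values = np.concatenate(groups)
    labels = np.concatenate([[n] * len(g) for n, g in zip(names, groups)])
    tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)  # pairwise comparisons
    return f_stat, p_anova, p_levene, tukey.summary()
```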
To understand what our agent was actually focusing on, we implemented a straightforward gradient-based attribution method. For each state variable, the calculation below is carried out:
$$\phi_i = \mathbb{E}\!\left[\frac{\partial Q(s, a)}{\partial s_i}\right], \qquad \phi_i^{\text{rel}} = \frac{\phi_i}{\sum_{j=1}^{4} \phi_j},$$
allowing the Q-function to be approximated as a weighted linear expansion.
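A minimal sketch of this attribution computation is given below; taking absolute gradients before normalizing and evaluating at the greedy action are our assumptions.

```python
import torch

def feature_attribution(q_net, states):
    """Average gradient of the greedy action's Q-value w.r.t. each state variable."""
    states = states.detach().clone().requires_grad_(True)   # [N, 4]
    q = q_net(states)                                        # [N, n_actions]
    q_best = q.gather(1, q.argmax(dim=1, keepdim=True)).sum()
    q_best.backward()
    phi = states.grad.abs().mean(dim=0)                      # E[|dQ/ds_i|] per feature
    return phi / phi.sum()                                   # relative importance phi_rel
```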
To examine the impact of each feature on the agent's decision-making, the gradient-based attribution framework was used. Figure 9 shows that pole angle ranks highest (0.42), followed by angular velocity (0.37), with cart position (x) and velocity (dx/dt) last (0.12 and 0.09). This matches our reward shaping concept of prioritizing angle stability.
The dependence plot (Figure 10) indicates that the agent learned an intelligent policy that responds both to the current state and to its rate of change. When θ < 0 (the pole tilts left), the gradient-based attribution is positive, i.e., push right. At θ > 0, the attribution is negative, correctly indicating that a rightward action would be counterproductive. The color map shows that the angular velocity dθ/dt reinforces this relationship: the faster the pole is rotating to the left, the greater the corrective action. We are essentially observing the agent act like a proportional-derivative controller.
All this discussion makes it clear that a combination of fuzzy knowledge and deep RL provides us with a consistent, intelligible policy that reflects traditional control theory. The predictability and reliability of the training plots, as well as explainable decisions, contribute significantly to practical RL in control applications.
The presented gradient-based attribution dependence plot in Figure 10 depicts the relationship between the input state variables and the model’s output (the Q-value for the “move right” action) with high resolution. Its validity is further corroborated through fundamental principles of control theory. The horizontal axis represents the change in the state variable θ (effectively, the proportional error), while the vertical axis shows the contribution of this feature to the decision-making process. A clear inverse relationship is observed between θ and its dependence value for the “move right” action:
* For θ < 0 (Deviation to the Left): The gradient-based attribution value is positive. This correctly indicates a sound control action, as applying a force to the right is the appropriate action to restore the system to equilibrium (θ = 0).
* For θ > 0 (Deviation to the Right): The gradient-based attribution value is negative. This is also logically consistent, as a deviation to the right makes the "move right" action undesirable, steering the agent toward the opposite action ("move left").
The primary value of this analysis lies in its visualization of interaction effects, represented by the color of the points (the variable dθ/dt). This feature plays a role similar to the derivative term in a controller, granting the system predictability and responsiveness to dynamics. The effect is observed to be amplified in critical states. For instance, when the pole is deviated to the left (θ < 0) and is simultaneously rotating further leftward (dθ/dt < 0, blue points), the gradient-based attribution value reaches its highest positive values. This signifies that the agent is reacting not only to the current state but also to its rate of change to prevent an imminent failure. Conversely, when the pole is deviated to the left (θ < 0) but is already swinging back toward the center (dθ/dt > 0, red points), the gradient-based attribution values are lower. This indicates the agent has recognized that the system is self-correcting, thereby requiring a less forceful intervention.
This plot clearly demonstrates that the DQN agent has learned a sophisticated, nonlinear control policy that extends beyond a simple linear mapping. The policy is sensitive to both the system’s instantaneous error (a proportional-like term) and its dynamics and trend of change (a derivative-like term).
Figure 11 shows the trend of changes in the two parameters, exploration (ε) and fuzzy influence (α), during training. The value of ε, which represents the probability of choosing an action randomly, is set to 1 at the beginning of training so that the agent acts completely in an exploratory manner. As time passes and the number of episodes increases, this value gradually decreases and finally reaches a minimum value of 0.05. This process causes the agent to gradually move from exploration to exploitation and rely more and more on the knowledge gained from its experiences. On the other hand, the parameter α shows the intensity of the influence of Interval Type-2 Fuzzy Logic in shaping the reward and guiding the agent. This value is initially set to 0.5 and then gradually decreases. Such a design causes fuzzy knowledge to play a greater role in guiding the agent in the early stages of training, but as the learning process progresses, its role decreases, and the model relies more on the optimal policy extracted by the network itself.
The adaptive reward function (Equation (14)) requires a concise sensitivity analysis to ensure stable convergence. We tested three key scenarios:
  • Angle Dominant ($w_\theta = 15.0$, $w_{\dot{\theta}} = 0.5$, $w_x = 0.1$): extreme focus on angle.
  • Balanced (Our Selection) ($w_\theta = 10.0$, $w_{\dot{\theta}} = 1.0$, $w_x = 0.5$): the suggested settings.
  • Velocity Dominant ($w_\theta = 5.0$, $w_{\dot{\theta}} = 5.0$, $w_x = 0.5$): strong penalty for angular velocity.
As seen in Figure 12, the Balanced (Our Selection) settings offer significantly faster convergence and the highest final reward. In contrast, the Angle Dominant scenario performs well initially but, because it ignores the velocity penalty, undergoes oscillation and eventually earns a lower reward. The Velocity Dominant scenario pushes the agent into very slow movements due to excessive penalties, which keeps it from reaching the success zone (maximum steps). These results confirm that the selected coefficients strike an optimal balance between angular stability and velocity damping.
The choice of the Dueling Double DQN (D3QN) architecture is justified by its reduction of the overestimation bias in Q-value estimation that is common in Standard DQN. To validate this selection, we compared the average maximum Q-value curve of both methods (Standard DQN and Hybrid IT2F-D3QN) during training. As seen in Figure 13, while Standard DQN values are occasionally unrealistically high and fluctuate strongly, D3QN keeps the Q-values within reasonable and stable intervals, thanks to the separation of action selection in the policy network and action evaluation in the target network. The stability of the Q-values guaranteed by D3QN, in combination with the fuzzy factor, prevents divergence in the middle phases of training. The shorter curve for Standard DQN confirms that it fails significantly earlier in its episodes compared to the Hybrid IT2F-D3QN; this visual difference in x-axis length serves as further evidence of the hybrid controller's superior stability.
The computational overhead added by the IT2FL system at the inference stage is negligible, around 0.4 ms, arising from the computation of the membership functions and FOU sampling. Although the inference time is slightly higher, this increase is fully justified by the roughly four-fold increase in convergence speed (convergence in 20 episodes versus more than 100 episodes for Standard DQN). In practice, the proposed framework achieves superior overall training efficiency because of its ultra-fast convergence rate. The computational complexity analysis is reported in Table 6.
The findings of this study clarify several key points:
(1) Reward shaping alone improves the convergence speed but is unable to guarantee stability.
(2) Double DQN has been able to reduce the problem of overestimation of Q-values and plays a fundamental role in the stability of the policy.
(3) Interval Type-2 Fuzzy Logic has played an important role in managing uncertainties and has reduced the volatility of rewards.
(4) The combination of these three components, along with the use of Explainability Tools (gradient-based attribution), has resulted in the proposed system having significant advantages over the reference methods, both in terms of performance and behavioral transparency.
These achievements are of particular importance for practical applications, especially in real robotic environments where stability and transparency in decision-making are essential requirements.

Discussion and Future Research

This architecture demonstrates how equipping the learner with a collection of expert rules, here via Interval Type-2 Fuzzy Logic, can provide the early-stage stability that allows data-driven learning to proceed effectively. The success-rate curve shows how we achieve that reliability at the beginning and maintain it throughout training. The failure of PPO and QR-DQN on this benchmark highlights an important reality: even highly sophisticated algorithms require appropriate stability mechanisms to be effective, even on simple control problems. Fuzzy logic coupled with deep RL is one such mechanism our framework provides.
We see a few paths forward:
(1) Practical Implementation Focus: Rather than focusing solely on theoretical propositions, we should build consistent, reliable systems that perform well from the early stages of learning, as our quickly converging approach does.
(2) Domain Adaptation: The proposed method should be applied to more challenging environments, such as real robots, to see how the early-stability advantage can be maintained when failures are expensive.
(3) Automated Knowledge Integration: If fuzzy rules can be extracted from example data, the engineering effort could be reduced while retaining the stability advantage.
(4) Failure Analysis: Understanding why algorithms such as PPO and QR-DQN fail in this case can reveal their weaknesses and guide improvements.
The key lesson is that reliability and speed may outweigh a slightly better final score in most practical settings. The training curves demonstrate that rapidly attained, repeatable high performance is worth more than waiting for a slightly higher payoff after a long run, which is arguably a new lens through which to judge RL algorithms when implementation truly matters.
Although Cart–Pole provides a convenient testbed to illustrate our arguments, we must continue benchmarking against more complex robotic systems to ensure that the benefits of quick convergence and stability extend to higher dimensions. Scalability is a common challenge in fuzzy-based systems, due to the "curse of dimensionality" or "rule explosion". However, our hybrid architecture is designed to mitigate this in higher-dimensional systems (e.g., multi-link robots) through two mechanisms:
  • Decoupled Safety Layer: In a scaled-up scenario, the Fuzzy-LMI layer is not applied to control the entire state space. Instead, it acts as a safety supervisor that operates only within critical stability constraints (e.g., center-of-mass balance or joint limits), while the deep RL agent (Dueling Double DQN) performs high-dimensional path planning and complex task objectives.
  • Parallel Distributed Compensation: The T-S Fuzzy Model used in our LMI formulation allows the approximation of complex nonlinear systems using local linear models. For larger systems, the state space can be broken down into subsystems, or hierarchical fuzzy systems can be used to keep the rule base controllable. While Cart–Pole acts as a benchmark to prove theoretical sustainability guarantees, the separation of “stabilizing fuzzy layer” and “performance deep RL layer” makes the architecture inherently modular and scalable to more complex dynamics.

4. Conclusions

In this paper, we examined a hybrid learning-based control system for the Cart–Pole problem, combining Interval Type-2 Fuzzy Logic, a Double DQN algorithm, and adaptive reward shaping. This combination is experimentally observed to converge more quickly and behave more stably than the standard baselines, as seen in the reward and success-rate plots. Examining the reward curves, the default DQN fluctuates, with negative rewards and occasional highs around 600, and does not sustain strong performance. The proposed safe-hybrid method, by contrast, delivers a steady stream of high rewards of 650 to 700 within only 20 episodes and maintains that level throughout training. The payoff is learning stability, not merely a marginally higher final reward. The success-rate statistics are better still: the safe hybrid reaches a 100% success rate (i.e., the maximum possible number of steps) within 20 episodes and remains there. DQN does not reach that consistency until around episode 112, and C51 takes 95 episodes to reach the same point. The most striking cases, PPO and QR-DQN, fail entirely, with approximately 0% success rates and negative rewards across the board, emphasizing the importance of stability mechanisms in these models. In terms of learning efficiency, stable high performance sets in at episode 21 in our proposed method, versus episode 95 for C51, the best of the baselines: a 4.5-fold speed-up. The success-rate curve illustrates that by episode 20, the safe hybrid has locked in its metrics, while the others continue to fluctuate beyond episode 200. The reward plots also show that our reward fluctuates less than the other methods and has narrower confidence bounds, indicating a more stable and reliable policy. Gradient-based feature attribution indicates that the pole angle (0.42) and angular velocity (0.37) are the strongest drivers of decision-making, which is exactly what Cart–Pole physics would suggest and why our fuzzy logic works.
The general scheme presented here provides a realistic way to combine symbolic knowledge with data-driven learning. The modular design allows it to be customized to various control problems by (1) adapting the fuzzy rules with domain experts, (2) adapting the state representation to new tasks, and (3) redefining the reward shaping to match domain objectives. The solution addresses the central problem of achieving solid performance within a short time frame under real-world conditions.
The main conclusions of this study are as follows:
1. An experimental combination of fuzzy logic with deep RL that locks onto consistent performance early, demonstrated by the 100% success rate within 20 episodes on the success-rate plot;
2. Empirical evidence that prominent algorithms such as PPO and QR-DQN fail to learn this seemingly simple task unless equipped with appropriate stability mechanisms, revealing a gap in current RL practice;
3. An explicit structure balancing theory and practice, making it possible to trace the reasoning of RL in control cases.
In summary, the findings confirm that our new framework met the objectives:
1. Reducing the learning convergence time by a factor of 4–5 compared with the best baselines;
2. Achieving a 100% success rate with minimal variance, unlike all other methods tested;
3. The ability to ensure robustness, preventing the deep-learning failures exhibited in PPO and QR-DQN;
4. Establishing a good foundation to know what motivates agent performance through feature attribution.
While this study focused on the standard Cart–Pole environment, the proposed Hybrid IT2F-D3QN framework has high generalization capability. The two-layer structure (Fuzzy Supervisor + RL Learner) allows:
1. Generalization to higher-dimensional environments: The IT2FL layer can be easily extended to manage uncertainty in a greater number of state variables, such as Acrobot or three-dimensional Pendulum, without dramatically increasing the computational complexity of RL training.
2. Generalization to continuous control: The D3QN section, which is a value-based method algorithm, can be directly replaced by Policy-Based algorithms such as DDPG and SAC. In these scenarios, the fuzzy system can be used as an Action Guider or Action Space Masker to limit and secure early exploration in continuous space, which is critical for more complex applications with continuous action spaces such as driving tasks or robot manipulators. This architecture is the basis for future research in MuJoCo environments.
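As a rough sketch of the action-guiding idea in point 2, and only under illustrative assumptions (a scalar continuous force, a simple convex blend, and a clipping bound chosen by us rather than taken from the paper), the fuzzy suggestion could be mixed with the output of a continuous-action policy such as SAC as follows:

import numpy as np

def fuzzy_action_guide(rl_action, fuzzy_action, alpha, max_force=10.0):
    """Blend the RL policy's continuous action with the fuzzy supervisor's suggestion.

    rl_action    : action proposed by the policy network (e.g., SAC)
    fuzzy_action : force suggested by the IT2 fuzzy rule base for the current state
    alpha        : fuzzy influence in [0, 1], annealed toward 0 as training proceeds
    """
    blended = (1.0 - alpha) * rl_action + alpha * fuzzy_action
    return float(np.clip(blended, -max_force, max_force))

Annealing alpha toward zero would hand control over to the learned policy once early exploration is no longer risky, mirroring the adaptive fuzzy-influence schedule used in the discrete-action case.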

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, and visualization, H.M.K., J.R. and J.K.; supervision and project administration, J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received external funding from Universidad del Bío-Bío and the 2030 Engineering Faculty Project ING2430001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and Python code supporting this study are available from the corresponding author upon request through the journal.

Acknowledgments

During the preparation of this manuscript, the authors used Google Colab for the purposes of Python code execution. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1. [Google Scholar]
  2. Busoniu, L.; Babuska, R.; De Schutter, B.; Ernst, D. Reinforcement Learning and Dynamic Programming Using Function Approximators; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar]
  3. Wiering, M.A.; Van Otterlo, M. Reinforcement learning. Adapt. Learn. Optim. 2012, 12, 729. [Google Scholar]
  4. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  5. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  6. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  7. Dulac-Arnold, G.; Levine, N.; Mankowitz, D.J.; Li, J.; Paduraru, C.; Gowal, S.; Hester, T. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Mach. Learn. 2021, 110, 2419–2468. [Google Scholar] [CrossRef]
  8. Berenji, H.R. A reinforcement learning—Based architecture for fuzzy logic control. Int. J. Approx. Reason. 1992, 6, 267–292. [Google Scholar] [CrossRef]
  9. Er, M.J.; Deng, C. Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2004, 34, 1478–1489. [Google Scholar] [CrossRef]
  10. Jamshidi, P.; Sharifloo, A.M.; Pahl, C.; Metzger, A.; Estrada, G. Self-learning cloud controllers: Fuzzy q-learning for knowledge evolution. In Proceedings of the 2015 International Conference on Cloud and Autonomic Computing, Boston, MA, USA, 21–25 September 2015; pp. 208–211. [Google Scholar]
  11. Glorennec, P.Y.; Jouffe, L. Fuzzy Q-learning. In Proceedings of the 6th International Fuzzy Systems Conference, Barcelona, Spain, 1–5 July 1997; pp. 659–662. [Google Scholar]
  12. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  13. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  14. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  15. Bellemare, M.G.; Dabney, W.; Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 449–458. [Google Scholar]
  16. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  17. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
  18. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML), Bled, Slovenia, 27–30 June 1999; pp. 278–287. [Google Scholar]
  19. Ibrahim, S.; Mostafa, M.; Jnadi, A.; Salloum, H.; Osinenko, P. Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications. IEEE Access 2024, 12, 175473–175500. [Google Scholar] [CrossRef]
  20. Bi, Y.; Ding, Q.; Du, Y.; Liu, D.; Ren, S. Intelligent traffic control decision-making based on type-2 fuzzy and reinforcement learning. Electronics 2024, 13, 3894. [Google Scholar] [CrossRef]
  21. Tan, W.W.; Chua, T.W. Uncertain rule-based fuzzy logic systems: Introduction and new directions (Mendel, J.M.; 2001) [Book Review]. IEEE Comput. Intell. Mag. 2007, 2, 72–73. [Google Scholar] [CrossRef]
  22. Wu, D.; Mendel, J.M. Uncertainty measures for interval type-2 fuzzy sets. Inf. Sci. 2007, 177, 5378–5393. [Google Scholar] [CrossRef]
  23. Hsu, C.-H.; Juang, C.-F. Self-Organizing Interval Type-2 Fuzzy Q-learning for reinforcement fuzzy control. In Proceedings of the 2011 IEEE International Conference on Systems, Man, and Cybernetics, Anchorage, AK, USA, 9–12 October 2011; pp. 2033–2038. [Google Scholar]
  24. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. A brief survey of deep reinforcement learning. arXiv 2017, arXiv:1708.05866. [Google Scholar] [CrossRef]
  25. Rostami, S.M.R.; Al-Shibaany, Z.; Kay, P.; Karimi, H.R. Deep reinforcement learning and fuzzy logic controller codesign for energy management of hydrogen fuel cell powered electric vehicles. Sci. Rep. 2024, 14, 30917. [Google Scholar] [CrossRef]
  26. Kumar, N.; Rahman, S.S.; Dhakad, N. Fuzzy inference enabled deep reinforcement learning-based traffic light control for intelligent transportation system. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4919–4928. [Google Scholar] [CrossRef]
  27. Tunc, I.; Soylemez, M.T. Fuzzy logic and deep Q learning based control for traffic lights. Alex. Eng. J. 2023, 67, 343–359. [Google Scholar] [CrossRef]
  28. Chen, C.-H.; Jeng, S.-Y.; Lin, C.-J. Mobile robot wall-following control using fuzzy logic controller with improved differential search and reinforcement learning. Mathematics 2020, 8, 1254. [Google Scholar] [CrossRef]
  29. Juang, C.-F.; You, Z.-B. Reinforcement learning of an interpretable fuzzy system through a neural fuzzy actor-critic framework for mobile robot control. IEEE Trans. Fuzzy Syst. 2024, 32, 3655–3668. [Google Scholar] [CrossRef]
  30. Jang, J.-S. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685. [Google Scholar] [CrossRef]
  31. Kosko, B. Fuzzy cognitive maps. Int. J. Man-Mach. Stud. 1986, 24, 65–75. [Google Scholar] [CrossRef]
  32. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353. [Google Scholar] [CrossRef]
  33. Levine, S.; Koltun, V. Guided policy search. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1–9. [Google Scholar]
  34. Han, J. From PID to active disturbance rejection control. IEEE Trans. Ind. Electron. 2009, 56, 900–906. [Google Scholar] [CrossRef]
  35. Chen, C.P.; Liu, Y.-J.; Wen, G.-X. Fuzzy neural network-based adaptive control for a class of uncertain nonlinear stochastic systems. IEEE Trans. Cybern. 2013, 44, 583–593. [Google Scholar] [CrossRef] [PubMed]
  36. Chen, W.-H.; Yang, J.; Guo, L.; Li, S. Disturbance-observer-based control and related methods—An overview. IEEE Trans. Ind. Electron. 2015, 63, 1083–1095. [Google Scholar] [CrossRef]
  37. Nie, M.; Tan, W.W. Towards an efficient type-reduction method for interval type-2 fuzzy logic systems. In Proceedings of the 2008 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; pp. 1425–1432. [Google Scholar]
Figure 1. Schematic of the whole system under the hybrid Interval Type-2 Fuzzy framework.
Figure 2. Average reward during episodes.
Figure 3. Success rate for all methods.
Figure 4. Comparison of training error.
Figure 5. Distribution of action types during episodes.
Figure 6. Cumulative performance.
Figure 7. Final rewards shaded with different seeds.
Figure 8. Final success shaded with different seeds.
Figure 9. Relative importance of state variables based on gradient-based attribution.
Figure 10. Dependence plot for the "push right" action.
Figure 11. Exploration (ε) and fuzzy influence (α).
Figure 12. Reward function sensitivity analysis.
Figure 13. Overestimation bias mitigation experiment.
Table 1. Interval Type-2 Fuzzy rule base.

Rule No. | θ (Angle) | θ̇ (Angular Velocity) | Output (Action)
1 | Left | Slow | Push Right
2 | Left | Medium | Push Right
3 | Left | Fast | Push Right (strong)
4 | Center | Slow | 0 (no force)
5 | Center | Medium | If θ̇ > 0 → Push Left, else Push Right
6 | Center | Fast | If θ̇ > 0 → Push Left (strong), else Push Right (strong)
7 | Right | Slow | Push Left
8 | Right | Medium | Push Left
9 | Right | Fast | Push Left (strong)
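To make the rule base concrete, the sketch below evaluates Table 1 with interval type-2 triangular memberships and a simple averaging of the lower and upper firing strengths. The membership breakpoints, the footprint-of-uncertainty width, and the numeric force levels are illustrative assumptions, not the parameters used in the paper.

import numpy as np

def it2_trimf(x, abc, fou=0.1):
    """Interval type-2 triangular membership: returns (lower, upper) grades."""
    a, b, c = abc
    u = np.clip(min((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0, 1.0)
    return max(u - fou, 0.0), min(u + fou, 1.0)

def fuzzy_force(theta, theta_dot):
    """Evaluate the Table 1 rule base and return a crisp force suggestion in [-2, 2]."""
    # antecedent sets (breakpoints in rad and rad/s are assumed for illustration)
    angle = {"Left": (-0.4, -0.2, 0.0), "Center": (-0.1, 0.0, 0.1), "Right": (0.0, 0.2, 0.4)}
    speed = {"Slow": (-0.5, 0.0, 0.5), "Medium": (0.3, 1.0, 1.7), "Fast": (1.3, 2.0, 2.7)}
    # consequent forces: +1 push right, -1 push left, +/-2 strong push
    rules = [("Left", "Slow", 1), ("Left", "Medium", 1), ("Left", "Fast", 2),
             ("Center", "Slow", 0),
             ("Center", "Medium", -1 if theta_dot > 0 else 1),
             ("Center", "Fast", -2 if theta_dot > 0 else 2),
             ("Right", "Slow", -1), ("Right", "Medium", -1), ("Right", "Fast", -2)]
    num, den = 0.0, 1e-9
    for a_lbl, s_lbl, force in rules:
        lo_a, up_a = it2_trimf(theta, angle[a_lbl])
        lo_s, up_s = it2_trimf(abs(theta_dot), speed[s_lbl])
        w = 0.5 * (lo_a * lo_s + up_a * up_s)  # average of lower/upper firing strengths
        num += w * force
        den += w
    return num / den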
Table 2. Pseudo-code of the algorithm.

Hybrid Interval Type-2 Fuzzy Dueling Double DQN with Reward Shaping

Step 1—IT2 Synthesis with LMI:
1. Divide the FOU into τ + 1 sub-FOUs and construct $h_{ij}^{l}$ and $\bar{h}_{ij}^{l}$.
2. Solve the LMIs in CVX/SDPT3 and obtain $G_j = N_j X^{-1}$; this is the "stable backup policy."

Step 2—Define the Lyapunov Candidate and Safe Set:
1. Take $v(x) = J^{\pi_\theta}(x)$ (value/cost function); this function serves both as a Lyapunov candidate and as a performance measure.
2. Estimate the Lipschitz constants $L_v, L_f, L_\pi$ (for networks, the spectral bound of gradients/weights can be used); add a regularizing constraint on $L_\pi$ to the loss.

Step 3—Uncertainty Model and Confidence Band:
1. Train a GP on the map $f$ (or on the "model error" $g$) to obtain $\mu_n, \sigma_n$; select $\beta_n$ according to the RKHS bound, so that $|f - \mu_n| \le \beta_n \sigma_n$ holds with probability $1 - \delta$.
2. On the grid $X_\tau$, check where $u_n(x, \pi(x)) \le v(x_t) - L_{\Delta v}\,\tau$ is satisfied; find the largest $c_n$ such that the level set $V(c_n)$ lies entirely inside $R_n^{\pi}$.

Step 4—Optimize the Policy with the Safety Constraint:
$\min_\theta \sum_{x \in X_\tau} \Big[ r(x, \pi_\theta(x)) + \gamma\, J^{\pi_\theta}\big(\mu_{n-1}(x, \pi_\theta(x))\big) + \lambda \big( u_n(x, \pi_\theta(x)) - v(x) + L_{\Delta v}\,\tau \big) \Big]$
After each episode, update the safe sets $D_n, S_n$ and collect safe data.

Step 5—Real-Time Control Filter (Projection/QP):
If during execution $u_n(x_t, \hat{u}) > v(x_t) - L_{\Delta v}\,\tau$, solve a local QP $\min_{\tilde{u}} \|\tilde{u} - \hat{u}\|^2$ subject to the same constraint, or switch directly to $u_{\mathrm{IT2}}(x_t)$ (the backup policy).
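As a complement to the pseudo-code, the following sketch shows two ingredients emphasized in the paper: an additive shaped reward that prioritizes angle stability and survival, and the Double DQN target in which the online network selects the next action while the target network evaluates it. The shaping weights and function names are illustrative assumptions, not the exact values used in the experiments.

import torch

def shaped_reward(env_reward, theta, theta_dot, w_angle=1.0, w_vel=0.1, bonus=0.05):
    """Multi-component shaping: penalize angle/velocity deviation, reward survival (weights assumed)."""
    return env_reward - w_angle * theta**2 - w_vel * theta_dot**2 + bonus

def double_dqn_target(online_net, target_net, next_states, rewards, dones, gamma=0.99):
    """Double DQN target: action selection by the online net, evaluation by the target net."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # select
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluate
        return rewards + gamma * (1.0 - dones) * next_q

Decoupling selection from evaluation in this way is what mitigates the overestimation bias discussed in the overestimation experiment (Figure 13).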
Table 3. Ablation study results.

Configuration | Max Episode Reward | Success Rate (%) | Convergence Episodes
Complete Hybrid | 678.2 | 100.0 | 21
w/o Interval Type-2 Fuzzy | 542.3 | 92.0 | 53
w/o Double DQN | 518.7 | 88.0 | 65
w/o Dueling Architecture | 502.4 | 85.0 | 72
w/o Reward Shaping | 287.1 | 65.0 | 148
w/o Adaptive α | 598.6 | 97.0 | 35
Interval Type-2 Fuzzy Only | 567.3 | 93.0 | 47
Dueling Double DQN w/ Adaptive Reward Shaping (No Fuzzy) | 551.3 | 95.0 | 53
Table 4. Performance comparison.

Method | Success Rate (%) | Max Episode Reward | Episodes to Converge
Base DQN | 98.0 ± 1.2 | 587.3 ± 42.1 | 112 ± 15
PPO | 0.0 ± 0.0 | −115.4 ± 8.2 | Never converged
C51 | 96.0 ± 2.1 | 592.7 ± 38.5 | 95 ± 12
QR-DQN | 0.0 ± 0.0 | −112.8 ± 7.6 | Never converged
PER | 97.0 ± 1.8 | 578.2 ± 45.3 | 104 ± 14
Dueling Double DQN | 95.0 ± 2.3 | 562.4 ± 40.8 | 108 ± 16
Double DQN | 96.5 ± 1.9 | 575.8 ± 39.2 | 102 ± 13
Safe Hybrid (ours) | 100.0 ± 0.0 | 678.2 ± 21.5 | 21 ± 3
Table 5. Hyperparameter details and tuning (values in braces were searched; the arrow marks the selected value).

Hybrid Controller:
  Learning rate: {1 × 10⁻⁵, 5 × 10⁻⁵, 1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³} → 5 × 10⁻⁴
  Batch size: {32, 64, 128, 256} → 128
  Buffer size: {10,000, 50,000, 100,000} → 50,000
  γ (discount): {0.95, 0.99, 0.995} → 0.99
  τ (target update): {0.001, 0.005, 0.01} → 0.005
  ε decay: {0.99, 0.995, 0.997} → 0.995
  α decay: {0.995, 0.997, 0.999} → 0.997
  Hidden dimension: {128, 256, 512} → 256

PPO Agent:
  Learning rate: {1 × 10⁻⁵, 3 × 10⁻⁴, 1 × 10⁻³} → 3 × 10⁻⁴
  Batch size: {32, 64, 128} → 64
  γ (discount): {0.95, 0.99, 0.995} → 0.99
  GAE λ: {0.9, 0.95, 0.99} → 0.95
  Clip ε: {0.1, 0.2, 0.3} → 0.2
  PPO epochs: {5, 10, 20} → 10
  Hidden dimension: {64, 128, 256} → 256

Standard DQN:
  Learning rate: {1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³} → 1 × 10⁻³
  Batch size: {32, 64, 128} → 64
  Buffer size: {10,000, 50,000} → 10,000
  γ (discount): {0.95, 0.99, 0.995} → 0.99
  ε decay: {0.99, 0.995, 0.997} → 0.995
  Hidden dimension: {64, 128, 256} → 128
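For reproducibility, the selected values for the hybrid controller in Table 5 can be gathered into a single configuration object, for example (a sketch; the field names are our own, not taken from the released code):

HYBRID_CONFIG = {
    "learning_rate": 5e-4,
    "batch_size": 128,
    "buffer_size": 50_000,
    "gamma": 0.99,          # discount factor
    "tau": 0.005,           # soft target-update rate
    "epsilon_decay": 0.995, # exploration decay
    "alpha_decay": 0.997,   # decay of the fuzzy influence term
    "hidden_dim": 256,
}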
Table 6. Computational complexity analysis.

Algorithm | Inference Time per Step | Total Training Time for 500 Episodes
Standard DQN | 0.8 ms | 1.5 min (due to early failures)
Hybrid T2F-D2DQN | 1.2 ms | 4.8 min
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
