1. Introduction
Model Predictive Control (MPC) has proved to be a highly effective control strategy that uses a dynamic model of the system to predict its future behavior and to optimize control commands in real time. At every sampling instant, the method solves a constrained finite-horizon optimization problem so that the optimal inputs respect operational constraints such as actuator limits, safety boundaries, and physical feasibility.
The flexibility of MPC in handling multivariable systems and in embedding constraints directly into the control formulation makes it applicable in diverse fields, including chemical process control, automotive systems, aerospace engineering, and robotics [1].
The ability to react dynamically to disturbances or model inaccuracies makes MPC important for maintaining product quality in industrial settings [1]. In robotics, and especially in manipulator control, image-based visual servoing (IBVS) successfully applies MPC as a constrained optimization formulation that casts the nonlinear control problem in a way that incorporates geometric and kinematic constraints [2]. Despite this, MPC is highly dependent on the predictive model on which it operates: model mismatch, unmodeled dynamics, or external uncertainty can degrade control performance when the modeling technique lacks robustness or adaptivity.
Reinforcement learning (RL), meanwhile, is a promising area of machine learning whose approach differs markedly from classical methods of enhancing traditional control strategies. Unlike classical control techniques that rely primarily on imposed rules or preset tuning processes, RL enables autonomous agents to learn optimal decision-making policies through interaction with an environment. The agent receives feedback via scalar rewards or penalties and shapes its strategy to maximize cumulative reward over time. RL has yielded excellent results in domains such as autonomous driving, robotic manipulation, and game playing, where modeling precision and adaptability are critical [3,4].
Recent advances in deep reinforcement learning (DRL), especially actor-critic architectures such as Deep Deterministic Policy Gradient (DDPG) and its improved variant Twin Delayed Deep Deterministic Policy Gradient (TD3), have brought great success in training effective continuous control policies. These algorithms combine deep networks with off-policy learning and target networks to stabilize training and improve convergence. A multi-objective RL paradigm has also been proposed that uses multi-head critics to decompose composite reward signals, enabling the learning of complex robotic tasks with multiple criteria [5].
Such approaches enable both learning efficiency and improved overall performance in high-dimensional, uncertain environments. Integrating RL into traditional control frameworks is therefore attracting increasing attention in work on hybrid control systems, and many hybrid schemes combine RL with Proportional-Integral-Derivative (PID) control or MPC.
For example, several researchers have used actor-critic RL algorithms to estimate and update PID parameters on-the-fly, improving robustness and response in uncertain environments [6]. Similarly, DRL has been used to solve multivariable coupled control problems by designing custom reward functions and control structures that guide learning toward optimal control strategies [7,8]. Interestingly, some studies even invert the use of PID concepts in deep RL, integrating PID-like mechanisms into the encoder architecture to improve feature extraction and learning stability [9].
Among control paradigms, the combination of RL and MPC is particularly favored because of their complementary advantages: MPC specializes in constraint handling and short-term planning, while RL affords learning and long-term adaptability. Together they enable intelligent control systems that adapt to changing dynamics and optimize performance over much longer horizons. This synergy has brought improvements to wheeled robot navigation, where coordinated RL-MPC systems show better disturbance rejection and tracking precision [10], and to large-scale networked systems, where RL-enhanced MPC ensures effective link-level path guidance [11]. In addition, distributed MPC approaches, bolstered by multi-agent RL function approximation, appear to be a promising way of tackling the twin challenges of coordination and scalability in such complex systems [12].
Nevertheless, amidst these advances, the integration of RL into classical MPC remains an arduous task. One main challenge is the accuracy of model predictions: since MPC relies on a predictive model, substituting it with a black-box RL mechanism may prove detrimental to interpretability and can introduce additional sources of instability. Even where online learning is feasible, the real-time nature of MPC subjects the controller to rigid timing limits. An integration approach must therefore strike a balance, preserving the inherent structural advantages of MPC while exploiting the flexibility offered by RL.
In our proposed partially integrated RL-MPC framework, the RL agent does not take over the predictive model entirely; rather, it cooperates with it in a prescribed manner. More specifically, the RL agent improves the output of the prediction model through feedback correction, making real-time model adaptation possible while preserving physical interpretability. The effect of two widely used DRL algorithms, DDPG and TD3, on control performance was evaluated within this framework. TD3 and DDPG were chosen among deep reinforcement learning algorithms because they fit the requirements of continuous control tasks, provide deterministic policy outputs, and run within real-time model predictive control frameworks. These properties keep the feedback correction within the MPC loop stable and interpretable.
The superior performance of TD3 over DDPG in the proposed hybrid RL-MPC framework can be attributed to design innovations that target limitations typically inherent in DDPG. TD3 employs a dual critic network structure that reduces Q-function overestimation during policy updates by taking the minimum of the two critic estimates; this conservative Q-value estimation mitigates the overestimation bias that can destabilize DDPG in continuous control tasks. TD3 also delays policy updates so that the critic networks have a chance to converge before the actor is updated, which promotes stability in the learning process. Finally, target policy smoothing prevents the policy from overfitting to sharp peaks in the estimated Q-function landscape. These features are all the more beneficial for the partially integrated RL-MPC system, where stability is a requirement and erratic perturbative actions could otherwise destabilize real-time trajectory tracking in a dynamic environment. Such algorithmic modifications explain the superiority of TD3 observed in both simulations and real-world experiments.
Furthermore, this study provides a systematic comparison between two configuration paradigms: decoupled, in which the RL agent is trained independently and only interacts with the MPC module during execution, and continuous, in which the RL agent keeps learning and updating its policy alongside the MPC in a tightly coupled loop.
Through simulation-based experiments on trajectory tracking and real-world tests on speed and direction control, we show that a TD3-based decoupled configuration achieves superior control performance compared with standard predictive models. We analyze the trade-offs between the two paradigms, with an emphasis on the potential of the continuous configuration in particular experimental settings. The main contributions of this work are summarized as follows:
- Partial Integration Framework: A novel hybrid architecture that integrates reinforcement learning with model predictive control without entirely replacing the predictive model, thus retaining interpretability and allowing for adaptive refinement using feedback correction.
- Comparative Analysis of DRL Algorithms: An in-depth comparative analysis of the DDPG and TD3 algorithms adopted in the framework, looking primarily at computational efficiency and control performance.
- Comparison of Configuration Paradigms: A systematic investigation of the decoupled versus continuous integration paradigms, providing perspectives on their relative strengths and appropriate deployment scenarios.
The choice of TD3 and DDPG for augmenting the predictive model within the MPC loop deserves explicit justification. Compared with other RL algorithms such as SAC, PPO, or A3C, TD3 and DDPG fit the requirements of this application: deterministic policies, compatibility with continuous action spaces, stability, real-time performance, and easy integration with MPC in safety-critical, embedded control systems. The alternatives, despite strengths in exploration and data efficiency, are stochastic, more resource-intensive, or less predictable under the conditions considered here, which makes TD3 and DDPG the appropriate match for the system requirements.
2. Literature Review
Although coupling Reinforcement Learning (RL) with Model Predictive Control (MPC) promises improved control performance under uncertainty, it also raises challenges that must be addressed to make real-world application feasible. A primary limitation of traditional RL-based MPC (RL-MPC) frameworks is their heavy computational requirements and slow adaptation: unlike classical MPC with deterministic or well-defined stochastic models, RL-MPC often requires extensive exploration of the state-action space to learn effective control policies [13]. The time and resources this learning demands can become prohibitive, especially in dynamic environments where fast decision making is essential. Interaction with the environment not only affects convergence speed but can also raise serious safety and feasibility concerns during early learning.
As the complexity of the environment increases, for example under high dimensionality or partial observability, the sampling efficiency of RL algorithms deteriorates rapidly. In these cases the agent fails to collect informative experiences efficiently, the available data become limited, and policy updates are poor [14]. The problem is sharper still in networked or multi-agent control systems, where coordinating decision making across entities adds another layer of complexity. As a result, many RL-MPC implementations experience difficulties in convergence, particularly in real-time control tasks affected by delays or inaccurate information, which can lead to system instability or performance degradation.
The increased representational power gained by introducing deep learning into RL-MPC systems also brings complications. Model approximation with deep neural networks (DNNs) is prone to inaccuracies caused by overfitting, generalization errors, or instabilities during training, and these inaccuracies tend to propagate through the MPC loop, potentially reducing prediction accuracy and control reliability. Furthermore, the black-box character of deep networks hinders interpretation, making it difficult to diagnose failures or guarantee robustness in safety-critical applications [15].
To overcome these limitations, hybrid models have been researched that combine the advantages of model-based control with the adaptability of data-driven learning. One of the most prominent directions uses Recurrent Neural Networks (RNNs), and specifically Long Short-Term Memory (LSTM) networks, to capture temporal dependencies in sequential control tasks. Mahya et al. [16] showed that coupling LSTM modules into an RL-MPC framework improves the system's capability to model long-term dynamics and enhances prediction accuracy in time-series control. Encoding historical state information also supplies the RL agent with richer contextual cues for decision making, improving both learning efficiency and control performance.
A further promising avenue employs Hierarchical Reinforcement Learning (HRL) to decompose complex control tasks into simpler submodules. HRL structures the learning process into hierarchical levels in which a high-level policy defines goals or abstract actions for a low-level controller to realize as specific motor commands or control signals [17]. This hierarchical decomposition aids learning scalability and knowledge transfer. In robotic manipulation or autonomous navigation, HRL allows agents to learn high-level strategies such as path planning and obstacle avoidance together with low-level policies for trajectory execution and fine control.
Beyond architectural novelties, the last decade has also strengthened RL-MPC systems through enhancements in neural network design. Graph neural networks (GNNs), transformers, attention-based architectures, and their derivatives allow for better modeling of agent relations, more efficient handling of long-range dependencies, and appropriate processing of structured data [18]. This is particularly relevant for distributed control among agents, whether applied to MPC frameworks or to environments with spatial and temporal dynamics. A GNN can model the physical interactions among the components of a mechanical system, while a transformer supports parallel processing of sequential inputs, accelerating the training and inference of both RL and MPC.
Another motivation is to reduce the burden of large-scale data collection and to accelerate convergence in novel environments, which has led to the consideration of meta-learning and transfer approaches. By transferring knowledge from previously encountered environments, these methods allow reinforcement learning agents to reuse what worked before rather than learning from scratch.
RL-MPC thus remains a very active research front: one line of work addresses computational efficiency, stability, and adaptiveness, while others explore hybrid architectures, domain-specific adaptations, and deep learning methods that remain faithful to the salient features of classical control theory while drawing on the adaptive options of modern machine learning.
4. Model Predictive Control: Principles and Workflow
Model Predictive Control (MPC) is a widely utilized optimization-based control strategy that computes control actions by predicting future system behavior over a finite horizon. Its ability to handle multi-objective optimization while respecting state and input constraints makes it highly effective in applications such as robotics, process control, and autonomous systems [23,24].
4.1. Structure of MPC
MPC’s structure comprises three core components [24], as Figure 2 shows:
(1) Predictive Model: The system model predicts future states based on control actions. Nonlinear and data-driven models, such as neural networks, are increasingly used to enhance prediction accuracy in complex systems [25].
(2) Objective Function: The objective function encapsulates the control objectives, balancing system performance and energy efficiency. This part is also called cost calculation.
(3) Feedback Correction: A dynamic correction term compensates for prediction errors, improving robustness and ensuring accurate trajectory tracking.
In the MPC prediction step, the kinematic model of the wheeled mobile robot (WMR) is employed as the predictive dynamic model. This model considers the WMR’s motion constraints, including the non-holonomic constraint that prevents lateral slipping, and assumes operation in low-speed conditions to maintain model accuracy. The robot state vector comprises the position coordinates (x, y), the orientation defined by the angle θ, the linear velocity v, and the angular velocity ω.
The control input vector comprises the linear velocity v and the angular velocity ω, under constraints in which physical feasibility and safety supersede other design considerations. State constraints place bounds on the operational workspace, while input constraints limit v and ω according to the actuator limits and avoid jerky maneuvers. Rate-of-change constraints on v and ω are also imposed to ensure smooth control actions, minimizing excessive jolt and improving stability for trajectory tracking.
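For illustration, the input and rate-of-change constraints take the following generic form (the bound symbols are placeholders; the numerical values used in the experiments are not restated here):

\[
v_{\min}\le v\le v_{\max},\qquad \omega_{\min}\le \omega\le \omega_{\max},\qquad
|\Delta v|\le \Delta v_{\max},\qquad |\Delta\omega|\le \Delta\omega_{\max}.
\]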
4.2. Principles of MPC
MPC operates on the principle of receding horizon optimization. At each time step, it solves an optimization problem based on a system model to predict future states and compute a sequence of control actions. Only the first action is applied, and the process is repeated at subsequent steps to incorporate feedback and account for disturbances.
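A minimal Python sketch of this receding-horizon loop is given below; the function names solve_finite_horizon_problem and plant_step are placeholders rather than the implementation used in this work.

import numpy as np

def receding_horizon_control(x0, x_ref, n_steps, solve_finite_horizon_problem, plant_step):
    """Generic MPC loop: optimize over the horizon, apply only the first input, repeat."""
    x = np.asarray(x0, dtype=float)
    applied_inputs = []
    for t in range(n_steps):
        # Solve the constrained finite-horizon problem from the current state;
        # the solver returns the whole optimal input sequence over the horizon.
        u_sequence = solve_finite_horizon_problem(x, x_ref[t:])
        u0 = u_sequence[0]            # only the first action is applied
        x = plant_step(x, u0)         # feedback: the measured next state re-enters the loop
        applied_inputs.append(u0)
    return np.array(applied_inputs)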
Because acquiring exact feedback of the state variables is inherently complex, this study employs an optimization formulation, the cost function of Equation (18), in which a relaxation factor is introduced together with an upper limit on its value. The prediction time horizon is denoted N_p and the control time horizon N_c.
The core objective is to drive the state tracking error to zero; a second term penalizes the error between successive manipulated variables; and a terminal-state error term drives the terminal error to zero, which helps ensure that the system can be stabilized [23]. The relaxation factor prevents infeasible solutions. The length of the predicted sequence is referred to as the prediction time horizon N_p.
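The exact notation of the original formulation is not recoverable from this text; a standard MPC cost with a slack (relaxation) variable, consistent with the description above, is

\[
\min_{\Delta U,\,\varepsilon}\;
\sum_{i=1}^{N_p}\bigl\lVert x(k+i\mid k)-x_{\mathrm{ref}}(k+i)\bigr\rVert_Q^2
+\sum_{i=0}^{N_c-1}\bigl\lVert \Delta u(k+i)\bigr\rVert_R^2
+\bigl\lVert x(k+N_p\mid k)-x_{\mathrm{ref}}(k+N_p)\bigr\rVert_P^2
+\rho\,\varepsilon^2,
\qquad 0\le \varepsilon\le \varepsilon_{\max},
\]

where Q, R, and P are weighting matrices, ρ penalizes the slack ε, and ε_max is its upper limit (these symbol names are introduced here for illustration).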
4.2.1. Feedback Correction for Improved Accuracy
To enhance trajectory tracking and adapt to unmodeled dynamics, MPC is augmented with a feedback correction term that adjusts the control input based on observed trajectory deviations. The trajectory error e_t is the difference between the observed output y_t and the predicted output ŷ_t; the resulting feedback term Δu_t is dynamically updated at each time step, ensuring real-time adaptability to system disturbances and uncertainties.
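A compact way to write this correction, using K_f as a hypothetical name for the feedback gain matrix mentioned in Section 6.1, is

\[
e_t = y_t - \hat{y}_t,\qquad \Delta u_t = K_f\,e_t,
\]

where y_t is the observed output and ŷ_t the predicted output; in the hybrid scheme of Section 6, Δu_t is additionally scaled by the RL policy output πϕ.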
The entire prediction process closely resembles that of a typhoon track forecasting system. Essentially, the future path of a typhoon is initially calculated using meteorological dynamics models and subsequently adjusted based on the typhoon’s current actual position. Depending on research requirements, this process can also be divided into coarse-tuning and fine-tuning sections. If influenced by other models or external adjustments, fine-tuning can significantly enhance the system’s effectiveness. Algorithm 1 shows the pseudo-code for applying feedback correction in the hybrid RL-MPC loop.
Algorithm 1: Pseudo-code for Applying Feedback Correction in the Hybrid RL-MPC Loop

# Initialize components
Initialize the MPC model with kinematic equations and constraints.
Initialize the RL agent (TD3 or DDPG) with a pre-trained actor network.
Set initial robot state s_0.

For each time step t, perform the following:
    # Step 1: Observe Current State
    s_t = get_current_state()                      # Includes [x, y, θ, v, ω] or similar
    # Step 2: Generate Nominal Prediction using MPC
    x_nom_t_plus_1 = mpc_predict(s_t)
    # Step 3: Compute Tracking Error
    e_t = compute_error(s_t, reference_trajectory)
    # Step 4: RL Agent Policy Evaluation
    πϕ = rl_actor_network.predict(s_t)             # Outputs scaling factor for correction
    # Step 5: Compute Feedback Correction Term
    Δu_t = compute_feedback_correction(e_t)
    # Step 6: Apply Scaled Feedback Correction to Prediction Model
    x_pred_t_plus_1 = x_nom_t_plus_1 + Δu_t * πϕ
    # Step 7: Solve MPC Optimization with Updated Prediction
    u_t = solve_mpc_optimization(x_pred_t_plus_1, constraints)
    # Step 8: Apply Control Input to Robot
    apply_control(u_t)
    # Step 9: (Optional) Update RL Agent if in Continuous Paradigm
    if continuous_mode:
        store_transition_in_replay_buffer(s_t, u_t, reward, s_t_plus_1)
        train_rl_agent()
end for
4.2.2. Constraints
The constraints are divided into hard constraints, soft constraints, and physical constraints.
Hard constraints require the manipulated variables and their errors to remain within the prescribed bounds at all times.
Soft constraints are those expressed in Equation (20), which the optimization drives toward a minimum.
Physical constraints include limits on rotation angle, speed change, and smoothness, and need to be adjusted according to the experimental conditions.
4.2.3. Rolling Optimization
Rolling optimization refers to the process of determining a set of optimal future inputs so that the output reaches a predetermined reference in a specified form.
This process constitutes a fundamental step in MPC. MPC inherently calculates a sequence of optimal future inputs, and the length of this sequence is referred to as the control time horizon, denoted N_c. The optimal input matrix represents the operational strategy for multiple future steps, but at any given moment the controller executes only the first step of this sequence.
The term ‘rolling’ indicates the transition to the next moment. Rather than executing the second step derived from the previous calculation, the controller engages in ‘output prediction’, ‘feedback correction’, and ‘solution optimization’ once more to derive a new set of optimal input sequences. Again, only the first step is executed.
Consequently, at each sampling time, the operational steps of MPC remain consistent: ‘Output prediction’ → ‘Feedback correction’ → ‘Solution optimization’ → ‘Execute the first step’.
4.3. Design of Rolling Optimization
The design of the rolling optimization module primarily involves the use of optimization algorithms to calculate the minimum value.
The goal is to transform the objective function of Equation (18) into the standard quadratic form:
4.3.1. Goal Function Transforming
The terminal error can be integrated into the quadratic cost; the term that does not depend on the decision variables is a constant and can be dropped from the optimization.
4.3.2. Constraint Transforming
The objective function and constraints are reformulated into the standard quadratic form, while the constraints are imposed on the decision variable (the control increment sequence).
4.3.3. Results Solving with Quadratic Programming (QP)
By solving the resulting quadratic program, the optimized input sequence can be obtained.
The control value acting on the system is taken from this sequence: following the rolling optimization approach, the MPC applies only the first control quantity of the solution at each step.
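A sketch of this step is shown below, under the assumption that the problem has already been condensed into the standard QP form min_z 0.5·zᵀHz + fᵀz subject to Az ≤ b; the matrices H, f, A, and b are placeholders for the condensed MPC matrices, and cvxpy is used here only for illustration.

import cvxpy as cp
import numpy as np

def solve_mpc_qp(H, f, A, b):
    """Solve min 0.5*z'Hz + f'z  s.t.  Az <= b and return the optimal sequence z."""
    z = cp.Variable(H.shape[0])
    cost = 0.5 * cp.quad_form(z, H) + f @ z          # H assumed positive semidefinite
    problem = cp.Problem(cp.Minimize(cost), [A @ z <= b])
    problem.solve()
    return z.value

# Tiny numerical example with two input increments over the control horizon:
H = np.diag([2.0, 2.0]); f = np.array([-1.0, -0.5])
A = np.vstack([np.eye(2), -np.eye(2)]); b = np.ones(4)   # |z_i| <= 1
z_opt = solve_mpc_qp(H, f, A, b)
u_first = z_opt[0]   # rolling optimization applies only the first control increment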
4.4. Process Workflow
The MPC workflow (see Figure 3) follows an iterative cycle:
(1) State Estimation: The current state is measured or estimated.
(2) Prediction: The predictive model computes future states over the prediction horizon N_p based on candidate control inputs.
(3) Optimization: The control sequence over the control horizon N_c is determined by minimizing the cost function under constraints (Equation (18)).
(4) Implementation: The first control action is applied to the system.
(5) Feedback Update: New state measurements and feedback corrections are used to repeat the process.
Part of the controller parameters can be set as in Table 1 below.
5. Method Choice of Reinforcement Learning
Twin Delayed Deep Deterministic Policy Gradient (TD3) [26] is an advanced reinforcement learning algorithm designed for continuous action spaces. It is based on the actor-critic framework and improves upon the Deep Deterministic Policy Gradient (DDPG) [27] by introducing several innovations to enhance stability and reduce overestimation bias in Q-value estimation. This section provides a detailed explanation of the principles, structure, and workflow of TD3.
Figure 4 illustrates the TD3 agent.
Structurally, TD3 incorporates an additional critic network compared to DDPG. Specifically, TD3 employs two critic networks to compute Q-values and selects the minimum of their outputs. This enhancement mitigates excessive bias in Q-value estimation, improving the algorithm’s stability. In TD3, the actor network is updated less frequently than the critic networks, giving the critics more time to optimize before each actor update and reducing the target deviation of the actor network. Furthermore, TD3 smooths the target policy by introducing noise during the calculation of the target Q-value to prevent policy overfitting.
5.1. Principles
TD3 operates within the framework of reinforcement learning, where the goal is to learn an optimal policy π that maximizes the expected cumulative reward:
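In standard notation (the paper's exact symbols are not shown here; γ is the discount factor, set to 0.99 in this study, see Section 6.5), this objective can be written as

\[
J(\pi)=\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right].
\]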
TD3 uses an actor-critic structure with two key components and one additional step:
(1) Actor: The actor network maps states to actions deterministically.
(2) Critic: Two critic networks estimate the state-action value function, which guides the actor’s updates.
(3) Output: The output is combined with noise to avoid overfitting.
The main principles behind TD3 include:
(1) Using two critic networks to reduce overestimation bias in Q-value estimation.
(2) Updating the actor network less frequently than the critic networks to ensure stable training.
(3) Smoothing the target policy to avoid overfitting to narrow Q-value peaks.
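A PyTorch-style sketch of these three mechanisms (clipped double-Q targets, target policy smoothing, delayed actor and target updates) is given below; the network, optimizer, and buffer objects as well as the hyperparameter values are illustrative assumptions, and actions are assumed normalized to [-1, 1].

import torch

def td3_update(batch, actor, actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
               critic_opt, actor_opt, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (actor_tgt(s2) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q learning: take the minimum of the two target critics
        q_target = r + gamma * (1 - done) * torch.min(critic1_tgt(s2, a2),
                                                      critic2_tgt(s2, a2))

    # Update both critics toward the shared target
    critic_loss = torch.nn.functional.mse_loss(critic1(s, a), q_target) + \
                  torch.nn.functional.mse_loss(critic2(s, a), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy update: update the actor (and target networks) every `policy_delay` steps
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, tgt in [(actor, actor_tgt), (critic1, critic1_tgt), (critic2, critic2_tgt)]:
            for p, p_tgt in zip(net.parameters(), tgt.parameters()):
                p_tgt.data.mul_(1 - tau).add_(tau * p.data)   # soft (Polyak) target update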
5.2. Structure
In each iteration k, the TD3 algorithm consists of the following components:
(1) Actor Network: A neural network, parameterized by ϕ, that outputs the deterministic action for a given state.
(2) Critic Networks: Two independent neural networks, parameterized by θ1 and θ2, that estimate the Q-values Qθ1(s, a) and Qθ2(s, a).
(3) Target Networks: Separate target actor and critic networks, parameterized by ϕ′, θ1′, and θ2′, used to compute stable targets during training.
(4) Replay Buffer: A buffer that stores transitions (s, a, r, s′) for off-policy training.
5.3. Reward Function Design
In this design, only the tracking error penalty and control input penalty are considered. The reward function is expressed as follows:
The components are defined as follows:
(1) Action Error Penalty: penalizes the deviation of the actual linear velocity v and yaw rate ω from their reference values v_ref and ω_ref, each term scaled by its own weighting factor.
(2) Space Error Penalty: penalizes the deviation of the actual X- and Y-coordinates from their reference coordinates, again with corresponding weight factors.
During the lane-changing trajectory tracking process, if the system demonstrates stable and rapid trajectory tracking, the variables will be relatively close to the expected actions and spatial positions. In this case, the reward value will approach the maximum value of 1. Conversely, if the system exhibits poor stability and slower speed during trajectory tracking, the variables will deviate significantly from the expected actions and spatial positions, resulting in the reward value approaching the minimum value of 0. Therefore, the reward value serves as an indicator of the control performance of the actual control system.
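The exact functional form of the reward is not restated here; one sketch consistent with a reward bounded in [0, 1] that penalizes action and space errors is given below (the weight values and the exponential shaping are illustrative assumptions, not the tuned values of this study).

import numpy as np

# Hypothetical weights; the tuned values are not reported here.
W_V, W_W, W_X, W_Y = 1.0, 1.0, 1.0, 1.0

def reward(v, w, x, y, v_ref, w_ref, x_ref, y_ref):
    """Reward in [0, 1]: 1 for perfect tracking, approaching 0 for large errors."""
    action_penalty = W_V * (v - v_ref) ** 2 + W_W * (w - w_ref) ** 2
    space_penalty = W_X * (x - x_ref) ** 2 + W_Y * (y - y_ref) ** 2
    return float(np.exp(-(action_penalty + space_penalty)))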
The performance of the hybrid RL-MPC system is evaluated primarily with ITAE and MAE, which reflect how well the system tracks but do not capture aspects such as trajectory smoothness, control effort, and reliability. Additional measures such as RMSE, success rate, control effort, and jerk index therefore give a broader picture of real-time behavior; RMSE, control effort, and stability margins are examined in Sections 7.5 and 7.6. Among the setups, the TD3-based decoupled arrangement handles transient disturbances better, with comparatively high success rates at lower energy use and mechanical stress. The reward-function weights were not tuned with a formally defined strategy; a systematic process combining baseline initialization, iterative trade-off-based adjustment, normalization, and optional automated optimization, balancing accuracy, smoothness, and efficiency, would greatly improve the transparency, reproducibility, and effectiveness of training. These evaluation-metric and reward-design considerations strengthen the methodological foundation of the study.
5.4. Workflow
The TD3 algorithm alternates between interacting with the environment and updating the networks based on sampled transitions. The detailed workflow is shown in Algorithm 2.
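A condensed sketch of this alternation between environment interaction and network updates is given below; the environment interface, actor.act, and the buffer class are illustrative assumptions, and td3_update refers to the update sketch in Section 5.1.

import numpy as np

def run_episode(env, actor, buffer, networks, max_steps=2000, expl_noise=0.1):
    # networks = (actor, actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
    #             critic_opt, actor_opt), matching the td3_update signature above.
    s = env.reset()
    for step in range(max_steps):
        # Select an action with exploration noise and interact with the environment
        a = np.clip(actor.act(s) + expl_noise * np.random.randn(env.action_dim),
                    -1.0, 1.0)
        s2, r, done = env.step(a)
        buffer.add(s, a, r, s2, done)

        # Update the networks from a sampled mini-batch (off-policy learning)
        if len(buffer) >= 128:
            td3_update(buffer.sample(batch_size=128), *networks, step=step)

        s = s2
        if done:
            break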
6. MPC Hybrid Model Architecture Integration with TD3 and DDPG
6.1. Integration Analysis
MPC is a promising approach for handling the lane-change trajectory tracking problem. In the MPC control process, the internal prediction model plays a crucial role in determining control performance. This prediction model supports predictive control by forecasting future control sequences; however, it is highly susceptible to external interference. Reinforcement learning, with its capacity to interact with the external environment, enhances the accuracy of the MPC prediction model and enables real-time reflection of the external environment. Many studies focus on correcting model outputs [28,29]. This paper takes a step further by directly correcting the prediction model itself; the RL component maintains a parallel relationship with the prediction model, which is referred to as partial integration.
To place the proposed RL-MPC framework in context, a comparative analysis of three integration paradigms is presented: full model replacement, feedback-only correction, and partial integration. Full model replacement substitutes the standard predictive model in MPC entirely with an RL component such as deep deterministic policy gradient (DDPG) or twin delayed deep deterministic policy gradient (TD3). These methodologies exploit RL’s capacity for adaptive learning to synthesize control policies without explicit system dynamics, which suits highly nonlinear and uncertain environments. However, complete reliance on RL raises interpretability, stability, and computational challenges owing to the inherently black-box nature of deep neural networks. Feedback-only correction, in contrast, uses RL to adjust the output of a fixed predictive model in the MPC loop, usually through some form of error compensation.
This keeps the basic MPC format while still allowing on-the-fly correction of disturbances or modeling inaccuracies. The method is computationally efficient and stable, but it cannot adapt the internal prediction model itself, which limits its long-term ability to recover. The proposed partial integration sits between these extremes: rather than fully replacing the predictive model or merely correcting its outputs, the RL agent works with the existing model to improve its online predictions through targeted feedback corrections. This hybrid approach lets classical MPC retain its benefits in physical interpretability and constraint handling while gaining model accuracy and robustness through RL adaptation. MPC performance is critically dependent on the fidelity of its predictive model, and any model-reality mismatch caused by unmodeled dynamics, external disturbances, or parameter variations can degrade control performance and even lead to instability. In the partially integrated RL-MPC framework, RL therefore corrects the model predictions of the MPC through feedback in order to eliminate the impact of model mismatch. The RL agent learns to compensate systematic errors and disturbances in real time by computing correction terms from the observed tracking errors, effectively aligning the imperfect model predictions with the actual system. As demonstrated through both simulations and real-world experiments, this combination allows the hybrid controller to maintain robustness and tracking performance in the presence of modeling inaccuracies.
Table 2 contrasts the three integration paradigms.
Partial integration, in contrast to full model replacement, which gives up model transparency in favor of flexibility, retains structural consistency while allowing adaptive refinement. Moreover, partial integration keeps the learned and modeled components separate, which preserves interpretability and analyzability.
This is extremely important in safety-critical applications, where explainability is crucial for validation and certification. Because the reinforcement learning (RL) agent in the partial integration scheme directly influences the prediction model, it smooths the estimation of future states and improves long-horizon planning. Unlike feedback-only correction, which changes only the final control input, this paradigm lets the RL agent shape the predicted future states and thereby improve plans over the long run.
According to the simulation results, the new paradigm improves trajectory tracking performance and reduces the effect of external disturbances and uncompensated dynamics. In the wheeled mobile robot (WMR) control tasks, the TD3-based decoupled configuration surpassed both the standard MPC and the DDPG-based continuous configuration in ITAE and MAE, especially in the more complex curve-tracking cases.
Other achievements include higher real-time responsiveness and reduced computational demand for the decoupled paradigm compared with the continuous paradigm, which makes the decoupled paradigm the better choice for real-time embedded control systems with stringent timing bounds. The decoupled paradigm is not itself partial integration but rather an implementation method within the partially integrated RL-MPC scheme; it emphasizes offline learning and fixed-policy application, benefiting real-time performance and stability, and is especially suited to resource-constrained robotic platforms. This demonstrates the strength of partial integration as a way to balance adaptability and fidelity in robotic systems: adaptable behavior without compromising reliability.
Partial integration in the proposed partially integrated RL-MPC architecture is quantitatively defined through a feedback correction mechanism that adapts the MPC prediction model in real time using an RL-generated scaling factor, via the corrective term Δu_t · πϕ, where Δu_t signifies the error-based correction term and πϕ denotes a state-dependent gain obtained from the TD3 or DDPG actor network. The tunable parameters in this framework include the learned weights of the RL policy πϕ, the feedback gain matrix used to compute Δu_t, which can be set manually or meta-optimized, and the reward function weights that govern the agent’s prioritization of action and space errors; together these parameters balance model interpretability against adaptive robustness.
This framework integrates MPC with TD3 and DDPG to adaptively optimize the prediction model, meaning that the nominal prediction can be corrected through more detailed calibration operations. This process enhances trajectory tracking accuracy and system robustness. Figure 5 shows the integrated control system.
6.2. Integration Method
After receiving the reinforcement-learning-based correction of the prediction model, the feedback correction module implements the adjustment of Equation (21), which is applied in Algorithm 2 as x_pred,t+1 = x_nom,t+1 + Δu_t · πϕ. In Equation (21), the parameter πϕ represents the adjustment decision derived from Equation (20). When πϕ is small, the feedback correction adjustment Δu_t · πϕ is also small, resulting in a longer adjustment time for the hybrid MPC system but a more stable adjustment process. Conversely, when πϕ is large, the feedback correction adjustment is also large, leading to a shorter adjustment time but a higher likelihood of unstable regulation. When πϕ equals 0, the controller refrains from making any adjustment, and when πϕ equals 1, the controller effectively functions as a traditional MPC controller. This step ensures the accuracy of the prediction model, bringing the predicted values closer to the actual values. Configuration details are given in Table 3.
The pseudocode for the integration of πϕ in the hybrid RL-MPC loop is illustrated in Algorithm 2.
Algorithm 2: The pseudocode for integration of πϕ in the hybrid RL-MPC loop

Initialize MPC model and TD3 agent (actor and critic networks)
Set initial state s_0

For each time step t, perform the following:
    # Step 1: State Observation
    Observe the current state s_t from the environment
    # Step 2: Nominal Prediction using MPC
    Predict nominal trajectory x_nom using the standard MPC model
    # Step 3: RL Agent Policy Evaluation
    Compute πϕ = μ(s_t|ϕ) + ε_t using the TD3 actor network
    # Step 4: Feedback Correction
    Compute Δu_t based on the tracking error e_t = x_t − x_ref_t
    # Step 5: Apply Scaled Correction
    Adjust prediction: x_pred_t_plus_1 = x_nom_t_plus_1 + Δu_t * πϕ
    # Step 6: Solve MPC Optimization with Updated Prediction
    Generate optimal control input u_t using updated prediction x_pred_t_plus_1
    # Step 7: Apply Control Input
    Execute u_t on the system
    # Step 8: Update RL Agent (if in training phase)
    Store transition (s_t, πϕ, r_t, s_{t+1}) in the replay buffer
    Sample a mini-batch and update the TD3 networks using the DDPG/TD3 update rules
end for
The term πϕ is crucial in the proposed hybrid RL-MPC system because it allows the RL agent to influence the prediction model directly rather than only the control output. Since πϕ is derived from the TD3 actor network, it enables more dynamic, data-driven adjustment of the internal MPC model, improving tracking and robustness. The pseudo-code clarifies the implementation steps and illustrates how πϕ connects model-based prediction with reinforcement-learning-based adaptation. Future work may look into different policy representations or adaptive scaling mechanisms to further enhance the interpretability and performance of πϕ across different control tasks.
6.3. Configuration Paradigm
During training and configuration, there are different configuration paradigms in reinforcement learning.
(1) Decoupled Paradigm: Training and deployment are separated. The agent interacts with the environment during training; once configuration is completed, no further policy updates are performed. In this paper, this is called the decoupled paradigm.
(2) Continuous Paradigm: Training and deployment are continuous. After initial training and configuration, the policy continues to be updated, exhibiting a high degree of continuity. In this paper, this is called the continuous paradigm.
Experiments on these two configuration paradigms with different reinforcement learning methods are presented in the following sections.
6.4. Stability Analysis
Guo [22] and Rawlings [23] have demonstrated system stability by constructing Lyapunov functions. However, the rolling optimization mechanism of MPC complicates the stability proof, because the optimal solution of the optimization problem at each sampling instant does not directly establish closed-loop stability. Researchers have therefore introduced terminal constraints: a Lyapunov equation is solved offline to obtain a terminal weighting matrix, and the system state is driven into a terminal set through finite-horizon control online, ensuring asymptotic stability. Such a control strategy serves as a relatively good approximation of infinite-horizon optimization [30]. In practice, MPC stability is typically cross-validated through theoretical analysis, simulation, and experimentation. For further clarity, the Lyapunov-based argument is illustrated below.
The proposed hybrid incorporates an RL agent in the feedback correction loop of the MPC without disturbing the core predictive model. This partial integration preserves the structural properties of the original MPC, in particular constraint handling and receding-horizon optimization. As shown in [23], asymptotic stability of standard MPC can be established using terminal constraints and a terminal cost defined from a Lyapunov function. In our case, since the RL component modifies the prediction model only through bounded corrections (see Equation (21)), the existing stability guarantees extend as follows. Let V(x) be a Lyapunov function candidate for the nominal MPC system satisfying the usual bounding and decrease inequalities; these inequalities are fulfilled because of the terminal constraints and cost matrix used in the rolling optimization. With the RL-enhanced prediction model, the closed-loop dynamics acquire an additional correction term introduced by the RL agent. Assuming this correction is bounded and diminishes over time (as observed throughout training and execution), the modified system remains confined to a compact invariant set in the vicinity of the equilibrium point. Hence, the perturbed system satisfies the nominal decrease condition up to a small positive bound ε_t that shrinks as the policy converges, which establishes practical asymptotic stability [30].
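The omitted relations follow the standard Lyapunov argument for MPC; a hedged reconstruction, with the comparison functions α_i, the correction term d_t, and the MPC control law κ_MPC introduced here only for illustration, is

\[
\alpha_1(\lVert x\rVert)\le V(x)\le \alpha_2(\lVert x\rVert),\qquad
V(x_{t+1})-V(x_t)\le -\alpha_3(\lVert x_t\rVert)
\]

for the nominal closed loop; with the RL-corrected prediction the dynamics become

\[
x_{t+1}=f\!\left(x_t,\kappa_{\mathrm{MPC}}(x_t)\right)+d_t,
\]

and boundedness of d_t yields

\[
V(x_{t+1})-V(x_t)\le -\alpha_3(\lVert x_t\rVert)+\epsilon_t .
\]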
In order to ensure bounded tracking error, we define the tracking error vector e_t = x_t − x_ref,t, where x_ref,t represents the reference trajectory. The dynamics of the tracking error are governed by both the MPC internal model and the RL-corrected predictions. By enforcing a terminal constraint that drives e_t into a terminal region, we ensure that the error remains bounded and converges to zero under mild assumptions on the smoothness of the RL correction function.
In summary, the method presented in this paper does not alter the logic of the optimization solving aspect of MPC; rather, it adjusts the input of the optimization problem. Terminal constraints are also employed as a safeguard in the construction of the optimization problem.
6.5. Reinforcement Learning Implementation Details
This work integrates DDPG and TD3 with the MPC algorithm to refine the prediction model via feedback correction; the implementation details are summarized here. Both DDPG and TD3 were set up with actor-critic architectures, using separate neural networks to approximate the policy and the value function.
The actor network contains two hidden layers of 500 neurons each with ReLU activation functions; the critic networks share a similar structure but additionally take the action as an input. TD3 uses two critic networks to reduce overestimation bias, together with smoothed target updates and delayed policy updates, to improve stability. The hyperparameters were a discount factor of 0.99, a batch size of 128, and learning rates of 1 × 10−3 for both the actor and critic networks.
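A PyTorch-style sketch of networks with this architecture (two hidden layers of 500 ReLU units, with the action appended to the critic input) is given below; the tanh output squashing and the state/action dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, action_dim), nn.Tanh(),   # deterministic action in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 500), nn.ReLU(),  # action joins the input
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, 1),                                   # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Optimizers with the reported learning rate of 1e-3:
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)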
The RL agents were trained with a reward function that penalizes action error (on velocity and yaw rate) and space error (on X-Y position tracking). The RL output is incorporated in real time to modify the MPC prediction model: the feedback correction term is multiplied by a scaling factor derived from the TD3 policy output. Two configuration paradigms, decoupled (offline policy application) and continuous (online policy updates), were studied; the former performed better for real-time control because of its reduced computational budget.
7. Experiments and Results
7.1. Simulation Experiment
This section pertains to the software simulation component. Two common trajectories are utilized to evaluate the error performance of the test system with different MPC controllers, under conditions of a 20-s duration and a sampling time of 0.01 s. The Integral of Time and Absolute Error (ITAE) is utilized to assess the control effectiveness at the overall level. The formula for calculating ITAE is as follows:
In addition, the average error is computed as the Mean Absolute Error (MAE, also referred to as average absolute error, AAE):
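The standard definitions consistent with this description, for a tracking error e(t) over a run of duration T sampled at N points, are

\[
\mathrm{ITAE}=\int_{0}^{T} t\,\lvert e(t)\rvert\,dt,
\qquad
\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\lvert e_i\rvert .
\]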
ITAE is typically employed to evaluate the overall operational error of a system throughout its entire operational process, as it utilizes time integration. Similarly, MAE is also used to assess overall performance during the operational process; however, it is highly sensitive to extreme values, and overshoots at specific points can be observed in graphical representations.
7.1.1. Circular Trajectory
The circular trajectory does not incorporate artificial white noise, representing the most fundamental form of motion trajectory, as illustrated in Figure 6a.
In the absence of any interference, Figure 7a,c depict the evolution of the ITAE values over time for the X-axis and Y-axis coordinates of the car model during the tracking process, and Figure 7b,d present the trends in the associated MAE. Table 4 presents the simulation results for the circle trajectory.
Regarding the X-axis, the absence of significant instantaneous changes results in a generally smooth process. The ITAE performance of the two adjusted systems surpasses that of the original MPC system. This trend is also observed in the MAE performance, although the MAE of TD3 is slightly higher than that of DDPG. Notably, TD3 demonstrates better handling of the change point, as evidenced by its lower peak at this juncture, despite its slower convergence.
For the Y-axis, the traditional MPC exhibits strong initial performance owing to the stability of its modeling; however, compared with the two RL-adjusted systems, its weak response characteristic becomes evident. In terms of MAE, the errors of the three models show a clear temporal progression, highlighting the superiority of the TD3 method.
7.1.2. Curve Trajectory
In the curve trajectory experiment, the actuators of all systems were subjected to the same type of white noise with a noise power of 1 in order to test curve-trajectory tracking under disturbance. The performance is illustrated in Figure 6b. Unlike the circular trajectory, this trajectory is aligned along the X-axis, and all three predictive models maintained stability with no abnormal solutions. However, as shown in Figure 8a,b, due to the inherent lag of the integrated system structure and fitting issues during model training, the new method is somewhat limited, performing slightly worse than the traditional method in this simpler environment. Table 5 presents the simulation results for the curve trajectory.
The aggressiveness of DDPG’s policy demonstrates a high level of adaptability in moderately stable environments. However, this performance advantage remains within an acceptable range and is not the primary concern in the anticipated application scenarios; it also involves complex rolling optimization issues and is not universally applicable. It is therefore noted only as a point of consideration, to be reexamined in future work.
Along the Y-axis, under the combined influence of slight trajectory variations and external disturbances, the new method demonstrates excellent performance in both ITAE and MAE, indicating that it effectively achieves the intended objectives. Moreover, the offline (decoupled) configuration clearly exhibits a faster convergence rate.
The first two experiments also warrant discussion of computational cost. Adding runtime per control cycle, memory usage, and CPU load metrics strengthens the evaluation of the proposed hybrid RL-MPC framework by demonstrating the real-time advantages of the decoupled configurations (Dec-TD3 and Dec-DDPG). These configurations exhibit significantly lower runtime (approximately 8–9 ms), reduced memory usage, and lower CPU load (approximately 14–16%), making them more predictable, resource-efficient, and suitable for embedded systems with strict timing constraints. In contrast, the continuous configurations (Con-TD3 and Con-DDPG) require more computational resources, which may hinder real-time responsiveness. Overall, these metrics validate the decoupled paradigm's practicality for real-time control in embedded and industrial robotic applications.
7.1.3. Curve Trajectory for Different Configurations
This experiment employs a hybrid curved trajectory to demonstrate that the decoupled paradigm outperforms the continuous paradigm in addressing this problem. The MPC controller is calibrated using both the decoupled and the continuous methods, giving a total of four adjustment methods with DDPG and TD3. The curved trajectory is chosen because it combines a straight segment along the X-axis with significant time variation along the Y-axis. Figure 9 illustrates the tracking of complex trajectories under the different configuration paradigms.
Figure 10 provides the details. For the X-axis, the decoupled paradigm performs excellently in smooth linear tracking; its mean absolute error (MAE) at specific points is not necessarily superior, but thanks to its offline nature and fast computation its convergence is notably quick. Along the Y-axis, the decoupled mode completely dominates in curve tracking, confirming the adaptability of the offline method. As the comparison in Table 6 illustrates, the decoupled paradigm generally outperforms the continuous paradigm. This superiority arises because the decoupled paradigm refrains from policy optimization and environment interaction once the configuration phase concludes. In control systems with stringent real-time requirements, particularly when MPC tends to respond slowly, the decoupled paradigm effectively meets these demands. It should be acknowledged, however, that during stable periods the continuous paradigm, which keeps optimizing, also demonstrates advantages; this issue warrants careful consideration.
The comparison demonstrated that the continuous paradigm is more suitable for the DDPG method, whereas the decoupled paradigm is more appropriate for the TD3 method.
The proposed partially integrated RL-MPC framework, in its TD3-based decoupled configuration, improves ITAE by roughly 40 percent over conventional MPC in the trajectory tracking tasks. Several statistical aspects would nevertheless strengthen this claim. First, standard deviations or confidence intervals for the performance metrics are not reported, so the variability of the results across trials remains uncertain. Second, no formal statistical significance tests, such as a paired t-test or Wilcoxon signed-rank test, establish the significance of the improvements (e.g., p < 0.05). Lastly, the number of experimental trials or repeated runs used to compute the averaged performance metrics is unspecified, which limits what can be said about the strength and generalizability of the findings; if the results were based on only three to five independent trials, for example, one could not be confident that the demonstrated improvement generalizes across conditions. Reporting variability, significance tests, and repetition counts would make the comparison between the decoupled paradigm and standard MPC considerably more robust.
7.2. Real-Time Experiment
The experimental setup selected a two-wheel differential mobile robot similar to the one used in modeling.
Figure 12 presents the real-time results of the experiment.
In the outer-loop control experiment for trajectory tracking conducted with the actual robot, the ROS/ROS2 system was utilized to facilitate interaction with the robot. To ensure the feasibility and safety of real-time modifications, the implementation was restricted to specific equipment, and the control system was not fully integrated into the small car prototype. As in the simulation experiments, the outer-loop control experiment employed a circular trajectory, here lasting 1 min (60 s).
For the selection of the inner loop speed controller, the built-in PID controller was directly utilized. Although the simulation experiment was conducted under more ideal conditions than real-time scenarios, the results indicate that the actual output trajectory and error performance were stable, with no significant overshoot observed.
The various metrics obtained in this soft real-time simulation are presented in Table 7.
The experiment was performed under the same hardware settings, and the new method demonstrated good tracking performance, with ITAE and MAE outperforming the traditional MPC control group on average, and with TD3 also achieving an average improvement over DDPG. However, the new method did not exhibit superior performance during the initial tracking phase and even performed poorly at some extreme positions. Maintaining good control performance during these critical moments will be the focus of future research.
7.3. Tracking Error Spike Analysis
Tracking errors under extreme conditions are compared by plotting the Mean Absolute Error (MAE) of all test trajectories. The visualization makes it clear that spikes in tracking error appear at certain instants, especially during sharp turns and initial stabilization.
Figure 13 shows the tracking error spike analysis. These distinct spikes occur mostly in the early transient stage and during sharp trajectory modifications. The Y-axis exhibits more frequent and higher-amplitude spikes, since its magnitude is heavily dictated by the sensitivity to orientation error in the mobile robot kinematics. This means that while the system is effective under steady-state conditions, it struggles with dynamic changes caused by sudden adjustments of the reference trajectory.
This indicates that the feedback correction mechanism may be rather conservative during the initial stages, which delays the response to large tracking errors. In addition, the use of a fixed prediction model that does not adapt online can produce unsuitable corrections in such cases.
The error spikes during sharp transitions in the trajectory indicate that the system is unable to respond to sudden changes in the environment or command, especially in the initial phase of stabilization, when it tends to lag behind the reference path. The cause for this behavior may be due to (i) feedback being delayed as a function of past observations, (ii) conservative corrective action of the RL agent during earlier stages of training, and (iii) limitations of the fixed prediction horizon in the MPC, as it may not account for some long-term changes in the trajectory during that time.
7.4. Deployment Challenges and Real-World Considerations
When moved from simulation to real-world deployment, the hybrid RL-MPC system faced practical issues typical of robotic environments. For example, sensor noise during sudden maneuvers compromised state estimation even though the sensors had high resolution, requiring complementary filtering that could be further improved with Kalman filters.
Controller-actuator desynchronization due to communication latencies in ROS/ROS2 was partly solved with time stamping and prioritized scheduling. Hardware and OS constraints made maintaining the timing of the real-time control loop less straightforward, necessitating changes in the control frequency and buffering of states. Moreover, the out-of-the-box PID settings were insufficient and needed fine-tuning, since the RL-MPC outputs introduced dynamics that the default settings could not accommodate.
The model-to-reality gap also manifested through differences between the simulated and physical models, such as wheel slip and unmodeled dynamics, and was partially addressed with online feedback adaptation. All of these indicate the necessity of effective sensor integration, real-time system design, and integration-aware control strategies when learning-based robotic controllers are deployed.
7.5. Ablation Study
In order to fully validate the efficacy of the proposed partially integrated RL-MPC framework, we performed an ablation study contrasting three control schemes: (i) MPC-only, which uses purely model-based prediction and optimization; (ii) RL-only, where a deep reinforcement learning agent produces control actions directly without model-based prediction; and (iii) the hybrid RL-MPC, which uses reinforcement learning to adaptively correct the MPC predictions.
An important objective of this study is to decompose the contributions of each component and show how the partial-integration architecture strikes a compromise between adaptation and stability. The configurations were quantitatively evaluated on tracking accuracy, control effort, and computational efficiency, and the outcomes are summarized in Table 8. Arrows ↓ and ↑ denote a decrease or an increase relative to the initial control group, and the better experimental groups and their best indicators are highlighted.
The hybrid RL-MPC method performs better across most evaluation metrics, most importantly tracking accuracy, yielding a 32% reduction in root mean square error and a 22% reduction in ITAE compared with MPC-only. RL-only (TD3), although adaptive, exhibits more variance during fast trajectory changes because it lacks model-based prediction. This confirms the promise of combining the adaptability of RL with the predictive capability of MPC for smoother and more accurate trajectory following.
In terms of control effort, RL-only (TD3) incurs the lowest aggregate actuator expenditure, since the agent learns to minimize input magnitude under energy constraints; however, this often comes at the cost of transient performance and stability in difficult environments. The hybrid RL-MPC operates with practically identical energy efficiency while greatly improving tracking reliability, showing how well it balances precision and actuator workload.
MPC-only, with an average runtime of 7.1 ms per cycle, remains the fastest owing to its deterministic structure and absence of neural network evaluations. RL-only is computationally expensive at 15.4 ms per cycle for policy inference without an optimized model, while the hybrid RL-MPC strikes a balance at 8.2 ms per cycle, making it suitable for real-time applications. On the stability side, both MPC-only and the hybrid RL-MPC retain large margins, whereas RL-only shows a small phase margin (28.6°), suggesting that its black-box nature and inability to handle constraints can lead to instability. This further underscores the advantage of the hybrid framework in sustaining robustness while enabling adaptive learning in dynamic environments.
7.6. Quantitative Performance for Hybrid RL-MPC
To evaluate the proposed hybrid RL-MPC framework more thoroughly, quantitative performance metrics beyond ITAE and MAE are used: Root Mean Square Error (RMSE) for the accuracy of tracking the reference trajectory, cumulative control effort for actuator workload and energy usage, and gain and phase stability margins to quantify robustness against perturbations. These additional metrics complement the results already presented and provide insight into the trade-offs of each configuration paradigm with respect to tracking accuracy, energy consumption, and robustness. Comparative results for traditional MPC, DDPG-based configurations, and TD3-based configurations under the decoupled and continuous paradigms are summarized in Table 9. As before, arrows ↓ and ↑ denote a decrease or an increase relative to the initial control group, and the better experimental groups and their best indicators are highlighted.
The analysis of tracking accuracy through RMSE shows that the TD3-based decoupled configuration outperforms all others tested, with a 32.2% reduction in RMSE compared with classical MPC and a 7.2% improvement over Dec-DDPG, indicating closer adherence to the reference trajectory during motion and under disturbances. Con-DDPG copes moderately well in stable circumstances but shows a larger error variance in noisy environments due to instability in its online policy updates.
Regarding the control effort, Dec-TD3 is the most energy-efficient configuration and brings about a 13.7% reduction in cumulative actuator workload with respect to MPC and 17.8% with respect to Con-TD3, implying that it renders smooth and less aggressive control actions that reduce mechanical stress and increase its attractiveness for power-constrained embedded systems. On the contrary, continuous paradigms, due to continual online updates, demand higher control effort.
Stability margin evaluation further confirms the superiority of the decoupled TD3 approach, with a 20% improvement in gain margin and a 15% increase in phase margin over MPC, affirming its robustness against model mismatch and external disturbances. The continuous RL paradigms, especially Con-DDPG, showed reduced stability margins and were therefore more prone to destabilization during real-time adaptation, emphasizing the advantage of the hybrid decoupled RL-MPC framework in balancing robustness with adaptive performance in real time.