1. Introduction
Model Predictive Control (MPC) has proved to be a highly effective control strategy that uses a dynamic model of the system to predict its future behavior and to optimize control commands in real time. At every sampling instant, the method solves a constrained finite-horizon optimization problem so that the optimal inputs respect operational constraints such as actuator limits, safety boundaries, and physical feasibility.
The flexibility of MPC in handling multivariable systems and in embedding constraints directly into the control formulation makes it applicable in diverse fields, including chemical process control, automotive systems, aerospace engineering, and robotics [1].
The ability to react dynamically to disturbances or model inaccuracies makes MPC important for maintaining product quality in industrial settings [1]. In robotics, and especially in manipulator control, image-based visual servoing (IBVS) successfully applies MPC as a constrained optimization formulation that casts the nonlinear control problem in a way that incorporates geometric and kinematic constraints [2]. Despite this, MPC is highly dependent on the predictive model on which it operates: model mismatch, unmodeled dynamics, or external uncertainty can degrade control performance when the modeling technique lacks robustness or adaptivity.
Reinforcement learning (RL), meanwhile, is a promising area of machine learning whose approach differs markedly from classical methods of enhancing traditional control strategies. Unlike classical control techniques that rely primarily on imposed rules or preset tuning processes, RL enables autonomous agents to learn optimal decision-making policies through interaction with an environment. The agent receives feedback via scalar rewards or penalties and shapes its strategy to maximize cumulative reward over time. RL has yielded excellent results in domains such as autonomous driving, robotic manipulation, and game playing, where modeling precision and adaptability are critical [3,4].
Recent advances in deep reinforcement learning (DRL), especially actor-critic architectures such as Deep Deterministic Policy Gradient (DDPG) and its improved variant Twin Delayed Deep Deterministic Policy Gradient (TD3), have brought great success in training effective continuous control policies. These algorithms combine deep networks with off-policy learning and target networks to stabilize training and improve convergence. A multi-objective RL paradigm has also been proposed that uses multi-head critics to decompose composite reward signals, enabling the learning of complex robotic tasks with multiple criteria [5].
Such approaches enable both learning efficiency and improved overall performance in high-dimensional, uncertain environments. Integrating RL into traditional control frameworks is therefore attracting increasing attention in work on hybrid control systems, and many hybrid schemes combine RL with Proportional-Integral-Derivative (PID) control or MPC.
For example, several researchers have used actor-critic RL algorithms to estimate and update PID parameters on-the-fly, improving robustness and response in uncertain environments [6]. Similarly, DRL has been used to solve multivariable coupled control problems by designing custom reward functions and control structures that guide learning toward optimal control strategies [7,8]. Interestingly, some studies even invert the use of PID concepts in deep RL, integrating PID-like mechanisms into the encoder architecture to improve feature extraction and learning stability [9].
Among control paradigms, the combination of RL and MPC is particularly favored because of their complementary advantages: MPC specializes in constraint handling and short-term planning, while RL affords learning and long-term adaptability. Together they enable intelligent control systems that adapt to changing dynamics and optimize performance over much longer horizons. This synergy has brought improvements to wheeled robot navigation, where coordinated RL-MPC systems show better disturbance rejection and tracking precision [10], and to large-scale networked systems, where RL-enhanced MPC ensures effective link-level path guidance [11]. In addition, distributed MPC approaches, bolstered by multi-agent RL function approximation, appear to be a promising way of tackling the twin challenges of coordination and scalability in such complex systems [12].
Nevertheless, amidst these advances, the integration of RL into classical MPC remains an arduous task. One main challenge is the accuracy of model predictions: since MPC relies on a predictive model, substituting it with a black-box RL mechanism may prove detrimental to interpretability and can introduce additional sources of instability. Even where online learning is feasible, the real-time nature of MPC subjects the controller to rigid timing limits. An integration approach must therefore strike a balance, preserving the inherent structural advantages of MPC while exploiting the flexibility offered by RL.
In our proposed partially integrated RL-MPC framework, the RL agent does not take over the predictive model entirely; rather, it cooperates with it in a prescribed manner. More specifically, the RL agent improves the output of the prediction model through feedback correction, making real-time model adaptation possible while preserving physical interpretability. The effect of two widely used DRL algorithms, DDPG and TD3, on control performance was evaluated within this framework. TD3 and DDPG were chosen among deep reinforcement learning algorithms because they fit the requirements of continuous control tasks, provide deterministic policy outputs, and run within real-time model predictive control frameworks. These properties keep the feedback correction within the MPC loop stable and interpretable.
The superior performance of TD3 over DDPG in the proposed hybrid RL-MPC framework can be attributed to design innovations that target limitations typically inherent in DDPG. TD3 employs a dual critic network structure that reduces Q-function overestimation during policy updates by taking the minimum of the two critic estimates; this conservative Q-value estimation mitigates the overestimation bias that can destabilize DDPG in continuous control tasks. TD3 also delays policy updates so that the critic networks have a chance to converge before the actor is updated, which promotes stability in the learning process. Finally, target policy smoothing prevents the policy from overfitting to sharp peaks in the estimated Q-function landscape. These features are all the more beneficial for the partially integrated RL-MPC system, where stability is a requirement and erratic perturbative actions could otherwise destabilize real-time trajectory tracking in a dynamic environment. Such algorithmic modifications explain the superiority of TD3 observed in both simulations and real-world experiments.
Furthermore, this study provides a systematic comparison between two configuration paradigms: decoupled, in which the RL agent is trained independently and only interacts with the MPC module during execution, and continuous, in which the RL agent keeps learning and updating its policy alongside the MPC in a tightly coupled loop.
Through simulation-based experiments on trajectory tracking and real-world tests on speed and direction control, we show that a TD3-based decoupled configuration achieves superior control performance compared with standard predictive models. We analyze the trade-offs between the two paradigms, with an emphasis on the potential of the continuous configuration in particular experimental settings. The main contributions of this work are summarized as follows:
- Partial Integration Framework: A novel hybrid architecture that integrates reinforcement learning with model predictive control without entirely replacing the predictive model, thus retaining interpretability and allowing for adaptive refinement using feedback correction.
- Comparative Analysis of DRL Algorithms: An in-depth comparative analysis of the DDPG and TD3 algorithms adopted in the framework, looking primarily at computational efficiency and control performance.
- Comparison of Configuration Paradigms: A systematic investigation of the decoupled versus continuous integration paradigms, providing perspectives on their relative strengths and appropriate deployment scenarios.
The choice of TD3 and DDPG for augmenting the predictive model within the MPC loop deserves explicit justification. Compared with other RL algorithms such as SAC, PPO, or A3C, TD3 and DDPG fit the requirements of this application: deterministic policies, compatibility with continuous action spaces, stability, real-time performance, and easy integration with MPC in safety-critical, embedded control systems. The alternatives, despite strengths in exploration and data efficiency, are stochastic, more resource-intensive, or less predictable under the conditions considered here, which makes TD3 and DDPG the appropriate match for the system requirements.
2. Literature Review
Although coupling Reinforcement Learning (RL) with Model Predictive Control (MPC) promises improved control performance under uncertainty, it also raises challenges that must be addressed to make real-world application feasible. A primary limitation of traditional RL-based MPC (RL-MPC) frameworks is their heavy computational requirements and slow adaptation: unlike classical MPC with deterministic or well-defined stochastic models, RL-MPC often requires extensive exploration of the state-action space to learn effective control policies [13]. The time and resources this learning demands can become prohibitive, especially in dynamic environments where fast decision making is essential. Interaction with the environment not only affects convergence speed but can also raise serious safety and feasibility concerns during early learning.
As the complexity of the environment increases, for example under high dimensionality or partial observability, the sampling efficiency of RL algorithms deteriorates rapidly. In these cases the agent fails to collect informative experiences efficiently, the available data become limited, and policy updates are poor [14]. The problem is sharper still in networked or multi-agent control systems, where coordinating decision making across entities adds another layer of complexity. As a result, many RL-MPC implementations experience difficulties in convergence, particularly in real-time control tasks affected by delays or inaccurate information, which can lead to system instability or performance degradation.
The increased representational power gained by introducing deep learning into RL-MPC systems also brings complications. Model approximation with deep neural networks (DNNs) is prone to inaccuracies caused by overfitting, generalization errors, or instabilities during training, and these inaccuracies tend to propagate through the MPC loop, potentially reducing prediction accuracy and control reliability. Furthermore, the black-box character of deep networks hinders interpretation, making it difficult to diagnose failures or guarantee robustness in safety-critical applications [15].
To overcome these limitations, hybrid models have been researched that combine the advantages of model-based control with the adaptability of data-driven learning. One of the most prominent directions uses Recurrent Neural Networks (RNNs), and specifically Long Short-Term Memory (LSTM) networks, to capture temporal dependencies in sequential control tasks. Mahya et al. [16] showed that coupling LSTM modules into an RL-MPC framework improves the system's capability to model long-term dynamics and enhances prediction accuracy in time-series control. Encoding historical state information also supplies the RL agent with richer contextual cues for decision making, improving both learning efficiency and control performance.
A further promising avenue employs Hierarchical Reinforcement Learning (HRL) to decompose complex control tasks into simpler submodules. HRL structures the learning process into hierarchical levels in which a high-level policy defines goals or abstract actions for a low-level controller to realize as specific motor commands or control signals [17]. This hierarchical decomposition aids learning scalability and knowledge transfer. In robotic manipulation or autonomous navigation, HRL allows agents to learn high-level strategies such as path planning and obstacle avoidance together with low-level policies for trajectory execution and fine control.
Beyond architectural novelties, the last decade has also strengthened RL-MPC systems through enhancements in neural network design. Graph neural networks (GNNs), transformers, attention-based architectures, and their derivatives allow for better modeling of agent relations, more efficient handling of long-range dependencies, and appropriate processing of structured data [18]. This is particularly relevant for distributed control among agents, whether applied to MPC frameworks or to environments with spatial and temporal dynamics. A GNN can model the physical interactions among the components of a mechanical system, while a transformer supports parallel processing of sequential inputs, accelerating the training and inference of both RL and MPC.
Another motivation is to reduce the burden of large-scale data collection and to accelerate convergence in novel environments, which has led to the consideration of meta-learning and transfer approaches. By transferring knowledge from previously encountered environments, these methods allow reinforcement learning agents to reuse what worked before rather than learning from scratch.
RL-MPC thus remains a very active research front: one line of work addresses computational efficiency, stability, and adaptiveness, while others explore hybrid architectures, domain-specific adaptations, and deep learning methods that remain faithful to the salient features of classical control theory while drawing on the adaptive options of modern machine learning.
4. Model Predictive Control: Principles and Workflow
Model Predictive Control (MPC) is a widely utilized optimization-based control strategy that computes control actions by predicting future system behavior over a finite horizon. Its ability to handle multi-objective optimization while respecting state and input constraints makes it highly effective in applications such as robotics, process control, and autonomous systems [23,24].
4.1. Structure of MPC
MPC’s structure comprises three core components [24], as Figure 2 shows:
(1) Predictive Model: The system model predicts future states based on control actions. Nonlinear and data-driven models, such as neural networks, are increasingly used to enhance prediction accuracy in complex systems [25].
(2) Objective Function: The objective function encapsulates the control objectives, balancing system performance and energy efficiency. This part is also called cost calculation.
(3) Feedback Correction: A dynamic correction term compensates for prediction errors, improving robustness and ensuring accurate trajectory tracking.
In the MPC prediction step, the kinematic model of the wheeled mobile robot (WMR) is employed as the predictive dynamic model. This model considers the WMR’s motion constraints, including the non-holonomic constraint that prevents lateral slipping, and assumes operation in low-speed conditions to maintain model accuracy. The robot state vector comprises the position coordinates (x, y), the orientation defined by the angle θ, the linear velocity v, and the angular velocity ω.
The control input vector comprises the linear velocity v and the angular velocity ω, under constraints in which physical feasibility and safety supersede other design considerations. State constraints place bounds on the operational workspace, while input constraints limit v and ω according to the actuator limits and avoid jerky maneuvers. Rate-of-change constraints on v and ω are also imposed to ensure smooth control actions, minimizing excessive jolt and improving stability for trajectory tracking.
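For illustration, the input and rate-of-change constraints take the following generic form (the bound symbols are placeholders; the numerical values used in the experiments are not restated here):

\[
v_{\min}\le v\le v_{\max},\qquad \omega_{\min}\le \omega\le \omega_{\max},\qquad
|\Delta v|\le \Delta v_{\max},\qquad |\Delta\omega|\le \Delta\omega_{\max}.
\]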
4.2. Principles of MPC
MPC operates on the principle of receding horizon optimization. At each time step, it solves an optimization problem based on a system model to predict future states and compute a sequence of control actions. Only the first action is applied, and the process is repeated at subsequent steps to incorporate feedback and account for disturbances.
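A minimal Python sketch of this receding-horizon loop is given below; the function names solve_finite_horizon_problem and plant_step are placeholders rather than the implementation used in this work.

import numpy as np

def receding_horizon_control(x0, x_ref, n_steps, solve_finite_horizon_problem, plant_step):
    """Generic MPC loop: optimize over the horizon, apply only the first input, repeat."""
    x = np.asarray(x0, dtype=float)
    applied_inputs = []
    for t in range(n_steps):
        # Solve the constrained finite-horizon problem from the current state;
        # the solver returns the whole optimal input sequence over the horizon.
        u_sequence = solve_finite_horizon_problem(x, x_ref[t:])
        u0 = u_sequence[0]            # only the first action is applied
        x = plant_step(x, u0)         # feedback: the measured next state re-enters the loop
        applied_inputs.append(u0)
    return np.array(applied_inputs)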
Because acquiring exact feedback of the state variables is inherently complex, this study employs an optimization formulation, the cost function of Equation (18), in which a relaxation factor is introduced together with an upper limit on its value. The prediction time horizon is denoted N_p and the control time horizon N_c.
The core objective is to drive the state tracking error to zero; a second term penalizes the error between successive manipulated variables; and a terminal-state error term drives the terminal error to zero, which helps ensure that the system can be stabilized [23]. The relaxation factor prevents infeasible solutions. The length of the predicted sequence is referred to as the prediction time horizon N_p.
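The exact notation of the original formulation is not recoverable from this text; a standard MPC cost with a slack (relaxation) variable, consistent with the description above, is

\[
\min_{\Delta U,\,\varepsilon}\;
\sum_{i=1}^{N_p}\bigl\lVert x(k+i\mid k)-x_{\mathrm{ref}}(k+i)\bigr\rVert_Q^2
+\sum_{i=0}^{N_c-1}\bigl\lVert \Delta u(k+i)\bigr\rVert_R^2
+\bigl\lVert x(k+N_p\mid k)-x_{\mathrm{ref}}(k+N_p)\bigr\rVert_P^2
+\rho\,\varepsilon^2,
\qquad 0\le \varepsilon\le \varepsilon_{\max},
\]

where Q, R, and P are weighting matrices, ρ penalizes the slack ε, and ε_max is its upper limit (these symbol names are introduced here for illustration).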
4.2.1. Feedback Correction for Improved Accuracy
To enhance trajectory tracking and adapt to unmodeled dynamics, MPC is augmented with a feedback correction term that adjusts the control input based on observed trajectory deviations. The trajectory error e_t is the difference between the observed output y_t and the predicted output ŷ_t; the resulting feedback term Δu_t is dynamically updated at each time step, ensuring real-time adaptability to system disturbances and uncertainties.
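A compact way to write this correction, using K_f as a hypothetical name for the feedback gain matrix mentioned in Section 6.1, is

\[
e_t = y_t - \hat{y}_t,\qquad \Delta u_t = K_f\,e_t,
\]

where y_t is the observed output and ŷ_t the predicted output; in the hybrid scheme of Section 6, Δu_t is additionally scaled by the RL policy output πϕ.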
The entire prediction process closely resembles that of a typhoon track forecasting system. Essentially, the future path of a typhoon is initially calculated using meteorological dynamics models and subsequently adjusted based on the typhoon’s current actual position. Depending on research requirements, this process can also be divided into coarse-tuning and fine-tuning sections. If influenced by other models or external adjustments, fine-tuning can significantly enhance the system’s effectiveness. Algorithm 1 shows the pseudo-code for applying feedback correction in the hybrid RL-MPC loop.
Algorithm 1: Pseudo-code for Applying Feedback Correction in the Hybrid RL-MPC Loop

# Initialize components
Initialize the MPC model with kinematic equations and constraints.
Initialize the RL agent (TD3 or DDPG) with a pre-trained actor network.
Set initial robot state s_0.

For each time step t, perform the following:
    # Step 1: Observe Current State
    s_t = get_current_state()                      # Includes [x, y, θ, v, ω] or similar
    # Step 2: Generate Nominal Prediction using MPC
    x_nom_t_plus_1 = mpc_predict(s_t)
    # Step 3: Compute Tracking Error
    e_t = compute_error(s_t, reference_trajectory)
    # Step 4: RL Agent Policy Evaluation
    πϕ = rl_actor_network.predict(s_t)             # Outputs scaling factor for correction
    # Step 5: Compute Feedback Correction Term
    Δu_t = compute_feedback_correction(e_t)
    # Step 6: Apply Scaled Feedback Correction to Prediction Model
    x_pred_t_plus_1 = x_nom_t_plus_1 + Δu_t * πϕ
    # Step 7: Solve MPC Optimization with Updated Prediction
    u_t = solve_mpc_optimization(x_pred_t_plus_1, constraints)
    # Step 8: Apply Control Input to Robot
    apply_control(u_t)
    # Step 9: (Optional) Update RL Agent if in Continuous Paradigm
    if continuous_mode:
        store_transition_in_replay_buffer(s_t, u_t, reward, s_t_plus_1)
        train_rl_agent()
end for
4.2.2. Constraints
The constraints are divided into hard constraints, soft constraints, and physical constraints.
Hard constraints require the manipulated variables and their errors to remain within the prescribed bounds at all times.
Soft constraints are those expressed in Equation (20), which the optimization drives toward a minimum.
Physical constraints include limits on rotation angle, speed change, and smoothness, and need to be adjusted according to the experimental conditions.
4.2.3. Rolling Optimization
Rolling optimization refers to the process of determining a set of optimal future inputs so that the output reaches a predetermined reference in a specified form.
This process constitutes a fundamental step in MPC. MPC inherently calculates a sequence of optimal future inputs, and the length of this sequence is referred to as the control time horizon, denoted N_c. The optimal input matrix represents the operational strategy for multiple future steps, but at any given moment the controller executes only the first step of this sequence.
The term ‘rolling’ indicates the transition to the next moment. Rather than executing the second step derived from the previous calculation, the controller engages in ‘output prediction’, ‘feedback correction’, and ‘solution optimization’ once more to derive a new set of optimal input sequences. Again, only the first step is executed.
Consequently, at each sampling time, the operational steps of MPC remain consistent: ‘Output prediction’ → ‘Feedback correction’ → ‘Solution optimization’ → ‘Execute the first step’.
4.3. Design of Rolling Optimization
The design of the rolling optimization module primarily involves the use of optimization algorithms to calculate the minimum value.
The goal is to transform the objective function of Equation (18) into the standard quadratic form:
4.3.1. Goal Function Transforming
The terminal error can be integrated into the quadratic cost; the term that does not depend on the decision variables is a constant and can be dropped from the optimization.
4.3.2. Constraint Transforming
The objective function and constraints are reformulated into the standard quadratic form, while the constraints are imposed on the decision variable (the control increment sequence).
4.3.3. Results Solving with Quadratic Programming (QP)
By solving the resulting quadratic program, the optimized input sequence can be obtained.
The control value acting on the system is taken from this sequence: following the rolling optimization approach, the MPC applies only the first control quantity of the solution at each step.
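A sketch of this step is shown below, under the assumption that the problem has already been condensed into the standard QP form min_z 0.5·zᵀHz + fᵀz subject to Az ≤ b; the matrices H, f, A, and b are placeholders for the condensed MPC matrices, and cvxpy is used here only for illustration.

import cvxpy as cp
import numpy as np

def solve_mpc_qp(H, f, A, b):
    """Solve min 0.5*z'Hz + f'z  s.t.  Az <= b and return the optimal sequence z."""
    z = cp.Variable(H.shape[0])
    cost = 0.5 * cp.quad_form(z, H) + f @ z          # H assumed positive semidefinite
    problem = cp.Problem(cp.Minimize(cost), [A @ z <= b])
    problem.solve()
    return z.value

# Tiny numerical example with two input increments over the control horizon:
H = np.diag([2.0, 2.0]); f = np.array([-1.0, -0.5])
A = np.vstack([np.eye(2), -np.eye(2)]); b = np.ones(4)   # |z_i| <= 1
z_opt = solve_mpc_qp(H, f, A, b)
u_first = z_opt[0]   # rolling optimization applies only the first control increment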
4.4. Process Workflow
The MPC workflow (see Figure 3) follows an iterative cycle:
(1) State Estimation: The current state is measured or estimated.
(2) Prediction: The predictive model computes future states over the prediction horizon N_p based on candidate control inputs.
(3) Optimization: The control sequence over the control horizon N_c is determined by minimizing the cost function under constraints (Equation (18)).
(4) Implementation: The first control action is applied to the system.
(5) Feedback Update: New state measurements and feedback corrections are used to repeat the process.
Part of the controller parameters can be set as in Table 1 below.
5. Method Choice of Reinforcement Learning
Twin Delayed Deep Deterministic Policy Gradient (TD3) [26] is an advanced reinforcement learning algorithm designed for continuous action spaces. It is based on the actor-critic framework and improves upon the Deep Deterministic Policy Gradient (DDPG) [27] by introducing several innovations to enhance stability and reduce overestimation bias in Q-value estimation. This section provides a detailed explanation of the principles, structure, and workflow of TD3.
Figure 4 illustrates the TD3 agent.
Structurally, TD3 incorporates an additional critic network compared to DDPG. Specifically, TD3 employs two critic networks to compute Q-values and selects the minimum of their outputs. This enhancement mitigates excessive bias in Q-value estimation, improving the algorithm’s stability. In TD3, the actor network is updated less frequently than the critic networks, giving the critics more time to optimize before each actor update and reducing the target deviation of the actor network. Furthermore, TD3 smooths the target policy by introducing noise during the calculation of the target Q-value to prevent policy overfitting.
5.1. Principles
TD3 operates within the framework of reinforcement learning, where the goal is to learn an optimal policy π that maximizes the expected cumulative reward:
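In standard notation (the paper's exact symbols are not shown here; γ is the discount factor, set to 0.99 in this study, see Section 6.5), this objective can be written as

\[
J(\pi)=\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right].
\]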
TD3 uses an actor-critic structure with two key components and one additional step:
(1) Actor: The actor network maps states to actions deterministically.
(2) Critic: Two critic networks estimate the state-action value function, which guides the actor’s updates.
(3) Output: The output is combined with noise to avoid overfitting.
The main principles behind TD3 include:
(1) Using two critic networks to reduce overestimation bias in Q-value estimation.
(2) Updating the actor network less frequently than the critic networks to ensure stable training.
(3) Smoothing the target policy to avoid overfitting to narrow Q-value peaks.
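A PyTorch-style sketch of these three mechanisms (clipped double-Q targets, target policy smoothing, delayed actor and target updates) is given below; the network, optimizer, and buffer objects as well as the hyperparameter values are illustrative assumptions, and actions are assumed normalized to [-1, 1].

import torch

def td3_update(batch, actor, actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
               critic_opt, actor_opt, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (actor_tgt(s2) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q learning: take the minimum of the two target critics
        q_target = r + gamma * (1 - done) * torch.min(critic1_tgt(s2, a2),
                                                      critic2_tgt(s2, a2))

    # Update both critics toward the shared target
    critic_loss = torch.nn.functional.mse_loss(critic1(s, a), q_target) + \
                  torch.nn.functional.mse_loss(critic2(s, a), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy update: update the actor (and target networks) every `policy_delay` steps
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, tgt in [(actor, actor_tgt), (critic1, critic1_tgt), (critic2, critic2_tgt)]:
            for p, p_tgt in zip(net.parameters(), tgt.parameters()):
                p_tgt.data.mul_(1 - tau).add_(tau * p.data)   # soft (Polyak) target update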
5.2. Structure
In each iteration k, the TD3 algorithm consists of the following components:
(1) Actor Network: A neural network, parameterized by ϕ, that outputs the deterministic action for a given state.
(2) Critic Networks: Two independent neural networks, parameterized by θ1 and θ2, that estimate the Q-values Qθ1(s, a) and Qθ2(s, a).
(3) Target Networks: Separate target actor and critic networks, parameterized by ϕ′, θ1′, and θ2′, used to compute stable targets during training.
(4) Replay Buffer: A buffer that stores transitions (s, a, r, s′) for off-policy training.
5.3. Reward Function Design
In this design, only the tracking error penalty and control input penalty are considered. The reward function is expressed as follows:
The components are defined as follows:
(1) Action Error Penalty: penalizes the deviation of the actual linear velocity v and yaw rate ω from their reference values v_ref and ω_ref, each term scaled by its own weighting factor.
(2) Space Error Penalty: penalizes the deviation of the actual X- and Y-coordinates from their reference coordinates, again with corresponding weight factors.
During the lane-changing trajectory tracking process, if the system demonstrates stable and rapid trajectory tracking, the variables will be relatively close to the expected actions and spatial positions. In this case, the reward value will approach the maximum value of 1. Conversely, if the system exhibits poor stability and slower speed during trajectory tracking, the variables will deviate significantly from the expected actions and spatial positions, resulting in the reward value approaching the minimum value of 0. Therefore, the reward value serves as an indicator of the control performance of the actual control system.
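The exact functional form of the reward is not restated here; one sketch consistent with a reward bounded in [0, 1] that penalizes action and space errors is given below (the weight values and the exponential shaping are illustrative assumptions, not the tuned values of this study).

import numpy as np

# Hypothetical weights; the tuned values are not reported here.
W_V, W_W, W_X, W_Y = 1.0, 1.0, 1.0, 1.0

def reward(v, w, x, y, v_ref, w_ref, x_ref, y_ref):
    """Reward in [0, 1]: 1 for perfect tracking, approaching 0 for large errors."""
    action_penalty = W_V * (v - v_ref) ** 2 + W_W * (w - w_ref) ** 2
    space_penalty = W_X * (x - x_ref) ** 2 + W_Y * (y - y_ref) ** 2
    return float(np.exp(-(action_penalty + space_penalty)))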
The performance of the hybrid RL-MPC system is evaluated primarily with ITAE and MAE, which reflect how well the system tracks but do not capture aspects such as trajectory smoothness, control effort, and reliability. Additional measures such as RMSE, success rate, control effort, and jerk index therefore give a broader picture of real-time behavior; RMSE, control effort, and stability margins are examined in Sections 7.5 and 7.6. Among the setups, the TD3-based decoupled arrangement handles transient disturbances better, with comparatively high success rates at lower energy use and mechanical stress. The reward-function weights were not tuned with a formally defined strategy; a systematic process combining baseline initialization, iterative trade-off-based adjustment, normalization, and optional automated optimization, balancing accuracy, smoothness, and efficiency, would greatly improve the transparency, reproducibility, and effectiveness of training. These evaluation-metric and reward-design considerations strengthen the methodological foundation of the study.
5.4. Workflow
The TD3 algorithm alternates between interacting with the environment and updating the networks based on sampled transitions. The detailed workflow is shown in Algorithm 2.
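A condensed sketch of this alternation between environment interaction and network updates is given below; the environment interface, actor.act, and the buffer class are illustrative assumptions, and td3_update refers to the update sketch in Section 5.1.

import numpy as np

def run_episode(env, actor, buffer, networks, max_steps=2000, expl_noise=0.1):
    # networks = (actor, actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
    #             critic_opt, actor_opt), matching the td3_update signature above.
    s = env.reset()
    for step in range(max_steps):
        # Select an action with exploration noise and interact with the environment
        a = np.clip(actor.act(s) + expl_noise * np.random.randn(env.action_dim),
                    -1.0, 1.0)
        s2, r, done = env.step(a)
        buffer.add(s, a, r, s2, done)

        # Update the networks from a sampled mini-batch (off-policy learning)
        if len(buffer) >= 128:
            td3_update(buffer.sample(batch_size=128), *networks, step=step)

        s = s2
        if done:
            break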
6. MPC Hybrid Model Architecture Integration with TD3 and DDPG
6.1. Integration Analysis
MPC is a promising approach for handling the lane-change trajectory tracking problem. In the MPC control process, the internal prediction model plays a crucial role in determining control performance. This prediction model supports predictive control by forecasting future control sequences; however, it is highly susceptible to external interference. Reinforcement learning, with its capacity to interact with the external environment, enhances the accuracy of the MPC prediction model and enables real-time reflection of the external environment. Many studies focus on correcting model outputs [28,29]. This paper takes a step further by directly correcting the prediction model itself; the RL component maintains a parallel relationship with the prediction model, which is referred to as partial integration.
To place the proposed RL-MPC framework in context, a comparative analysis of three integration paradigms is presented: full model replacement, feedback-only correction, and partial integration. Full model replacement substitutes the standard predictive model in MPC entirely with an RL component such as deep deterministic policy gradient (DDPG) or twin delayed deep deterministic policy gradient (TD3). These methodologies exploit RL’s capacity for adaptive learning to synthesize control policies without explicit system dynamics, which suits highly nonlinear and uncertain environments. However, complete reliance on RL raises interpretability, stability, and computational challenges owing to the inherently black-box nature of deep neural networks. Feedback-only correction, in contrast, uses RL to adjust the output of a fixed predictive model in the MPC loop, usually through some form of error compensation.
This keeps the basic MPC format while still allowing on-the-fly correction of disturbances or modeling inaccuracies. The method is computationally efficient and stable, but it cannot adapt the internal prediction model itself, which limits its long-term ability to recover. The proposed partial integration sits between these extremes: rather than fully replacing the predictive model or merely correcting its outputs, the RL agent works with the existing model to improve its online predictions through targeted feedback corrections. This hybrid approach lets classical MPC retain its benefits in physical interpretability and constraint handling while gaining model accuracy and robustness through RL adaptation. MPC performance is critically dependent on the fidelity of its predictive model, and any model-reality mismatch caused by unmodeled dynamics, external disturbances, or parameter variations can degrade control performance and even lead to instability. In the partially integrated RL-MPC framework, RL therefore corrects the model predictions of the MPC through feedback in order to eliminate the impact of model mismatch. The RL agent learns to compensate systematic errors and disturbances in real time by computing correction terms from the observed tracking errors, effectively aligning the imperfect model predictions with the actual system. As demonstrated through both simulations and real-world experiments, this combination allows the hybrid controller to maintain robustness and tracking performance in the presence of modeling inaccuracies.
Table 2 contrasts the three integration paradigms.
Partial integration, in contrast to full model replacement, which gives up model transparency in favor of flexibility, retains structural consistency while allowing adaptive refinement. Moreover, partial integration keeps the learned and modeled components separate, which preserves interpretability and analyzability.
This is extremely important in safety-critical applications, where explainability is crucial for validation and certification. Because the reinforcement learning (RL) agent in the partial integration scheme directly influences the prediction model, it smooths the estimation of future states and improves long-horizon planning. Unlike feedback-only correction, which changes only the final control input, this paradigm lets the RL agent shape the predicted future states and thereby improve plans over the long run.
According to the simulation results, the new paradigm improves trajectory tracking performance and reduces the effect of external disturbances and uncompensated dynamics. In the wheeled mobile robot (WMR) control tasks, the TD3-based decoupled configuration surpassed both the standard MPC and the DDPG-based continuous configuration in ITAE and MAE, especially in the more complex curve-tracking cases.
Other achievements include higher real-time responsiveness and reduced computational demand for the decoupled paradigm compared with the continuous paradigm, which makes the decoupled paradigm the better choice for real-time embedded control systems with stringent timing bounds. The decoupled paradigm is not itself partial integration but rather an implementation method within the partially integrated RL-MPC scheme; it emphasizes offline learning and fixed-policy application, benefiting real-time performance and stability, and is especially suited to resource-constrained robotic platforms. This demonstrates the strength of partial integration as a way to balance adaptability and fidelity in robotic systems: adaptable behavior without compromising reliability.
Partial integration in the proposed partially integrated RL-MPC architecture is quantitatively defined through a feedback correction mechanism that adapts the MPC prediction model in real time using an RL-generated scaling factor, via the corrective term Δu_t · πϕ, where Δu_t signifies the error-based correction term and πϕ denotes a state-dependent gain obtained from the TD3 or DDPG actor network. The tunable parameters in this framework include the learned weights of the RL policy πϕ, the feedback gain matrix used to compute Δu_t, which can be set manually or meta-optimized, and the reward function weights that govern the agent’s prioritization of action and space errors; together these parameters balance model interpretability against adaptive robustness.
This framework integrates MPC with TD3 and DDPG to adaptively optimize the prediction model, meaning that the nominal prediction can be corrected through more detailed calibration operations. This process enhances trajectory tracking accuracy and system robustness. Figure 5 shows the integrated control system.
6.2. Integration Method
After receiving the reinforcement-learning-based correction of the prediction model, the feedback correction module implements the adjustment of Equation (21), which is applied in Algorithm 2 as x_pred,t+1 = x_nom,t+1 + Δu_t · πϕ. In Equation (21), the parameter πϕ represents the adjustment decision derived from Equation (20). When πϕ is small, the feedback correction adjustment Δu_t · πϕ is also small, resulting in a longer adjustment time for the hybrid MPC system but a more stable adjustment process. Conversely, when πϕ is large, the feedback correction adjustment is also large, leading to a shorter adjustment time but a higher likelihood of unstable regulation. When πϕ equals 0, the controller refrains from making any adjustment, and when πϕ equals 1, the controller effectively functions as a traditional MPC controller. This step ensures the accuracy of the prediction model, bringing the predicted values closer to the actual values. Configuration details are given in Table 3.
The pseudocode for the integration of πϕ in the hybrid RL-MPC loop is illustrated in Algorithm 2.
Algorithm 2: The pseudocode for integration of πϕ in the hybrid RL-MPC loop

Initialize MPC model and TD3 agent (actor and critic networks)
Set initial state s_0

For each time step t, perform the following:
    # Step 1: State Observation
    Observe the current state s_t from the environment
    # Step 2: Nominal Prediction using MPC
    Predict nominal trajectory x_nom using the standard MPC model
    # Step 3: RL Agent Policy Evaluation
    Compute πϕ = μ(s_t|ϕ) + ε_t using the TD3 actor network
    # Step 4: Feedback Correction
    Compute Δu_t based on the tracking error e_t = x_t − x_ref_t
    # Step 5: Apply Scaled Correction
    Adjust prediction: x_pred_t_plus_1 = x_nom_t_plus_1 + Δu_t * πϕ
    # Step 6: Solve MPC Optimization with Updated Prediction
    Generate optimal control input u_t using updated prediction x_pred_t_plus_1
    # Step 7: Apply Control Input
    Execute u_t on the system
    # Step 8: Update RL Agent (if in training phase)
    Store transition (s_t, πϕ, r_t, s_{t+1}) in the replay buffer
    Sample a mini-batch and update the TD3 networks using the DDPG/TD3 update rules
end for
The term πϕ is crucial in the proposed hybrid RL-MPC system because it allows the RL agent to influence the prediction model directly rather than only the control output. Since πϕ is derived from the TD3 actor network, it enables more dynamic, data-driven adjustment of the internal MPC model, improving tracking and robustness. The pseudo-code clarifies the implementation steps and illustrates how πϕ connects model-based prediction with reinforcement-learning-based adaptation. Future work may look into different policy representations or adaptive scaling mechanisms to further enhance the interpretability and performance of πϕ across different control tasks.
6.3. Configuration Paradigm
During training and configuration, there are different configuration paradigms in reinforcement learning.
(1) Decoupled Paradigm: Training and deployment are separated. The agent interacts with the environment during training; once configuration is completed, no further policy updates are performed. In this paper, this is called the decoupled paradigm.
(2) Continuous Paradigm: Training and deployment are continuous. After initial training and configuration, the policy continues to be updated, exhibiting a high degree of continuity. In this paper, this is called the continuous paradigm.
Experiments on these two configuration paradigms with different reinforcement learning methods are presented in the following sections.
6.4. Stability Analysis
Guo [22] and Rawlings [23] have demonstrated system stability by constructing Lyapunov functions. However, the rolling optimization mechanism of MPC complicates the stability proof, because the optimal solution of the optimization problem at each sampling instant does not directly establish closed-loop stability. Researchers have therefore introduced terminal constraints: a Lyapunov equation is solved offline to obtain a terminal weighting matrix, and the system state is driven into a terminal set through finite-horizon control online, ensuring asymptotic stability. Such a control strategy serves as a relatively good approximation of infinite-horizon optimization [30]. In practice, MPC stability is typically cross-validated through theoretical analysis, simulation, and experimentation. For further clarity, the Lyapunov-based argument is illustrated below.
The proposed hybrid incorporates an RL agent in the feedback correction loop of the MPC without disturbing the core predictive model. This partial integration preserves the structural properties of the original MPC, in particular constraint handling and receding-horizon optimization. As shown in [23], asymptotic stability of standard MPC can be established using terminal constraints and a terminal cost defined from a Lyapunov function. In our case, since the RL component modifies the prediction model only through bounded corrections (see Equation (21)), the existing stability guarantees extend as follows. Let V(x) be a Lyapunov function candidate for the nominal MPC system satisfying the usual bounding and decrease inequalities; these inequalities are fulfilled because of the terminal constraints and cost matrix used in the rolling optimization. With the RL-enhanced prediction model, the closed-loop dynamics acquire an additional correction term introduced by the RL agent. Assuming this correction is bounded and diminishes over time (as observed throughout training and execution), the modified system remains confined to a compact invariant set in the vicinity of the equilibrium point. Hence, the perturbed system satisfies the nominal decrease condition up to a small positive bound ε_t that shrinks as the policy converges, which establishes practical asymptotic stability [30].
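The omitted relations follow the standard Lyapunov argument for MPC; a hedged reconstruction, with the comparison functions α_i, the correction term d_t, and the MPC control law κ_MPC introduced here only for illustration, is

\[
\alpha_1(\lVert x\rVert)\le V(x)\le \alpha_2(\lVert x\rVert),\qquad
V(x_{t+1})-V(x_t)\le -\alpha_3(\lVert x_t\rVert)
\]

for the nominal closed loop; with the RL-corrected prediction the dynamics become

\[
x_{t+1}=f\!\left(x_t,\kappa_{\mathrm{MPC}}(x_t)\right)+d_t,
\]

and boundedness of d_t yields

\[
V(x_{t+1})-V(x_t)\le -\alpha_3(\lVert x_t\rVert)+\epsilon_t .
\]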
In order to ensure bounded tracking error, we define the tracking error vector e_t = x_t − x_ref,t, where x_ref,t represents the reference trajectory. The dynamics of the tracking error are governed by both the MPC internal model and the RL-corrected predictions. By enforcing a terminal constraint that drives e_t into a terminal region, we ensure that the error remains bounded and converges to zero under mild assumptions on the smoothness of the RL correction function.
In summary, the method presented in this paper does not alter the logic of the optimization solving aspect of MPC; rather, it adjusts the input of the optimization problem. Terminal constraints are also employed as a safeguard in the construction of the optimization problem.
6.5. Reinforcement Learning Implementation Details
This work integrates DDPG and TD3 with the MPC algorithm to refine the prediction model via feedback correction; the implementation details are summarized here. Both DDPG and TD3 were set up with actor-critic architectures, using separate neural networks to approximate the policy and the value function.
The actor network contains two hidden layers of 500 neurons each with ReLU activation functions; the critic networks share a similar structure but additionally take the action as an input. TD3 uses two critic networks to reduce overestimation bias, together with smoothed target updates and delayed policy updates, to improve stability. The hyperparameters were a discount factor of 0.99, a batch size of 128, and learning rates of 1 × 10−3 for both the actor and critic networks.
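A PyTorch-style sketch of networks with this architecture (two hidden layers of 500 ReLU units, with the action appended to the critic input) is given below; the tanh output squashing and the state/action dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, action_dim), nn.Tanh(),   # deterministic action in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 500), nn.ReLU(),  # action joins the input
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, 1),                                   # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Optimizers with the reported learning rate of 1e-3:
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)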
The RL agents were trained with a reward function that penalizes action error (on velocity and yaw rate) and space error (on X-Y position tracking). The RL output is incorporated in real time to modify the MPC prediction model: the feedback correction term is multiplied by a scaling factor derived from the TD3 policy output. Two configuration paradigms, decoupled (offline policy application) and continuous (online policy updates), were studied; the former performed better for real-time control because of its reduced computational budget.
7. Experiments and Results
7.1. Simulation Experiment
This section pertains to the software simulation component. Two common trajectories are utilized to evaluate the error performance of the test system with different MPC controllers, under conditions of a 20-s duration and a sampling time of 0.01 s. The Integral of Time and Absolute Error (ITAE) is utilized to assess the control effectiveness at the overall level. The formula for calculating ITAE is as follows:
In addition, the average error is computed as the Mean Absolute Error (MAE, also referred to as average absolute error, AAE):
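The standard definitions consistent with this description, for a tracking error e(t) over a run of duration T sampled at N points, are

\[
\mathrm{ITAE}=\int_{0}^{T} t\,\lvert e(t)\rvert\,dt,
\qquad
\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\lvert e_i\rvert .
\]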
ITAE is typically employed to evaluate the overall operational error of a system throughout its entire operational process, as it utilizes time integration. Similarly, MAE is also used to assess overall performance during the operational process; however, it is highly sensitive to extreme values, and overshoots at specific points can be observed in graphical representations.
7.1.1. Circular Trajectory
The circular trajectory does not incorporate artificial white noise, representing the most fundamental form of motion trajectory, as illustrated in Figure 6a.
In the absence of any interference, Figure 7a,c depict the evolution of the ITAE values over time for the X-axis and Y-axis coordinates of the car model during the tracking process, and Figure 7b,d present the trends in the associated MAE. Table 4 presents the simulation results for the circle trajectory.
Regarding the X-axis, the absence of significant instantaneous changes results in a generally smooth process. The ITAE performance of the two adjusted systems surpasses that of the original MPC system. This trend is also observed in the MAE performance, although the MAE of TD3 is slightly higher than that of DDPG. Notably, TD3 demonstrates better handling of the change point, as evidenced by its lower peak at this juncture, despite its slower convergence.
For the Y-axis, the traditional MPC exhibits strong initial performance owing to the stability of its modeling; however, compared with the two RL-adjusted systems, its weak response characteristic becomes evident. In terms of MAE, the errors of the three models show a clear temporal progression, highlighting the superiority of the TD3 method.
7.1.2. Curve Trajectory
In the curve trajectory experiment, the actuators of all systems were subjected to the same type of white noise with a noise power of 1 in order to test curve-trajectory tracking under disturbance. The performance is illustrated in Figure 6b. Unlike the circular trajectory, this trajectory is aligned along the X-axis, and all three predictive models maintained stability with no abnormal solutions. However, as shown in Figure 8a,b, due to the inherent lag of the integrated system structure and fitting issues during model training, the new method is somewhat limited, performing slightly worse than the traditional method in this simpler environment. Table 5 presents the simulation results for the curve trajectory.
The aggressiveness of DDPG’s policy demonstrates a high level of adaptability in moderately stable environments. However, this performance advantage remains within an acceptable range and is not the primary concern in the anticipated application scenarios; it also involves complex rolling optimization issues and is not universally applicable. It is therefore noted only as a point of consideration, to be reexamined in future work.
Along the Y-axis, under the combined influence of slight trajectory variations and external disturbances, the new method demonstrates excellent performance in both ITAE and MAE, indicating that it effectively achieves the intended objectives. Moreover, the offline (decoupled) configuration clearly exhibits a faster convergence rate.
The first two experiments also warrant discussion of computational cost. Adding runtime per control cycle, memory usage, and CPU load metrics strengthens the evaluation of the proposed hybrid RL-MPC framework by demonstrating the real-time advantages of the decoupled configurations (Dec-TD3 and Dec-DDPG). These configurations exhibit significantly lower runtime (approximately 8–9 ms), reduced memory usage, and lower CPU load (approximately 14–16%), making them more predictable, resource-efficient, and suitable for embedded systems with strict timing constraints. In contrast, the continuous configurations (Con-TD3 and Con-DDPG) require more computational resources, which may hinder real-time responsiveness. Overall, these metrics validate the decoupled paradigm's practicality for real-time control in embedded and industrial robotic applications.
7.1.3. Curve Trajectory for Different Configurations
This experiment employs a hybrid curved trajectory to demonstrate that the decoupled paradigm outperforms the continuous paradigm in addressing this problem. The MPC controller is calibrated using both the decoupled and the continuous methods, giving a total of four adjustment methods with DDPG and TD3. The curved trajectory is chosen because it combines a straight segment along the X-axis with significant time variation along the Y-axis. Figure 9 illustrates the tracking of complex trajectories under the different configuration paradigms.
Figure 10 provides the details. For the X-axis, the decoupled paradigm performs excellently in smooth linear tracking; its mean absolute error (MAE) at specific points is not necessarily superior, but thanks to its offline nature and fast computation its convergence is notably quick. Along the Y-axis, the decoupled mode completely dominates in curve tracking, confirming the adaptability of the offline method. As the comparison in Table 6 illustrates, the decoupled paradigm generally outperforms the continuous paradigm. This superiority arises because the decoupled paradigm refrains from policy optimization and environment interaction once the configuration phase concludes. In control systems with stringent real-time requirements, particularly when MPC tends to respond slowly, the decoupled paradigm effectively meets these demands. It should be acknowledged, however, that during stable periods the continuous paradigm, which keeps optimizing, also demonstrates advantages; this issue warrants careful consideration.
The comparison demonstrated that the continuous paradigm is more suitable for the DDPG method, whereas the decoupled paradigm is more appropriate for the TD3 method.
The proposed partially integrated RL-MPC framework, in its TD3-based decoupled configuration, improves ITAE by roughly 40 percent over conventional MPC in the trajectory tracking tasks. Several statistical aspects would nevertheless strengthen this claim. First, standard deviations or confidence intervals for the performance metrics are not reported, so the variability of the results across trials remains uncertain. Second, no formal statistical significance tests, such as a paired t-test or Wilcoxon signed-rank test, establish the significance of the improvements (e.g., p < 0.05). Lastly, the number of experimental trials or repeated runs used to compute the averaged performance metrics is unspecified, which limits what can be said about the strength and generalizability of the findings; if the results were based on only three to five independent trials, for example, one could not be confident that the demonstrated improvement generalizes across conditions. Reporting variability, significance tests, and repetition counts would make the comparison between the decoupled paradigm and standard MPC considerably more robust.
7.2. Real-Time Experiment
The experimental setup selected a two-wheel differential mobile robot similar to the one used in modeling.
Figure 12 presents the real-time results of the experiment.
In the outer-loop control experiment for trajectory tracking conducted with the actual robot, the ROS/ROS2 system was utilized to facilitate interaction with the robot. To ensure the feasibility and safety of real-time modifications, the implementation was restricted to specific equipment, and the control system was not fully integrated into the small car prototype. As in the simulation experiments, the outer-loop control experiment employed a circular trajectory, here lasting 1 min (60 s).
For the selection of the inner loop speed controller, the built-in PID controller was directly utilized. Although the simulation experiment was conducted under more ideal conditions than real-time scenarios, the results indicate that the actual output trajectory and error performance were stable, with no significant overshoot observed.
The various metrics obtained in this soft real-time simulation are presented in Table 7.
The experiment was performed under the same hardware settings, and the new method demonstrated good tracking performance, with ITAE and MAE outperforming the traditional MPC control group on average, and with TD3 also achieving an average improvement over DDPG. However, the new method did not exhibit superior performance during the initial tracking phase and even performed poorly at some extreme positions. Maintaining good control performance during these critical moments will be the focus of future research.
7.3. Tracking Error Spike Analysis
Tracking errors under extreme conditions are compared by plotting the Mean Absolute Error (MAE) of all test trajectories. The visualization makes it clear that spikes in tracking error appear at certain instants, especially during sharp turns and initial stabilization.
Figure 13 shows the tracking error spike analysis. These distinct spikes occur mostly in the early transient stage and during sharp trajectory modifications. The Y-axis exhibits more frequent and higher-amplitude spikes, since its magnitude is heavily dictated by the sensitivity to orientation error in the mobile robot kinematics. This means that while the system is effective under steady-state conditions, it struggles with dynamic changes caused by sudden adjustments of the reference trajectory.
This indicates that the feedback correction mechanism may be rather conservative during the initial stages, which delays the response to large tracking errors. In addition, the use of a fixed prediction model that does not adapt online can produce unsuitable corrections in such cases.
The error spikes during sharp transitions in the trajectory indicate that the system is unable to respond to sudden changes in the environment or command, especially in the initial phase of stabilization, when it tends to lag behind the reference path. The cause for this behavior may be due to (i) feedback being delayed as a function of past observations, (ii) conservative corrective action of the RL agent during earlier stages of training, and (iii) limitations of the fixed prediction horizon in the MPC, as it may not account for some long-term changes in the trajectory during that time.
7.4. Deployment Challenges and Real-World Considerations
When moved from simulation to real-world deployment, the hybrid RL-MPC system faced practical issues typical of robotic environments. For example, sensor noise during sudden maneuvers compromised state estimation even though the sensors had high resolution, requiring complementary filtering that could be further improved with Kalman filters.
Controller-actuator desynchronization due to communication latencies in ROS/ROS2 was partly solved with time stamping and prioritized scheduling. Hardware and OS constraints made maintaining the timing of the real-time control loop less straightforward, necessitating changes in the control frequency and buffering of states. Moreover, the out-of-the-box PID settings were insufficient and needed fine-tuning, since the RL-MPC outputs introduced dynamics that the default settings could not accommodate.
The model-to-reality gap also manifested through differences between the simulated and physical models, such as wheel slip and unmodeled dynamics, and was partially addressed with online feedback adaptation. All of these indicate the necessity of effective sensor integration, real-time system design, and integration-aware control strategies when learning-based robotic controllers are deployed.
7.5. Ablation Study
In order to fully validate the efficacy of the proposed partially integrated RL-MPC framework, we performed an ablation study contrasting three control schemes: (i) MPC-only, which uses purely model-based prediction and optimization; (ii) RL-only, where a deep reinforcement learning agent produces control actions directly without model-based prediction; and (iii) the hybrid RL-MPC, which uses reinforcement learning to adaptively correct the MPC predictions.
An important objective of this study is to decompose the contributions of each component and show how the partial-integration architecture strikes a compromise between adaptation and stability. The configurations were quantitatively evaluated on tracking accuracy, control effort, and computational efficiency, and the outcomes are summarized in Table 8. Arrows ↓ and ↑ denote a decrease or an increase relative to the initial control group, and the better experimental groups and their best indicators are highlighted.
The hybrid RL-MPC method performs better across most evaluation metrics, most importantly tracking accuracy, yielding a 32% reduction in root mean square error and a 22% reduction in ITAE compared with MPC-only. RL-only (TD3), although adaptive, exhibits more variance during fast trajectory changes because it lacks model-based prediction. This confirms the promise of combining the adaptability of RL with the predictive capability of MPC for smoother and more accurate trajectory following.
In terms of control effort, RL-only (TD3) incurs the lowest aggregate actuator expenditure, since the agent learns to minimize input magnitude under energy constraints; however, this often comes at the cost of transient performance and stability in difficult environments. The hybrid RL-MPC operates with practically identical energy efficiency while greatly improving tracking reliability, showing how well it balances precision and actuator workload.
MPC-only, with an average runtime of 7.1 ms per cycle, remains the fastest owing to its deterministic structure and absence of neural network evaluations. RL-only is computationally expensive at 15.4 ms per cycle for policy inference without an optimized model, while the hybrid RL-MPC strikes a balance at 8.2 ms per cycle, making it suitable for real-time applications. On the stability side, both MPC-only and the hybrid RL-MPC retain large margins, whereas RL-only shows a small phase margin (28.6°), suggesting that its black-box nature and inability to handle constraints can lead to instability. This further underscores the advantage of the hybrid framework in sustaining robustness while enabling adaptive learning in dynamic environments.
7.6. Quantitative Performance for Hybrid RL-MPC
To evaluate the proposed hybrid RL-MPC framework more thoroughly, quantitative performance metrics beyond ITAE and MAE are used: Root Mean Square Error (RMSE) for the accuracy of tracking the reference trajectory, cumulative control effort for actuator workload and energy usage, and gain and phase stability margins to quantify robustness against perturbations. These additional metrics complement the results already presented and provide insight into the trade-offs of each configuration paradigm with respect to tracking accuracy, energy consumption, and robustness. Comparative results for traditional MPC, DDPG-based configurations, and TD3-based configurations under the decoupled and continuous paradigms are summarized in Table 9. As before, arrows ↓ and ↑ denote a decrease or an increase relative to the initial control group, and the better experimental groups and their best indicators are highlighted.
The analysis of tracking accuracy through RMSE shows that the TD3-based decoupled configuration outperforms all others tested, with a 32.2% reduction in RMSE compared with classical MPC and a 7.2% improvement over Dec-DDPG, indicating closer adherence to the reference trajectory during motion and under disturbances. Con-DDPG copes moderately well in stable circumstances but shows a larger error variance in noisy environments due to instability in its online policy updates.
Regarding the control effort, Dec-TD3 is the most energy-efficient configuration and brings about a 13.7% reduction in cumulative actuator workload with respect to MPC and 17.8% with respect to Con-TD3, implying that it renders smooth and less aggressive control actions that reduce mechanical stress and increase its attractiveness for power-constrained embedded systems. On the contrary, continuous paradigms, due to continual online updates, demand higher control effort.
Stability margin evaluation further confirms the superiority of the decoupled TD3 approach, with a 20% improvement in gain margin and a 15% increase in phase margin over MPC, affirming its robustness against model mismatch and external disturbances. The continuous RL paradigms, especially Con-DDPG, showed reduced stability margins and were therefore more prone to destabilization during real-time adaptation, emphasizing the advantage of the hybrid decoupled RL-MPC framework in balancing robustness with adaptive performance in real time.