Article

Hierarchical Reinforcement Learning–Based Optimal Control for Model-Free Linear Systems

1 School of Automation and Electrical Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
2 Key Laboratory of Synthetical Automation for Process Industries at Universities of Inner Mongolia Autonomous Region, Inner Mongolia University of Science and Technology, Baotou 014010, China
3 School of Computing, Engineering and the Built Environment, Edinburgh Napier University, Edinburgh EH10 5DT, UK
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(5), 895; https://doi.org/10.3390/math14050895
Submission received: 11 February 2026 / Revised: 2 March 2026 / Accepted: 4 March 2026 / Published: 6 March 2026
(This article belongs to the Special Issue Dynamic Modeling and Simulation for Control Systems, 3rd Edition)

Abstract

A novel model-free hierarchical reinforcement learning (HRL)–based Linear Quadratic Regulator (LQR) control framework with adaptive weight selection is proposed to address the reliance of conventional LQR methods on accurate system models and manual parameter tuning. The proposed approach adopts a two-level learning architecture in which a high-level meta-agent adaptively optimizes the LQR weighting matrices Q and R through entropy-based trajectory evaluation, while a low-level base-agent performs model-free policy iteration to update the state-feedback control law under unknown system dynamics. By decoupling weight optimization from control-law learning, the framework enables simultaneous adaptation of the cost-function parameters and the feedback gain without requiring explicit model information. To enhance learning stability and exploration during weight adaptation, Gaussian noise and an experience replay mechanism are incorporated into the learning process. Numerical simulations on second- and third-order linear systems demonstrate that the proposed HRL-based LQR method achieves effective control performance, reliable convergence, and improved adaptability in model-free environments.

1. Introduction

With the continuous advancement of technology [1,2], control systems have grown significantly in scale and complexity, placing increasing demands on performance, robustness, and adaptability. Modern optimal control methods, such as LQR and model predictive control (MPC) [3,4], have achieved remarkable success in a wide range of engineering applications, including high-end manufacturing and process control. By explicitly incorporating system dynamics and optimization objectives into the control design, these methods are capable of delivering high control accuracy and desirable closed-loop performance. Despite their effectiveness, the practical deployment of LQR and MPC is fundamentally constrained by their reliance on accurate system models. In real-world applications [5,6], obtaining precise mathematical models often entails high identification costs, long development cycles, and inevitable modeling errors. Model mismatch, unmodeled dynamics, and external disturbances can significantly degrade control performance, thereby limiting the applicability of model-based optimal control methods in complex and uncertain environments.
In recent years, the limitations imposed by model dependence have motivated growing interest in data-driven and learning-based control approaches. A variety of intelligent control strategies have been developed [7], including adaptive dynamic programming (ADP), reinforcement learning (RL), and related model-free control techniques. These methods aim to approximate optimal control policies directly from data without requiring explicit knowledge of system dynamics. For example, in ref. [8], a generalized value-iteration-based ADP algorithm was proposed that allows initialization with any positive semi-definite function and transforms the optimal tracking problem into a regulation problem. Similarly, ref. [9] introduced a model-free RL algorithm based on actor–critic neural networks and online learning control to handle systems with unknown dynamics through value iteration and adaptive critic learning.
However, in most existing learning-based control studies, the performance index of the controller is manually specified and remains fixed during operation [10,11]. The weighting parameters in the cost function are typically selected based on empirical experience and engineering intuition. As a result, the controller may apply identical control strategies under varying operating conditions or require repeated manual tuning of performance weights, which restricts adaptability and increases reliance on expert knowledge.
To alleviate the dependence on manual weight specification, research efforts have explored behavior imitation and trajectory-tracking frameworks that infer performance weights from data. Representative approaches include learning from demonstrations (LfD) [12,13] and imitation learning, in which an expert system first provides reference trajectories, and a learning agent attempts to reproduce the observed behavior. A widely adopted methodology is inverse reinforcement learning (IRL) [14,15,16], which assumes that the expert follows an optimal control policy and seeks to recover the underlying reward-function weights from observed trajectories. In early work, ref. [17] laid the foundation for IRL, while ref. [14] provided a comprehensive survey of its theoretical developments and applications.
Nevertheless, most existing IRL-based and imitation-learning approaches focus on discrete or episodic decision-making problems, where closed-loop stability [18] is not explicitly guaranteed. Moreover, the inferred performance weights are typically obtained offline prior to controller deployment, making them unsuitable for real-time adaptive weight tuning in continuous control systems. In practical industrial applications, systems often operate under time-varying conditions with significant uncertainty and incomplete model knowledge, further limiting the effectiveness of offline-trained or fixed-weight control strategies. Although reinforcement learning frameworks offer greater flexibility [19,20,21,22,23], their performance metrics still rely heavily on manual engineering design, which hinders large-scale practical deployment.
These limitations motivate the development of model-free control frameworks that can adaptively adjust performance weights online [24,25,26], maintain closed-loop stability, and operate without accurate system models.
HRL has been extensively investigated in control applications. For example, one study proposed an ANDC–DDRL fusion strategy to achieve adaptive and stable cardiac pacing by modeling the cardiovascular system as a Markov decision process (MDP) [27]. An electromechanical dynamic model comprising multiple subsystems—such as pacemaker and heartbeat dynamics—was established, and a Double Deep Q-Network (DDQN) was employed to calibrate the ANDC controller parameters online, thereby addressing the nonlinearity and physiological variability of the cardiac system. Another study introduced a biomimetic HRL framework to overcome limitations of conventional hierarchical architectures, including unstable inter-layer coordination, inefficient subgoal scheduling, delayed responses, and limited interpretability [28]. The framework integrates two key mechanisms: timed subgoal scheduling (TS) and a neural dynamic biomimetic circuit network (NDBCNet), thereby enhancing coordination efficiency and structural transparency across hierarchical levels. In addition, a separate line of research combined unsupervised reinforcement learning with a prompt-driven foundational model of humanoid robot behavior [29]. This approach effectively mitigates major drawbacks of traditional humanoid control strategies, such as task specificity, poor adaptability, and the absence of a unified target interface. Consequently, zero-shot multi-task execution and rapid single-policy adaptation with limited samples were achieved on real humanoid robotic platforms.
Collectively, these studies demonstrate the growing importance of reinforcement learning in advanced control systems. However, many existing approaches primarily focus on complex nonlinear applications and rely on deep neural network approximators, which may introduce increased computational complexity and training overhead. In addition, structured optimal control integration and online cost-function adaptation within classical control frameworks have received comparatively less attention. These observations motivate the development of model-free, hierarchically structured reinforcement learning methods that preserve control-theoretic interpretability while enabling online adaptive optimization.
To address the aforementioned challenges, this paper proposes a model-free HRL–based LQR control framework that jointly optimizes the state-feedback control law and the cost-function weighting matrices. The proposed framework adopts a two-level learning architecture, in which the overall control task is decomposed into a meta-agent and a base-agent with complementary responsibilities. At the lower level, the base-agent is responsible for real-time control execution and model-free learning of the state-feedback gain matrix. A data-driven Q-learning–based policy iteration method is employed to update the control law online without requiring explicit knowledge of the system dynamics, thereby eliminating the dependence of conventional LQR on known model parameters. At the higher level, the meta-agent focuses on adaptive optimization of the LQR weighting matrices Q and R. Using state and control trajectories generated by the base-agent during closed-loop operation, an entropy-based optimization mechanism is introduced to evaluate control performance and guide the online adjustment of the cost-function parameters. This hierarchical structure enables automatic weight tuning, reduces reliance on manual parameter selection, and enhances adaptability to system uncertainties. The main contributions of this paper are summarized as follows:
1. By integrating reinforcement learning with the LQR control algorithm, the proposed framework effectively addresses the challenge of controlling systems with unknown parameters, eliminating the need for precise knowledge of the system model.
2. Trajectory entropy is utilized to optimize the controller’s performance parameters, thereby reducing reliance on manual tuning and expanding the applicability of the control system.
3. Through a hierarchical reinforcement learning architecture, adaptive optimization of the control performance weight matrices is achieved during the control process, enabling the system to adapt to dynamic changes and select improved control trajectories.
The remainder of this paper is organized as follows. Section 2 presents the detailed formulation of the control problem, introduces the implementation and limitations of the traditional LQR, and describes a model-free control approach based on Q-learning. Section 3 focuses on the design of an entropy-based method for adapting the weights of the control performance index and provides the corresponding implementation algorithm within an HRL framework. Section 4 demonstrates the effectiveness of the proposed controller through numerical simulations. Finally, Section 5 summarizes the paper and discusses potential directions for future research.

2. Problem Description

2.1. Traditional LQR Control Algorithm

The LQR is a fundamental problem in optimal control theory, where the primary objective is to design a state-feedback control law that minimizes predefined quadratic performance metrics subject to system dynamic constraints.
Consider the following linear system subject to Gaussian noise:
x_{k+1} = A x_k + B u_k + \nu_k
where x \in \mathbb{R}^n denotes the system state vector, A \in \mathbb{R}^{n \times n} is the system matrix, B \in \mathbb{R}^{n \times m} is the input matrix, u \in \mathbb{R}^m is the control input, and \nu_k \sim \mathcal{N}(\mu, \sigma^2) denotes Gaussian noise with mean \mu and variance \sigma^2.
Based on optimal control theory, the following performance index is formulated for the infinite-horizon regulation problem:
J = \sum_{k=0}^{\infty} \left( x_k^T Q x_k + u_k^T R u_k \right)
where Q \succeq 0 and R \succ 0 are given weighting matrices, with Q positive semidefinite and R positive definite. It is further assumed that the pair (A, Q) is observable. Under the linear system dynamics, the Hamilton–Jacobi–Bellman (HJB) equation is given as follows:
H(x, u) = (A x + B u)^T P x + x^T P (A x + B u) + x^T Q x + u^T R u = 0.
The performance metric is defined as the following:
V^* = x^T P x.
The resulting optimal control law is expressed as follows:
u = -K x, \qquad K = R^{-1} B^T P
where P > 0 is a symmetric positive definite matrix that satisfies the following algebraic Riccati equation (ARE):
A^T P + P A + Q - P B R^{-1} B^T P = 0.
The above control law relies on the availability of the system matrices A and B. When the system parameters are unknown, however, the associated ARE is no longer tractable. To address this limitation, a model-free enhanced Q-learning–based optimal control method is developed for systems with unknown parameters.
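For reference, when the model is available, the ARE above can be solved numerically by Kleinman's iteration, which alternates Lyapunov solves with gain updates. The sketch below is a model-based baseline illustration, not the paper's model-free method: the second-order plant is hypothetical and chosen open-loop stable so that the initial gain K_0 = 0 is admissible, and the Kronecker-based Lyapunov solver is a simple implementation choice suitable only for small state dimensions.

```python
import numpy as np

def solve_lyapunov(Acl, M):
    """Solve Acl^T P + P Acl + M = 0 by Kronecker vectorization (small n only)."""
    n = Acl.shape[0]
    L = np.kron(np.eye(n), Acl.T) + np.kron(Acl.T, np.eye(n))
    P = np.linalg.solve(L, -M.flatten(order="F")).reshape((n, n), order="F")
    return (P + P.T) / 2  # symmetrize against round-off

def lqr_gain(A, B, Q, R, K0=None, iters=50):
    """Kleinman iteration for A^T P + P A + Q - P B R^{-1} B^T P = 0."""
    K = np.zeros((B.shape[1], A.shape[0])) if K0 is None else K0
    for _ in range(iters):
        Acl = A - B @ K                        # closed-loop matrix under u = -K x
        P = solve_lyapunov(Acl, Q + K.T @ R @ K)
        K = np.linalg.solve(R, B.T @ P)        # K = R^{-1} B^T P
    return P, K

# Hypothetical open-loop-stable second-order plant (illustrative values).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
P, K = lqr_gain(A, B, Q, R)                    # optimal control: u = -K x
```

Each iteration evaluates the cost of the current gain (a Lyapunov equation) and then improves the gain, which is exactly the model-based counterpart of the policy evaluation/improvement scheme developed in the next section.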

2.2. Data-Driven Reinforcement Learning for LQR Problems

Motivated by the limitation identified in the previous section, this section develops a reinforcement learning–based Q-learning framework for optimal control under unknown system dynamics. The proposed approach achieves tracking of the desired weighted states by iteratively learning the optimal state–action value function, leading to improved computational efficiency compared with conventional reinforcement learning methods. The detailed methodology is presented below.
Within the reinforcement learning framework, the value function is defined as follows:
V(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} \left[ x_i^T Q x_i + u_i^T R u_i \right] = \sum_{i=k}^{\infty} \gamma^{i-k} \rho_i = \rho_k + \gamma \sum_{i=k+1}^{\infty} \gamma^{i-k-1} \rho_i
where \rho_k = x_k^T Q x_k + u_k^T R u_k denotes the instantaneous cost, \gamma is the reinforcement learning discount factor, and Q and R are symmetric weighting matrices. Based on the above formulation, the corresponding Bellman equation can be derived as follows:
V(x_k) = \rho_k + \gamma V(x_{k+1})
where V(x_{k+1}) is evaluated at the future state x_{k+1}. By expressing x_{k+1} in terms of the current state x_k via Equation (1), Equation (8) can be further expressed as
V(x_k) = E\{\rho_k + \gamma (A x_k + B u_k + \nu_k)^T P (A x_k + B u_k + \nu_k)\}
where E denotes the mathematical expectation with respect to the noise sequence \{\nu_0, \nu_1, \ldots\}. Accordingly, the Q-function is defined as follows:
Q(x_k, u_k) = E\{\rho_k + \gamma (A x_k + B u_k + \nu_k)^T P (A x_k + B u_k + \nu_k)\}.
By substituting \rho_k = x_k^T Q x_k + u_k^T R u_k into the equation, we obtain the following:
Q(x_k, u_k) = E\{x_k^T Q x_k + u_k^T R u_k + \gamma (A x_k + B u_k + \nu_k)^T P (A x_k + B u_k + \nu_k)\}.
The expression can be further rewritten as follows:
Q(x_k, u_k) = E\left\{ \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + \gamma A^T P A & \gamma A^T P B \\ \gamma B^T P A & R + \gamma B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \right\}.
The intermediate matrix is defined as the H-kernel matrix as follows:
Q(x_k, u_k) = E\left\{ \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \right\} = E\left\{ \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} \right\}
with
H_{xx} = Q + \gamma A^T P A,
H_{xu} = \gamma A^T P B,
H_{ux} = \gamma B^T P A,
H_{uu} = R + \gamma B^T P B.
Define the extended state Z_k as follows:
Z_k = \begin{bmatrix} x_k \\ u_k \end{bmatrix}.
Substituting Equation (18) into Equation (13) allows the Q-function to be expressed in the following form:
Q(x_k, u_k) = E\left\{ \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} \right\} = E\{Z_k^T H Z_k\}.
By setting the gradient of the Q-function with respect to the control input to zero, the optimal control law can be derived, yielding a Riccati-type solution to the optimal control problem. Applying the condition \partial Q(x_k, u_k) / \partial u_k = 0 to Equation (13), we obtain the following:
u_k = -(H_{uu})^{-1} H_{ux} x_k.
Substituting Equations (16) and (17) into Equation (20), the following result can be obtained,
u_k = -(R + \gamma B^T P B)^{-1} \gamma B^T P A \, x_k.
At convergence of the iteration, the kernel matrix H satisfies the following Bellman equation:
Z_k^T H Z_k = x_k^T Q x_k + u_k^T R u_k + \gamma Z_{k+1}^T H Z_{k+1}.
The proposed algorithm follows a two-step iterative procedure consisting of policy evaluation and policy improvement.

2.2.1. Policy Evaluation

Given a fixed control policy, the associated Q-function is defined as follows:
Q(x_k, u_k) = E\left\{ \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} \right\} = E\{Z_k^T H Z_k\}.
The kernel matrix H is estimated using data collected under the current control policy via (23).

2.2.2. Policy Improvement

Based on the estimated kernel matrix, the control policy is updated by minimizing the Q-function with respect to the control input, yielding
u_k^{j+1} = -(H_{uu}^j)^{-1} H_{ux}^j x_k.
The policy iteration is implemented using a least-squares (LS) method over collected state–input data tuples, leading to an iterative improvement of the control strategy.
Through this iterative procedure, the Q-learning algorithm progressively refines both the value function approximation and the corresponding control policy without requiring explicit knowledge of the underlying system model. Multiplicative noise is incorporated into the learning process through data-driven estimation of the kernel matrix. As a result, this approach not only enhances the flexibility of the control system but also improves its robustness in handling practical changes.
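The least-squares policy iteration described above can be sketched numerically. The fragment below is a minimal illustration under assumed second-order plant matrices: the `svec` quadratic feature map, the noise levels, and the iteration counts are choices made for this sketch, not values from the paper. The simulated plant is used only to generate data, mimicking interaction with an unknown system; the learner itself estimates the kernel matrix H from Bellman samples and improves the policy via Equation (24).

```python
import numpy as np

rng = np.random.default_rng(0)

def svec(z):
    """Quadratic features: upper-triangular entries of z z^T, off-diagonals doubled,
    so the regression coefficients are exactly the entries of a symmetric H."""
    outer = np.outer(z, z)
    iu = np.triu_indices(len(z))
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)
    return scale * outer[iu]

# Hypothetical discrete-time plant (used only to simulate data).
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Qw, Rw, gamma = np.eye(2), np.array([[1.0]]), 0.95
n, m = 2, 1

K = np.zeros((m, n))                        # initial policy u = -K x
for it in range(15):                        # outer policy iteration
    Phi, y = [], []
    x = rng.normal(size=n)
    for k in range(200):                    # policy evaluation: samples of Eq. (23)
        u = -K @ x + 0.1 * rng.normal(size=m)        # exploration noise
        cost = x @ Qw @ x + u @ Rw @ u
        x_next = A @ x + B @ u
        u_next = -K @ x_next                          # on-policy next action
        Phi.append(svec(np.concatenate([x, u]))
                   - gamma * svec(np.concatenate([x_next, u_next])))
        y.append(cost)
        x = x_next
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = H + H.T - np.diag(np.diag(H))       # recover the symmetric kernel
    K = np.linalg.solve(H[n:, n:], H[n:, :n])   # Eq. (24): K = H_uu^{-1} H_ux
```

Because the simulated dynamics here are deterministic, the regression is exactly consistent and the learned gain coincides with the model-based discounted LQR gain.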
The step-by-step procedure of the proposed control framework is summarized in Algorithm 1.
Algorithm 1 Model-free tracking control
 1: Input: kernel matrix H^0 to be updated
 2: Output: state of the controlled object
 3: Select a stabilizing initial control strategy u^0
 4: Initialize: Q and R in the Q-function, extended state Z_0, discount factor \gamma, allowable error \sigma
 5: for k = 0, 1, 2, \ldots do
 6:     Update x_{k+1} using Equation (1)
 7:     Update H_{k+1} using Equation (23)
 8:     Solve Equation (23) by the least-squares method
 9:     Update u_{k+1} using Equation (24)
10:     if \|x_k - r_k\| > \sigma then
11:         k \leftarrow k + 1; return to step 6
12:     else
13:         Break; learning ends
14:     end if
15: end for
16: Return
With unknown system dynamics, the optimal state-feedback gain within the LQR framework is obtained through iterative updates of a baseline control policy. To support the learning of the performance index weights, an experience replay mechanism is employed to store state–input data collected during system operation. This mechanism improves data utilization efficiency and enhances the numerical stability of the learning process.
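The experience replay mechanism mentioned above can be as simple as a fixed-capacity buffer with uniform random sampling; a minimal sketch follows, in which the capacity, the `(x, u, cost, x_next)` tuple layout, and the dummy data are illustrative assumptions rather than details specified in the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: oldest transitions are discarded first."""
    def __init__(self, capacity=1000):
        self.buf = deque(maxlen=capacity)

    def store(self, transition):
        """Append one (x, u, cost, x_next) tuple."""
        self.buf.append(transition)

    def sample(self, batch_size):
        """Uniform random mini-batch (without replacement)."""
        return random.sample(self.buf, min(batch_size, len(self.buf)))

buf = ReplayBuffer(capacity=5)
for k in range(8):
    buf.store((k, -0.5 * k, k ** 2, k + 1))   # dummy transitions
batch = buf.sample(3)
```

Batch sampling from such a buffer decorrelates consecutive transitions, which is what improves the numerical stability of the least-squares updates noted in the text.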

3. Architecture Design of Hierarchical Reinforcement Learning

In optimal control design, the choice of the weighting matrices Q and R plays a critical role in shaping closed-loop performance, stability margins, and control effort. In practice, these weights are often selected through manual tuning or heuristic rules, which limits adaptability under changing operating conditions. Conventional reinforcement learning methods, while effective for policy optimization, do not directly address the systematic selection of optimal control policies across varying Q and R matrices, particularly in long-horizon decision-making problems with high-dimensional state–action spaces and sparse reward signals. To overcome these limitations, it is desirable to develop an autonomous optimization mechanism that updates the weighting matrices Q and R online according to system operating conditions, while preserving trajectory stability and simultaneously minimizing both the quadratic performance index and system entropy.
To this end, HRL is adopted to decompose the control task into multiple decision layers. In the proposed framework, a meta-agent is responsible for long-term planning and performance weight adaptation, while a base-agent executes low-level control actions. This hierarchical structure enables temporal abstraction, reduces the effective search space of individual subtasks, and improves sample efficiency and policy generalization.

3.1. Weight Matrix Adaptive Update Strategy

This section focuses on the adaptive adjustment of the weighting parameters in the LQR. Traditional LQR parameter tuning typically relies on empirical trial and error, as illustrated in Figure 1, and the sensitivity of the weight matrices in complex environments limits the controller’s ability to generalize. In particular, selecting appropriate parameters for the optimal control cost function remains a significant challenge in control theory. Within a learning-based control setting, adaptive weight tuning requires a principled mechanism to regulate the trade-off between performance optimization and exploration under uncertainty. Reinforcement learning methods grounded in the maximum entropy principle provide a rigorous theoretical framework for balancing exploration and exploitation. Motivated by this principle, an entropy-based optimization mechanism is introduced into the adaptive tuning of the LQR weighting matrices. Specifically, an information-entropy–driven weight update strategy is proposed to dynamically adjust the state weighting matrix Q and the control weighting matrix R during system operation.
From an information-theoretic viewpoint, entropy [30] is a fundamental measure of uncertainty, originally introduced by Claude Shannon in 1948 and commonly referred to as Shannon entropy. It quantifies the degree of randomness in a system through a probability distribution. A higher entropy value indicates greater uncertainty and lower predictability, whereas a lower entropy value corresponds to more deterministic information. The mathematical expression of information entropy is given by:
S_E(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
where p(x_i) denotes the probability of occurrence of event x_i, with i indexing the steps of the control loop. This expression represents the average uncertainty associated with the random variable X. Based on the definition and interpretation of Shannon entropy in (25), a trajectory-level information-entropy-based reward function is constructed to guide the adaptive policy for weight tuning.
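As a quick numerical check of Equation (25), a direct implementation is given below (base-2 logarithm; zero-probability terms are skipped by the usual 0 log 0 = 0 convention). The example distributions are illustrative.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution p (Eq. (25))."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A uniform distribution maximizes entropy; a deterministic one minimizes it.
H_uniform = shannon_entropy([0.25] * 4)   # 2.0 bits
H_certain = shannon_entropy([1.0])        # 0.0 bits
```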
To ensure numerical consistency and comparability across different state and input channels, the collected data are first normalized. The Euclidean (\ell_2) norm is then applied to the state and control-input trajectories, where j indexes the samples collected within the entropy-minimization loop, yielding a scalar representation suitable for entropy-based reward evaluation:
\|x_i\| = \sqrt{\textstyle\sum_j x_{ij}^2}, \quad i = 1, 2, \ldots, n,
\|u_i\| = \sqrt{\textstyle\sum_j u_{ij}^2}, \quad i = 1, 2, \ldots, n.
Subsequently, the probability distribution is estimated using histogram-based statistical analysis as follows,
P_x = \mathrm{histcounts}(x_{\mathrm{norms}}) / n,
P_u = \mathrm{histcounts}(u_{\mathrm{norms}}) / n
where n denotes the maximum number of steps in the low-level policy execution loop. According to Equation (25), the trajectory information entropy-based reward function for the high-level policy is defined as follows:
Entropy(X, U) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) - \sum_{i=1}^{n} p(u_i) \log_2 p(u_i).
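Equations (26)-(30) can be combined into a single trajectory-entropy routine. The sketch below uses `numpy.histogram` as a stand-in for MATLAB's `histcounts` (an assumption about the original tooling); the bin count and the random test trajectories are illustrative.

```python
import numpy as np

def trajectory_entropy(X, U, bins=10):
    """Entropy-based reward of Eq. (30): per-step 2-norms of the state and
    input trajectories -> histogram probabilities -> summed Shannon entropies."""
    x_norms = np.linalg.norm(X, axis=1)   # Eq. (26)
    u_norms = np.linalg.norm(U, axis=1)   # Eq. (27)
    n = len(x_norms)

    def ent(v):
        counts, _ = np.histogram(v, bins=bins)   # histcounts analogue
        p = counts / n                            # Eqs. (28)-(29)
        p = p[p > 0]                              # 0 log 0 = 0 convention
        return -np.sum(p * np.log2(p))

    return ent(x_norms) + ent(u_norms)            # Eq. (30)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))   # illustrative state trajectory
U = rng.normal(size=(200, 1))   # illustrative input trajectory
E = trajectory_entropy(X, U)
```

A perfectly regular trajectory (constant norms) yields zero entropy, while scattered norms yield a positive value, which is what the meta-agent drives downward.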
The entropy change is defined as the difference between the historical entropy and the current entropy:
\Delta Entropy = Entropy_{k-1} - Entropy_k.
The optimization objective is to minimize the entropy change; therefore, the corresponding gradient term is given by:
\Delta\theta = -\eta \, \Delta Entropy
where η denotes the discount factor for the entropy weights, and θ represents the parameter matrix Q or R. The discount factor η balances the trade-off between control performance cost and exploration gain, and the entropy minimization objective is achieved by updating the parameters along the negative gradient direction.
To mitigate convergence to local optima, the parameter update rule is augmented with Gaussian noise [31], expressed as follows:
Q_{\mathrm{new}} = Q - \eta \, \Delta Entropy + \epsilon_Q,
R_{\mathrm{new}} = R - \eta \, \Delta Entropy + \epsilon_R.
The exploration noise consists of Gaussian perturbations \epsilon_Q \sim \mathcal{N}(0, \sigma_Q^2) and \epsilon_R \sim \mathcal{N}(0, \sigma_R^2), whose variances are selected according to the exploration–exploitation trade-off principle in reinforcement learning.
In this paper, we assume that the injected exploration noise is norm-bounded in implementation [32]. Specifically, there exists a finite constant \bar{\epsilon} > 0 such that
\|\epsilon_Q\| \le \bar{\epsilon}, \qquad \|\epsilon_R\| \le \bar{\epsilon}.
In practice, the exploration perturbation is kept small relative to the norm of the weight update step, while its variance is scaled proportionally to the inherent process noise of the controlled system. This design enables quantitative regulation of the exploration intensity based on both optimization stability considerations and system uncertainty.
The primary role of the Gaussian exploration noise is to introduce moderate perturbations during the weight optimization process of the meta-agent, thereby reducing the risk of convergence to local optima. To avoid excessive stochastic disturbance, the standard deviation of the exploration noise is selected heuristically to scale with the entropy-based weight update magnitude and the norm of the weighting matrices, i.e.,
\sigma_Q \propto \eta \max(|\Delta Entropy|) \, \|Q\|_F, \qquad \sigma_R \propto \eta \max(|\Delta Entropy|) \, \|R\|_F.
Under this setting, the injected noise acts as a small exploratory perturbation rather than dominating the optimization trajectory.
Furthermore, to maintain consistency with system uncertainty, a linear proportional relationship is established between the variance of the exploration noise and the variance of the system process noise \nu_k \sim \mathcal{N}(0, \sigma_v^2), defined as
\sigma_Q^2 = \sigma_R^2 = \kappa \sigma_v^2
where the proportional coefficient κ is selected as 0.1 based on empirical simulation studies to balance exploration effectiveness and closed-loop smoothness.
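One noise-augmented weight update following Equations (31), (33), (34), and (37) can be sketched as follows. Applying the scalar entropy step along the identity direction and projecting the result back onto the positive definite cone are implementation choices of this sketch (guards to keep Q ⪰ 0 and R ≻ 0), not steps specified in the paper; the numerical values of eta, kappa, and the process-noise variance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def nearest_spd(M, floor=1e-6):
    """Project a matrix onto symmetric positive definite form (eigenvalue clip)."""
    M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.maximum(w, floor)) @ V.T

def meta_update(Q, R, entropy_prev, entropy_curr,
                eta=0.05, kappa=0.1, sigma_v2=0.01):
    """One meta-agent weight update (Eqs. (31), (33), (34), (37))."""
    d_entropy = entropy_prev - entropy_curr            # Eq. (31)
    sigma = np.sqrt(kappa * sigma_v2)                  # Eq. (37): sigma^2 = kappa * sigma_v^2
    eps_Q = sigma * rng.normal(size=Q.shape)           # exploration noise
    eps_R = sigma * rng.normal(size=R.shape)
    Q_new = Q - eta * d_entropy * np.eye(len(Q)) + eps_Q   # Eq. (33)
    R_new = R - eta * d_entropy * np.eye(len(R)) + eps_R   # Eq. (34)
    return nearest_spd(Q_new), nearest_spd(R_new)

Qn, Rn = meta_update(np.eye(2), np.eye(1), entropy_prev=2.0, entropy_curr=1.6)
```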
In the proposed framework, closed-loop stability rests on the low-level Q-learning state-feedback structure. Under bounded disturbances and zero-mean exploration noise, the learning process consistently converged to a stabilizing feedback gain in all simulations. The meta-agent updates the weights using the entropy-based method on a slower time scale, allowing gradual rather than abrupt adjustments to the weight matrices. As a result, each update yields a new Q-learning feedback structure while the stability margin remains intact over time.
To optimize the weighting matrices Q and R in the LQR framework, a high-level entropy-based optimization strategy is developed to enable adaptive adjustment of the cost function parameters in response to system operating conditions. Building on this strategy, an HRL architecture is introduced to further improve overall control performance by coordinating weight adaptation and control policy execution across different decision layers.

3.2. Hierarchical Reinforcement Learning–Based Controller Framework

Building upon the previously developed model-free control framework and adaptive weight update strategy, this section constructs a two-level HRL architecture to address the optimal control of model-free linear systems [33]. The proposed framework decomposes the control task into two functional layers: a meta-agent responsible for weight adaptation and a base-agent responsible for real-time control execution.
The meta-agent operates at a higher decision level and performs long-term planning and adaptive regulation of the LQR weighting matrices. To stabilize the learning process and prevent excessive parameter variations, a trust-region mechanism inspired by proximal policy optimization is incorporated into the meta-agent update scheme. The inputs to the meta-agent consist of state and control trajectories generated by the base-agent over a finite time horizon and stored in an experience replay buffer. These trajectories capture the operating conditions of the system and implicitly reflect the influence of disturbances, noise, and external uncertainties.
Based on the collected trajectory information, the meta-agent aims to minimize the entropy of system trajectories, thereby promoting smoother and more stable closed-loop behaviour. To this end, the meta-agent outputs adaptively updated LQR weighting matrices Q and R. By regulating these cost-function parameters, the meta-agent indirectly shapes the control objectives of the base-agent, enabling long-term performance regulation without direct intervention in the control loop.
The base-agent operates at the lower level and is responsible for real-time control and local optimal feedback regulation. Given the current system state and the updated weighting matrices Q and R provided by the meta-agent, the base-agent computes the state-feedback gain matrix K through direct interaction with the environment. By iteratively sampling system states, applying control inputs, and receiving instantaneous reward feedback, the base-agent updates the gain matrix K to achieve near-optimal local dynamic performance.
Under the high-level guidance of the meta-agent, the base-agent adapts rapidly to environmental variations while satisfying real-time operational constraints, ensuring local stability and fast transient response of the controlled system.
Overall, the proposed HRL framework overcomes key limitations of single-layer reinforcement learning approaches in complex control settings, including excessive search spaces, slow convergence, and sensitivity to local optima. The meta-agent focuses on long-term regulation of global performance parameters through entropy minimization, while the base-agent concentrates on real-time state-feedback control. The two layers operate in a complementary manner, jointly ensuring global stability and fast local dynamic response.
The overall control architecture is illustrated in Figure 2, which depicts the information flow and interaction between the meta-agent and the base-agent. Specifically, the meta-agent extracts historical trajectory data from the experience replay buffer to update the weighting matrices Q and R. These updated parameters are then transmitted to the base-agent, which computes the corresponding state-feedback gain matrix K and applies the resulting control input to the system, generating new system states and reward signals.
To ensure the proposed control framework can be effectively implemented, the step-by-step procedure is summarized in Algorithm 2.
To provide a clearer and more intuitive overview of the proposed algorithmic framework, the corresponding program flowchart is presented in Figure 3. The flowchart depicts the sequential execution procedure, module interactions, and data flow within the hierarchical control architecture, thereby offering a structured visualization of the overall implementation process.
Algorithm 2 Hierarchical reinforcement learning-based model-free linear system optimal control (HRL-LQR)
 1: Initialization: Q_0, R_0, K_0, system initial state x_0, high-level learning rate \alpha_{high}, low-level learning rate \alpha_{low}, maximum outer iteration number N_{outer}, maximum inner iteration number N_{inner}, convergence threshold \epsilon, experience replay buffer Buff, appropriate penalty threshold
 2: Set Q \leftarrow Q_0, R \leftarrow R_0, K \leftarrow K_0
 3: for i = 0 to N_{outer} do
 4:     Randomly sample a batch of experience data from the buffer Buff
 5:     Calculate the entropy value Entropy of the current system trajectory using the sampled data
 6:     Update the high-level parameters according to the entropy-descent principle
 7:     Pass the updated Q_i and R_i to the low-level policy
 8:     for j = 0 to N_{inner} do
 9:         Observe the current state x_j, compute the control input u_j
10:         Substitute u_j into the system update formula to obtain the next state x_{j+1} and the current reward r_j
11:         Store the data into the experience replay buffer Buff
12:         Update the feedback gain K according to the Q-learning (least-squares) update rule
13:         if \|K_j - K_{j-1}\| \le \epsilon then
14:             Break; end of inner loop
15:         else
16:             Update K
17:         end if
18:     end for
19: end for
20: Return
In the proposed HRL framework, the meta-agent focuses on long-term planning and adaptive regulation of the LQR weighting matrices, while the base-agent executes real-time state-feedback control by computing the optimal gain matrix. This separation of responsibilities enables efficient model-free optimal control and enhances the overall adaptability of the closed-loop system. At the same time, this design mitigates the adverse effects of time-varying LQR weights on the feedback gain, thereby preventing undesired fluctuations in the control signal. First, a hierarchical time-scale separation strategy is adopted: the meta-agent updates the weighting matrices Q and R at a significantly lower frequency than the low-level control loop. Consequently, frequent perturbations to the steady-state behavior of the closed-loop system are avoided, which in turn reduces abrupt variations in the control input. Second, the entropy-based optimization is trajectory-oriented rather than driven by instantaneous performance metrics. Because the adaptation process evaluates the joint distribution of states and control inputs over a finite time horizon, weight updates tend to be gradual and globally consistent rather than aggressive short-term adjustments. Third, experience replay and trust-region constraints are incorporated into the weight update process. Batch-based sampling attenuates sensitivity to transient disturbances, while step-size limitations on parameter updates prevent excessive variations in the weighting matrices, thereby restricting sudden changes in the resulting feedback gains. Through these architectural and optimization mechanisms, the proposed framework effectively balances adaptability and control smoothness. As a result, potential actuator stress is alleviated, while the benefits of online adaptive weight adjustment are retained.
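To make the two-level interaction of Algorithm 2 concrete, the following Python sketch illustrates the control flow under simplifying assumptions of ours: the base-agent's least-squares Q-learning update is replaced here by a model-based Riccati iteration (so the sketch stays short), and the meta-agent's entropy-descent step is reduced to accepting random weight perturbations only when they lower the trajectory entropy. It is an illustration of the hierarchy, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.2, 0.5], [0.3, 0.8]])   # unknown to the learner in the paper
B = np.array([[0.5], [0.2]])

def solve_gain(Q, R, iters=300):
    """Stand-in for the base-agent: Riccati iteration instead of the
    model-free least-squares Q-learning update used in the paper."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

def rollout_entropy(K, steps=200):
    """Histogram entropy of the closed-loop state trajectory."""
    x, traj = np.array([2.0, 2.0]), []
    for _ in range(steps):
        x = (A - B @ K) @ x + rng.normal(0.0, 0.1, 2)
        traj.append(x.copy())
    counts, _ = np.histogram(np.ravel(traj), bins=30, range=(-4, 4))
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# Meta-agent: slow, entropy-driven adaptation of the diagonal weights.
q, r = np.ones(2), 1.0
best = rollout_entropy(solve_gain(np.diag(q), np.array([[r]])))
for _ in range(20):                                   # N_outer
    q_new = np.clip(q + rng.normal(0.0, 0.1, 2), 0.05, None)
    r_new = float(np.clip(r + rng.normal(0.0, 0.1), 0.05, None))
    e = rollout_entropy(solve_gain(np.diag(q_new), np.array([[r_new]])))
    if e < best:                                      # entropy-descent acceptance
        q, r, best = q_new, r_new, e
print(q, r, best)
```

The separation of time scales is visible in the structure: the inner rollout runs at every step, while the weight perturbation is accepted or rejected only once per outer iteration.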

4. Simulation Results

This section is divided into two parts. First, numerical simulations are conducted on second- and third-order systems to validate the effectiveness of the proposed algorithm. Next, the proposed method is systematically compared with several representative existing algorithms to evaluate its relative performance.

4.1. Numerical Simulations of Second- and Third-Order Systems

This section evaluates the effectiveness of the proposed HRL–LQR algorithm through comparative numerical simulations against the conventional LQR method. Two simulation examples are presented to assess control performance in a model-free system environment.
Specifically, a controllable second-order system is considered to demonstrate the feasibility and performance of the proposed model-free hierarchical reinforcement learning–based optimal control approach.
The coefficient matrix of the system model is expressed by the following equation:
x k + 1 = A x k + B u k + v k ,
where the system matrix A = [1.2, 0.5; 0.3, 0.8] and the input matrix B = [0.5, 0.2]^T. The process noise follows a Gaussian distribution, v_k ~ N(0, 0.01). The matrices Q, R, and K are randomly initialized, while the parameters A and B are assumed to be unknown. The learning rate of the meta-agent is set to α_high = 0.01, and the learning rate of the base-agent is α_low = 0.1. The convergence threshold for the base-agent is ε = 1 × 10^−3, and the maximum number of iterations for the low-level policy is N_inner = 200. The initial state is x_0 = [2, 2]^T, and the penalty threshold is set to 1 × 10^5.
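As a sanity check of this setup, the closed-loop behaviour can be reproduced with a few lines of NumPy. Note that the gain below is obtained by a model-based Riccati iteration purely for illustration; the proposed method would instead learn it from data without access to A and B.

```python
import numpy as np

# System from the example; in the paper these matrices are unknown to the learner.
A = np.array([[1.2, 0.5],
              [0.3, 0.8]])
B = np.array([[0.5],
              [0.2]])
Q = np.diag([0.3087, 2.3716])   # weights reported for the trained meta-agent
R = np.array([[0.6774]])

# Discrete-time Riccati iteration (model-based reference solution).
P = Q.copy()
for _ in range(500):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # K = (R + B'PB)^{-1} B'PA
    P = Q + A.T @ P @ (A - B @ K)                      # Riccati update

# Closed-loop simulation x_{k+1} = (A - BK) x_k + v_k
rng = np.random.default_rng(0)
x = np.array([2.0, 2.0])
for k in range(100):
    v = rng.normal(0.0, 0.1, size=2)   # std 0.1, i.e. variance 0.01 as stated
    x = (A - B @ K) @ x + v

print(np.max(np.abs(np.linalg.eigvals(A - B @ K))))    # should be < 1 (stable loop)
```

The open-loop system is unstable (its spectral radius exceeds 1), so the regulation to zero visible in Figure 4 is entirely due to the learned feedback gain.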
The simulation results are presented in Figure 4 and Figure 5. Figure 4 illustrates the control performance of the LQR after training. The blue and green solid curves represent the system states x_1 and x_2, respectively, while the red dashed curve denotes the reference trajectory. In the presence of noise disturbances, the controller is able to drive the system states to closely track the reference value of zero. The selected weighting matrices are Q = diag(0.3087, 2.3716) and R = 0.6774, with a corresponding entropy value of 4.9698. These results indicate that the HRL-LQR algorithm successfully identified the most stable combination of Q and R based on the entropy criterion, and the resulting low-level controller achieved the desired control performance.
In Figure 5, the blue solid line represents the control performance of the proposed algorithm, while the red dashed line shows the performance of the LQR controller using manually selected weight matrices Q and R under a known system model. The former corresponds to model-free control, whereas the latter represents model-based control. The final control performances are highly similar, demonstrating the effectiveness of the proposed method in automatically tuning the weight matrices Q and R. These results indicate that the proposed algorithm achieves control performance comparable to that of model-based approaches, despite the absence of an accurate system model.
In the context of industrial linear control systems, the storage overhead introduced by experience replay is on the order of megabytes, which is substantially below the memory limits of typical industrial PLC/DCS platforms. To balance data validity and storage efficiency, a fixed-capacity first-in-first-out (FIFO) buffer is employed. The associated computational overhead is minimal, consisting primarily of microsecond-level operations such as data writing, uniform random sampling, and lightweight preprocessing. Consequently, more complex prioritized experience replay (PER) mechanisms are unnecessary in this setting, as uniform random sampling is sufficient to satisfy the unbiased estimation requirements of linear control systems. Furthermore, by leveraging the hierarchical time-scale decoupling of the HRL-LQR framework, the low-level base-agent performs small-batch sampling compatible with industrial control cycles ranging from 1 ms to 1 s, while the high-level meta-agent updates the weight matrices using trajectory-based sampling over a slower time scale.
Combined with practical engineering strategies—such as edge deployment, signal noise filtering, and condition-triggered buffer resetting—the computational and storage costs of the experience replay mechanism remain fully manageable. Therefore, the proposed framework satisfies the real-time operational requirements of industrial control applications.
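The fixed-capacity FIFO buffer with uniform sampling described above admits a very small implementation. The sketch below uses our own naming (e.g. `ReplayBuffer`) and illustrates the eviction and sampling behaviour:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest entries are evicted first

    def push(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling is unbiased, which suffices for linear systems
        # (no prioritized replay needed, as argued above).
        return random.sample(list(self.data), min(batch_size, len(self.data)))

buf = ReplayBuffer(capacity=3)
for k in range(5):
    buf.push(k, 0.0, -k, k + 1)
print(len(buf.data))  # prints 3
```

With `maxlen` set, `deque.append` evicts the oldest transition automatically, so the buffer's memory footprint is bounded by construction.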
Figure 6 compares the state error trajectories obtained using the proposed method and the conventional LQR controller. The dashed red curve corresponds to the LQR controller, while the solid blue curve represents the proposed method. As shown, the proposed approach yields smaller tracking errors and improved transient performance.
Consistent with this improvement in closed-loop behaviour, Figure 7 further illustrates the evolution of the controller value function during training. The dashed red curve corresponds to the case where the weighting matrices Q and R are fixed, whereas the solid blue curve represents the adaptive weight update strategy. It is observed that, despite the presence of disturbances, the value function increases monotonically and converges within a bounded range, indicating stable and effective learning.
The preceding simulation results have confirmed the feasibility of the proposed algorithm. To more closely approximate real-world industrial conditions and further validate its effectiveness, a higher-dimensional third-order system is investigated for verification.
In the model, the system matrix A = [0.555, 0.098, 0.041; 0.1, 0.734, 0.181; 0.292, 0.02, 0.291] and the input matrix B = [0.275, 0.302, 0.302]^T. The process noise follows a Gaussian distribution, v_k ~ N(0, 0.004). The matrices Q, R, and K are randomly initialized, while the parameters A and B are assumed to be unknown. The learning rate of the meta-agent is set to α_high = 0.01, and the learning rate of the base-agent is α_low = 0.1. The convergence threshold for the base-agent is ε = 1 × 10^−3, and the maximum number of iterations for the low-level policy is N_inner = 200. The initial state is x_0 = [2, 2, 1]^T, and the penalty threshold is set to 1 × 10^5.
The simulation results are presented in Figure 8 and Figure 9. Figure 8 illustrates the control performance of the LQR after training. The blue, green, and purple solid curves represent the system states x_1, x_2, and x_3, respectively, while the red dashed curve denotes the reference trajectory. In the presence of noise disturbances, the controller is able to drive the system states to closely track the reference value of zero. The selected weighting matrices are Q = diag(0.5414, 0.4185, 0.4483) and R = 0.5646, with a corresponding entropy value of 6.5917. These results indicate that the HRL-LQR algorithm successfully identified the most stable combination of Q and R based on the entropy criterion, and the resulting low-level controller achieved the desired control performance.
In Figure 9, the blue solid line represents the control performance of the proposed algorithm, while the red dashed line shows the performance of the LQR controller using manually selected weight matrices Q and R under a known system model. The former corresponds to model-free control, whereas the latter represents model-based control. The final control performances are highly similar, demonstrating the effectiveness of the proposed method in automatically tuning the weight matrices Q and R. This indicates that the proposed algorithm can achieve control performance comparable to that of model-based approaches even in the absence of an accurate system model.
To provide a more intuitive comparison between the proposed method and the conventional LQR controller, Figure 10 presents their state error trajectories. The red dashed line represents the error produced by the LQR controller, while the blue solid line denotes the state error of the proposed method. It can be observed that, in higher-order systems, the proposed method achieves smoother and faster convergence of the trajectory errors than the traditional controller designed with complete knowledge of the system dynamics.
To more intuitively demonstrate the improvement of the controller value function achieved by the proposed method, Figure 11 illustrates the evolution of the value function during the training process. The red dashed line represents the value function when the matrices Q and R are not updated, while the blue solid line represents the value function when Q and R are adaptively updated. It can be observed that, even in the presence of disturbances, the system value function continues to increase and eventually converges within a bounded range.
The experimental results further validate the effectiveness of the proposed hierarchical reinforcement learning framework for controller parameter optimization. By adaptively updating the weighting matrices Q and R, the proposed method accommodates variations in system operating conditions and improves adaptability without manual parameter tuning. The lower-level policy satisfies the Lyapunov stability condition, thereby ensuring closed-loop stability of the controlled system. Moreover, the hierarchical structure mitigates value-function overestimation during policy iteration, leading to improved convergence reliability of the overall optimization process.

4.2. Comparison with Traditional Intelligent Algorithms

To provide a more intuitive and rigorous comparison with existing approaches, we evaluate the proposed method against heuristic optimization algorithms based on Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO). These methods fundamentally perform offline searches in the LQR weight space, where composite objective weights must be manually specified, and the optimization process relies heavily on expert knowledge. Moreover, their search procedures typically exhibit at least quadratic computational complexity with respect to the solution dimension or population size, rendering them unsuitable for online, real-time adaptation in dynamic control environments. In contrast, the proposed trajectory-entropy-based criterion requires only histogram statistics and entropy computation over historical state trajectories, resulting in linear time complexity. Consequently, the optimization process can be synchronized with the control loop, enabling online and incremental adaptation of the Q and R matrices. Furthermore, because trajectory entropy is defined over the probability distribution of system trajectories, stochastic perturbations—such as Gaussian noise commonly observed in industrial systems—are statistically smoothed during entropy estimation. This property mitigates premature convergence to local optima and enhances robustness against noise. Therefore, the proposed method simultaneously exhibits four essential characteristics: model-free implementation, global performance evaluation, real-time adaptability, and robustness to disturbances. Unlike conventional adaptive strategies that optimize local performance indices or merely automate manual parameter tuning, this approach constructs a scientifically grounded optimization criterion derived from the global dynamic behavior of the system. This distinction constitutes its core theoretical contribution relative to existing adaptive control methods. 
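The linear-time entropy computation referred to above can be illustrated as follows. The exact estimator used in the paper is not reproduced here; this sketch assumes a Shannon entropy over a fixed-range histogram of the state trajectory, with the bin count and value range (`bins`, `value_range`) being our own choices:

```python
import numpy as np

def trajectory_entropy(traj, bins=20, value_range=(-4.0, 4.0)):
    """Shannon entropy of a trajectory's empirical distribution.

    Both the histogram construction and the entropy sum are linear in
    the number of samples, matching the complexity argument in the text.
    A common value range is used so that trajectories are comparable.
    """
    traj = np.asarray(traj).ravel()
    counts, _ = np.histogram(traj, bins=bins, range=value_range)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty bins (0 log 0 := 0)
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(1)
tight = rng.normal(0.0, 0.1, 1000)    # well-regulated trajectory
spread = rng.normal(0.0, 1.0, 1000)   # poorly regulated trajectory
print(trajectory_entropy(tight), trajectory_entropy(spread))
```

A tightly regulated trajectory concentrates its mass in a few bins and thus yields a lower entropy than a widely spread one, which is the property the meta-agent's entropy-descent criterion exploits. Averaging over many samples also smooths out Gaussian noise, consistent with the robustness argument above.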
As illustrated in Figure 12, compared with PSO and ACO, the proposed method achieves faster training convergence and demonstrates improved stability after system convergence.

5. Conclusions

This paper proposes a novel model-free HRL–based LQR control framework with adaptive weight selection to overcome the reliance of conventional LQR methods on known system models. The proposed approach employs a two-level learning architecture, in which a high-level meta-agent adaptively optimizes the LQR weighting matrices through entropy-based trajectory evaluation, while a low-level base-agent performs model-free policy iteration to update the state-feedback control law under unknown system dynamics. By decoupling weight optimization from control-law learning, the framework enables simultaneous adaptation of the cost-function parameters and the feedback gain without explicit model information. Numerical simulations demonstrate that the proposed HRL-based LQR method achieves effective control performance, stable learning behaviour, and reliable convergence in model-free environments. In several scenarios, the proposed approach attains performance comparable to, or exceeding, that of model-based LQR controllers with manually tuned parameters.

Author Contributions

Conceptualization, Y.Z. (Yong Zhang); Methodology, X.Y.; Software, X.Y.; Formal analysis, Y.Z. (Yuyang Zhou); Investigation, W.Y.; Resources, Y.Z. (Yuyang Zhou); Data curation, X.Y.; Writing—original draft, X.Y.; Writing—review & editing, Y.Z. (Yong Zhang); Visualization, W.Y. and Y.Z. (Yuyang Zhou); Supervision, Y.Z. (Yuyang Zhou); Project administration, Y.Z. (Yuyang Zhou); Funding acquisition, Y.Z. (Yong Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62263026, the Fundamental Research Funds for Inner Mongolia University of Science and Technology under Grant 2024QNJS003, the Inner Mongolia Natural Science Foundation under Grant 2025MS06024, the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region under Grant NJYT25032, and the Inner Mongolia Autonomous Region Control Science and Engineering Quality Improvement and Cultivation Discipline Construction Project.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Comparison of LQR control performance under different Q and R settings.
Figure 2. The HRL framework diagram.
Figure 3. Algorithmic program flowchart.
Figure 4. State evolution of the second-order system.
Figure 5. Comparison of HRL-LQR and conventional LQR methods in a second-order system.
Figure 6. Comparison of second-order system state error trajectories.
Figure 7. Comparison of historical changes in the value function of second-order systems.
Figure 8. State evolution of the third-order system.
Figure 9. Comparison of HRL-LQR and conventional LQR methods in a third-order system.
Figure 10. Comparison of third-order system state error trajectories.
Figure 11. Comparison of historical changes in the value function of third-order systems.
Figure 12. Multiple benchmark methods are employed for comparative evaluation.