Article

Wide-Range Variable Cycle Engine Control Based on Deep Reinforcement Learning

Yaoyao Ding, Fengming Wang, Yuanwei Mu and Hongfei Sun
1 School of Aerospace Engineering, Xiamen University, Xiamen 361102, China
2 Aero Engine Academy of China, Beijing 101304, China
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(5), 424; https://doi.org/10.3390/aerospace12050424
Submission received: 23 March 2025 / Revised: 6 May 2025 / Accepted: 7 May 2025 / Published: 10 May 2025

Abstract
In this paper, a controller design method based on deep reinforcement learning is proposed for a wide-range variable cycle engine with an interstage turbine mixed architecture. Conventional PID controllers are limited by their single-input single-output structure, low regulation efficiency, and poor adaptability when confronted with the complex, multi-variable operating conditions of contemporary variable cycle engines. To solve this problem, this paper adopts a deep reinforcement learning method based on the deep deterministic policy gradient algorithm and applies an action space pruning technique to optimize the controller, which significantly improves the convergence speed of network training. To verify the control performance, two typical flight conditions are selected for simulation experiments: H = 0 km, Ma = 0 and H = 10 km, Ma = 0.9. A comparison of the simulation results shows that the proposed deep reinforcement learning controller effectively addresses the engine’s multi-variable coupling control problem. In addition, it reduces response time by an average of 44.5% while maintaining an overshoot level similar to that of the PID controller.

1. Introduction

The Variable Cycle Engine (VCE) is a revolutionary aircraft engine that utilizes a sophisticated system to regulate its thermodynamic parameters. This system involves the manipulation of the engine’s variable geometry components, altering their shape, size, and position. The primary objective of this innovation is to expand the engine’s operational speed range, ensuring optimal performance across a wide spectrum of flight regimes, including subsonic, transonic, supersonic, and even hypersonic speeds. In comparison to conventional engines, variable cycle engines are characterized by higher specific thrust, greater circulation capacity, and a broader operational speed range. For this reason, they are a primary focus of aerospace engineering research [1,2].
Since the introduction of the VCE concept by General Electric in the 1960s, significant advancements have been made in the field of variable cycle engines [3]. Typical VCE configurations include the single-bypass VCE [4], the double-bypass VCE with a core-driven fan stage [5,6], and the triple-bypass VCE [7,8]. Compared with multi-bypass VCEs, the single-bypass VCE has relatively few variable geometry components. Among single-bypass variable cycle engines, the interstage turbine mixed architecture (ITMA) has received particular attention due to its unique advantages. A comparative study in the literature [9] shows that this configuration exhibits a 12.75% reduction in fuel consumption at subsonic cruising speeds, accompanied by exceptional propulsive efficiency. Consequently, it is anticipated to serve as one of the main architectures for the next generation of fighter variable cycle engines.
The integration of variable geometry components within the ITMA architecture has been demonstrated to enhance performance, yet it concomitantly introduces challenges related to control. A systematic study in the literature [10] points out that there are strong nonlinear dynamic characteristics and strong coupling effects when the engine operating conditions change. These phenomena result in the performance limitations of traditional control methods, such as PID control. To address this challenge, artificial intelligence has been shown to have unique advantages. The successful application of neural networks in robot control has been documented in the literature [11,12]. The breakthrough of AI technology in UAV anti-jamming has been demonstrated in the literature [13,14]. The innovative practice of deep reinforcement learning in multi-objective optimization for autonomous driving has been described in the literature [15,16]. In particular, the efficacy of reinforcement learning technology, a field of artificial intelligence that is currently experiencing significant research activity, in addressing multi-constraint problems of nonlinear systems has been substantiated through theoretical analyses documented in the extant literature [17]. The deep reinforcement learning (DRL) framework proposed in the literature provides a novel paradigm for complex control systems [18,19].
As demonstrated in the extant literature, the DRL method is a highly adaptable intelligent control method. It is particularly suitable for aero-engine systems that exhibit strong nonlinearity, multi-variable coupling, and high-dimensional dynamic characteristics [20]. Consequently, numerous scholars have employed this methodology in the domain of aero-engine control, attaining noteworthy outcomes. For example, Zheng et al. [21] proposed a control system for a conventional turbofan engine designed using an online Q-learning algorithm, with an online sliding-window deep neural network used to estimate the action value function. Liu et al. [22] investigated the reinforcement learning control of an air-breathing hypersonic vehicle with a variable geometry inlet based on a barrier Lyapunov function. The transition process control of turbofan engines was addressed by Fang et al. [23], who applied deep reinforcement learning across the entire flight envelope through similarity transformation methods. Tao et al. [24] designed a multi-variable control law for a variable cycle engine and employed deep reinforcement learning algorithms to optimize the control law online, resulting in a fuel consumption rate lower than that achieved by traditional optimization algorithms. Gao et al. [25] employed the Deep Deterministic Policy Gradient (DDPG) algorithm to construct a controller for the transition process of a typical turbofan engine; additionally, a complementary integrator was introduced to mitigate the steady-state error caused by the approximation error of the deep neural network. However, the majority of the aforementioned DRL research concerns conventional turbofan and turbojet engines, and comparatively little research has focused on the implementation of DRL in complex control systems for wide-range variable cycle engines.
In summary, DRL has become an important research direction in the field of aero-engine control, owing to its ability to address complex nonlinearities, multi-variable coupling, and high-dimensional dynamic characteristics. Specifically, the ITMA engine architecture demonstrates considerable promise; however, the implementation of DRL control methodologies in complex variable cycle engine control systems remains under-explored. In this paper, a DRL controller is developed for the ITMA engine control problem. This controller is based on the deep deterministic policy gradient algorithm and an action space pruning method. These techniques overcome the single-input single-output limitation of the PID controller as well as the coupling between control quantities. At the same time, the controller improves the response speed of the engine during acceleration and deceleration.
The primary contributions of this paper are as follows:
  • A controller design methodology based on deep reinforcement learning is proposed to address the complex control problem of a wide-range variable cycle engine with an interstage turbine mixed architecture.
  • The Deep Deterministic Policy Gradient (DDPG) algorithm is employed to effectively address the nonlinearity, multi-variable coupling, and high-dimensional dynamic characteristics of variable cycle engines.
  • Combined with the action space pruning technique, the performance of the controller is optimized and the convergence speed of training is improved. The efficacy of the method in addressing multi-variate coupling problems is substantiated by simulation verification.
The rest of the paper is organized as follows: Section 2 introduces the preliminary knowledge and the mathematical description of the problem in this paper; Section 3 combines the methods of deep reinforcement learning and action space pruning to design the DRL controllers; Section 4 gives the corresponding simulation results and their analysis; and Section 5 concludes the paper.

2. Preliminary Knowledge and Problem Description

2.1. Study Objects

Figure 1 shows a wide-range variable cycle engine of the interstage turbine mixed architecture (ITMA), consisting of an inlet duct, fan, compressor, turbine, combustion chamber, and tailpipe. The architecture employs a multi-stage low-pressure turbine configuration to achieve a higher low-pressure turbine expansion ratio, thereby improving engine airflow and reducing fuel consumption. Furthermore, the single-duct configuration of the ramjet stage, in conjunction with the separate exhaust architecture of the inner and outer bypass ducts, effectively decouples the fan from the nozzle; increasing the fan speed at high Mach numbers enables the fan to operate at a more favorable operating point. Finally, an internal and external bleed air intake area is employed in the interstage turbine mixed architecture, which mitigates mixing losses and regulates the low-pressure turbine airflow to enhance its efficiency. This architecture has been shown to enhance propulsive efficiency during subsonic flight, and it demonstrates excellent performance during supersonic cruise [9]. A description of the symbols used in this paper for the ITMA engine is given in Table A1.
Engine control is achieved by adjusting several adjustable variables during the operation of the ITMA wide-range variable cycle engine. These variables are the main combustion chamber fuel flow rate w_f, the afterburner fuel flow rate w_f,after, the guide vane angle of the high-pressure compressor α_C, the nozzle throat area A_8, the bypass nozzle throat area A_8,outer, and the internal and external bleed air intake area A_mix. The range of variation of these variables is given in Table 1.
The ITMA engine control system is designed to ensure that the pilot can move the throttle stick without restriction over the entire engine operating envelope without causing surge, over-temperature, over-rotation, or exceedance of the operating limits. To accomplish this, the reference command values corresponding to each throttle stick position must be calculated. The engine operates in the full speed domain, and the bypass ratio is approximately 0.6 under most working conditions; as a result, the low-pressure relative speed n_RL is designated as the first controlled quantity. Secondly, the nozzle throat area A_8 exerts a direct influence on the flow rate, thrust output, and efficiency of the core engine, so the core pressure drop ratio π is selected as the second controlled quantity. The regulation of the bypass nozzle of the ITMA engine is also crucial: it both prevents the occurrence of the surge phenomenon and directly affects the engine’s thrust and fuel consumption rate, so the outer bypass boost ratio π_B is the third controlled quantity. Therefore, indirect control of the thrust is realized by controlling n_RL, π, and π_B. These variables are defined as follows:
n_{RL} = \frac{n_L / \sqrt{T_{t2}/288.15}}{\left( n_L / \sqrt{T_{t2}/288.15} \right)_d}
\pi = \frac{P_{t6}}{P_{s31}}
\pi_B = \frac{P_{t16}}{P_{t0}}
where n_L is the fan speed, T_t2 is the engine intake temperature, P_t6 is the post-turbine pressure, P_t16 is the pre-afterburner-chamber pressure, P_s31 is the pressure after the high-pressure compressor, and P_t0 is the atmospheric static pressure. The subscript d denotes the operating point on the compressor characteristic curve.
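To make the definitions above concrete, the following minimal Python sketch computes the three controlled quantities from measured engine parameters. The function name, argument names, and the numerical values in the example call are illustrative assumptions, not data from the paper.

```python
import math

T_STD = 288.15  # sea-level standard temperature, K

def controlled_quantities(n_L, T_t2, P_t6, P_s31, P_t16, P_t0, n_corr_d):
    """Return (n_RL, pi, pi_B) from measured engine quantities.

    n_corr_d is the corrected fan speed at the reference operating point on the
    compressor characteristic (the subscript d above); here it is simply an input.
    """
    n_corr = n_L / math.sqrt(T_t2 / T_STD)   # corrected low-pressure speed
    n_RL = n_corr / n_corr_d                 # low-pressure relative speed
    pi = P_t6 / P_s31                        # core pressure drop ratio
    pi_B = P_t16 / P_t0                      # outer bypass boost ratio
    return n_RL, pi, pi_B

# Illustrative call with made-up numbers (not engine data from the paper):
print(controlled_quantities(9500.0, 300.0, 3.0e5, 9.0e5, 1.4e5, 1.013e5, 10000.0))
```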

2.2. Overview of Deep Reinforcement Learning

Reinforcement learning aims to learn the optimal policy during the continuous interaction between the agent and the environment so as to generate the best sequence of actions for obtaining the maximum reward. In this process, the main components are the agent and the environment model. In the framework shown in Figure 2, the environment is a place where the agent’s actions take effect and generate rewards and observations. The agent consists of a policy and a learning algorithm. The policy is a function that maps observations to actions, and the learning algorithm is an optimization method used to find the policy.
The Markov Decision Process (MDP) is a theoretical framework for reinforcement learning. Based on the observed environment state s_t ∈ S at a discrete time t, the agent selects a corresponding action a_t ∈ A(s), where S and A(s) are the set of states and the set of actions, respectively. After executing the action, the agent observes the new state s_{t+1} and the reward r_{t+1} ∈ R at the next time step. From this information, a history trajectory h_t = [s_0, a_0, r_1, s_1, a_1, ..., r_t, s_t]^T can be obtained, and the state transition probability at time t + 1 is given by
P(s'|s, a) = \Pr\left( s_{t+1} = s' \mid s_t = s, a_t = a \right)
The discounted return is defined as
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
where γ ∈ [0, 1] is the discount factor, which indicates how strongly future rewards are discounted relative to current rewards; a larger discount factor places greater emphasis on future rewards.
The policy π(a|s) determines the selection probability of the next action based on the current state s_t = s. Its computation is based on the state value function V^π(s) and the action value function Q^π(s, a). The state value function V^π(s) is defined as the expected cumulative reward when starting from state s and following policy π:
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^{\pi}(s') \right]
The expected reward when taking action a in state s and then acting according to policy π is the action value function Q^π(s, a):
Q^{\pi}(s,a) = \mathbb{E}_{s' \sim P(\cdot|s,a)} \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a') \right]
The ultimate goal of reinforcement learning is to find an optimal policy π* that maximizes the cumulative reward obtained by following that policy from any state. The optimal policy satisfies the following:
\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)
The optimal value functions under the optimal policy can be expressed as follows:
V^{*}(s) = \max_{\pi} V^{\pi}(s)
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)
DRL combines deep learning and reinforcement learning by approximating the value functions, policy functions, and so on with deep neural networks, in order to cope with more complex environments and tasks. The difference is that in classical reinforcement learning the value or policy functions are represented by tabular methods or linear models, whereas DRL approximates these functions with deep neural networks.
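To make this distinction concrete, the short sketch below contrasts a tabular action-value representation with a neural-network approximation of Q(s, a). It is a generic illustration in PyTorch; the dimensions and layer sizes are arbitrary and are not taken from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

# Classical RL: the action-value function is stored explicitly,
# one table entry per (state, action) pair.
n_states, n_actions = 10, 4
Q_table = np.zeros((n_states, n_actions))

# DRL: Q(s, a) is approximated by a neural network, so continuous and
# high-dimensional states and actions can be handled.
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

q_net = QNetwork(state_dim=8, action_dim=3)
s, a = torch.zeros(1, 8), torch.zeros(1, 3)
print(Q_table[0, 1], q_net(s, a).item())
```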

2.3. Description of the Problem

The engine thrust demand is the reference command for the main control loop, which is obtained from the throttle stick position set by the pilot. Because the engine thrust is not directly measurable in practical applications, indirect control of the thrust is realized by controlling n_RL, π, and π_B in conjunction with the performance requirements of engine operation.
In addition, modern aero-engines require the control system to keep the engine operating state stable and close to optimal while adhering to the safety constraints on aerodynamic stability and strength. These constraints are as follows: the high- and low-pressure relative rotational speeds must not exceed 105%, the pre-turbine temperature must not exceed 1750 K, and the surge margin of the high-pressure compressor, denoted by SM_C, and the surge margin of the fan, denoted by SM_F, must each be maintained at a minimum of 5%. Consequently, the objective of this study is to design a controller that addresses the following problems:
  • (2.3a) Speed tracking control: Ensure that the tracking error of the low-pressure relative speed eventually converges to zero.
  • (2.3b) Core pressure drop ratio tracking control: Ensure that the tracking error of the core pressure drop ratio eventually converges to zero.
  • (2.3c) Outer bypass boost ratio tracking control: Ensure that the tracking error of the outer bypass boost ratio eventually converges to zero.
  • (2.3d) Limiting protection control: Ensure that no over-temperature, over-rotation, or surge occurs under any flight condition.
We define the tracking error of the engine as follows:
\delta_{n_{RL}} = \left| n_{RL,\mathrm{cmd}} - n_{RL} \right|, \quad \delta_{\pi} = \left| \pi_{\mathrm{cmd}} - \pi \right|, \quad \delta_{\pi_B} = \left| \pi_{B,\mathrm{cmd}} - \pi_B \right|
where δ_nRL, δ_π, and δ_πB denote the deviations of n_RL, π, and π_B from their respective target commands.
The mathematical description of the specific problem is as follows. We design feasible control laws that allow the engine to be operated in different flight environments by adjusting μ_t = [w_f, A_8, A_8,outer, w_f,after, α_C, A_mix]^T, so that the following holds:
\lim_{t \to \infty} \delta_{n_{RL}} = 0, \quad \lim_{t \to \infty} \delta_{\pi} = 0, \quad \lim_{t \to \infty} \delta_{\pi_B} = 0,
and
SM_C \ge 5\%, \quad SM_F \ge 5\%, \quad T_{t41} \le 1750\ \mathrm{K}, \quad n_{RH} \le 1.05, \quad n_{RL} \le 1.05.
The thrust tracking control problem of the ITMA engine is thereby transformed into a multi-variable tracking control problem with the aforementioned constraints. To solve this problem, this paper adopts a DRL approach, exploiting its ability to learn optimal control strategies.
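The sketch below illustrates problems (2.3a)–(2.3d) in code: it evaluates the three tracking errors and checks the safety limits of Section 2.3 at a single operating point. The function names and the numbers in the example call are illustrative assumptions.

```python
def tracking_errors(n_RL, pi, pi_B, n_RL_cmd, pi_cmd, pi_B_cmd):
    """Absolute tracking errors for the three controlled quantities."""
    return abs(n_RL_cmd - n_RL), abs(pi_cmd - pi), abs(pi_B_cmd - pi_B)

def within_limits(SM_C, SM_F, T_t41, n_RH, n_RL):
    """Safety constraints of Section 2.3: surge margins, temperature, speeds."""
    return (SM_C >= 0.05 and SM_F >= 0.05 and
            T_t41 <= 1750.0 and n_RH <= 1.05 and n_RL <= 1.05)

# Illustrative check with made-up values:
print(tracking_errors(0.98, 1.9, 2.1, 1.00, 2.0, 2.2))
print(within_limits(SM_C=0.12, SM_F=0.10, T_t41=1600.0, n_RH=0.99, n_RL=0.98))
```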

3. Design of the Deep Reinforcement Learning Controller

3.1. Control System Structure

The controller outputs of the ITMA engine are six variables, namely, the main combustion chamber fuel flow w_f, the afterburner fuel flow w_f,after, the guide vane angle of the high-pressure compressor α_C, the nozzle throat area A_8, the bypass nozzle throat area A_8,outer, and the internal and external bleed air intake area A_mix.
In the engine control process, we employ a hierarchical control architecture to precisely regulate the key parameters. The core control loop achieves tracking control of the low-pressure relative rotational speed n_RL, the core pressure drop ratio π, and the outer bypass boost ratio π_B by adjusting w_f, A_8, and A_8,outer. Simultaneously, α_C, w_f,after, and A_mix are regulated to keep the engine within its safe operating limits. Therefore, w_f, A_8, and A_8,outer are controlled in a closed-loop manner, with the DRL control strategy primarily focusing on learning and optimizing w_f, A_8, and A_8,outer. The overall control block diagram is shown in Figure 3.
The afterburner has specific operating conditions: it only activates when the ambient pressure reaches a preset threshold and the throttle lever angle PLA exceeds 75 deg. The afterburner fuel flow w_f,after is obtained by applying correction algorithms based on the Mach number, pressure, temperature, and other quantities to a base fuel flow.
The control of α_C is implemented using a lookup table based on the corrected high-pressure speed, with the control law shown in Figure 4a. The control system monitors the corrected speed in real time and uses one-dimensional interpolation to obtain α_C from the predefined curve.
A_mix is mainly used at high Mach numbers and in extreme flight cases, and it is determined by the Mach number; the specific functional relationship is shown in Figure 4b.
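A minimal sketch of these two open-loop schedules is given below: one-dimensional interpolation of α_C over the corrected high-pressure speed and of A_mix over the flight Mach number, mirroring Figure 4. The breakpoint values are placeholders, since the actual curves are only given graphically in the paper.

```python
import numpy as np

# Placeholder breakpoints; the real schedules are the curves of Figure 4.
N_CH_PTS    = np.array([0.70, 0.80, 0.90, 1.00, 1.05])  # corrected HP speed (relative)
ALPHA_C_PTS = np.array([40.0, 30.0, 15.0, 5.0, 0.0])    # guide vane angle, deg

MACH_PTS  = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A_MIX_PTS = np.array([0.0, 0.0, 0.01, 0.03, 0.05])      # m^2

def alpha_C_schedule(n_CH):
    """Guide vane angle from the corrected high-pressure speed (1-D interpolation)."""
    return float(np.interp(n_CH, N_CH_PTS, ALPHA_C_PTS))

def A_mix_schedule(mach):
    """Internal and external bleed air intake area as a function of Mach number."""
    return float(np.interp(mach, MACH_PTS, A_MIX_PTS))

print(alpha_C_schedule(0.95), A_mix_schedule(2.5))
```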

3.2. Designing the Agent

3.2.1. Agent Structure

In this paper, the Deep Deterministic Policy Gradient (DDPG) algorithm [26] is utilized to design the agent. The structure of the DDPG is schematically shown in Figure 5, which consists of a target network and a training network. Both networks belong to the Actor–Critic architecture, where the Actor computes the actions of the agent based on the state of the environment through a policy gradient and decides which actions to perform in the current state to obtain positive rewards. The Critic scores the actions performed by the agent and computes a value function based on the actions, which affects the probability distribution of the actions.
The DDPG contains four neural networks, namely, the Actor network, the Critic network, the Target Actor network, and the Target Critic network. The parameters of the Critic network are denoted by θ_ω and the parameters of the Actor network are denoted by θ_μ. The Actor network is used to output the action, and the Critic network is used to estimate the Q-value of the action, i.e., Q(s, a|θ_ω). The Actor network then calculates the gradient based on the Q-value to adjust its action output strategy, updating the Actor network parameters θ_μ. The Critic network is optimized by fitting the target value formed from the reward and the Q-value of the next step, i.e., Q′, so that the output of the Critic network approximates the target value. However, Q′ in the target value is itself a prediction, so the target value is unstable. Therefore, the DDPG constructs a Target Actor network and a Target Critic network, where the Target Critic network calculates Q′ in the target value, and the next action a′ needed by Q′ is output by the Target Actor network.
(1) Network structure
The Actor network structure is shown in Figure 6. Considering the variation of multiple variables, the network structure is determined as one input layer, four hidden layers, and one output layer. The input layer receives the state inputs, and the hidden layers complete the mapping from state to action. We choose ReLU as the activation function; as argued in the literature [27], the ReLU function can effectively improve the sparsity of the network, mitigate overfitting, and avoid the vanishing gradient phenomenon. Considering that the controller output has to satisfy certain physical constraints, we refer to the research results in the literature [28] and adopt the tanh activation function in the output layer, whose smoothness and bounded output range of [−1, 1] ensure that the generated control instructions always remain within the physically realizable range.
The Critic network structure is shown in Figure 7. The Critic network is designed to evaluate the value of a particular action by outputting a Q-value, using the current state and the action as inputs. The Critic network consists of a state path and an action path. The state path is structured with one input layer, two hidden layers, and one output layer to capture the relevant features of the input state. The action path is designed with one input layer, three hidden layers, and one output layer, and the ReLU activation function is used to handle the high-dimensional action information and the coupling relationships among the components.
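The following PyTorch sketch assembles the Actor and the two-path Critic described above, using the layer counts and node numbers of Table 2. It is an assumed implementation: the paper does not provide code, and in particular the way the state and action paths are merged inside the Critic is a design choice of this sketch.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 8, 3  # Table 5: 8 observations, 3 actions

class Actor(nn.Module):
    """Four hidden layers (30, 30, 20, 20), ReLU inside, tanh output in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 30), nn.ReLU(),
            nn.Linear(30, 30), nn.ReLU(),
            nn.Linear(30, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
            nn.Linear(20, ACT_DIM), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """State path (30, 20) and action path (20, 20, 20); summing the two path
    features before the final Q output is an assumption of this sketch."""
    def __init__(self):
        super().__init__()
        self.state_path = nn.Sequential(
            nn.Linear(OBS_DIM, 30), nn.ReLU(),
            nn.Linear(30, 20), nn.ReLU(),
        )
        self.action_path = nn.Sequential(
            nn.Linear(ACT_DIM, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
        )
        self.out = nn.Linear(20, 1)

    def forward(self, s, a):
        return self.out(torch.relu(self.state_path(s) + self.action_path(a)))

actor, critic = Actor(), Critic()
s, a = torch.zeros(1, OBS_DIM), torch.zeros(1, ACT_DIM)
print(actor(s).shape, critic(s, a).shape)
```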
(2) Hyper-parameters
Based on the network structure design described above, the key hyper-parameters of the network are shown in Table 2, which lists the settings of key network parameters in the DRL controller, including the number of hidden layers, the number of nodes in each layer, the learning rate, and the soft update rate for both the Actor network and the Critic network. In addition, the size of the Replay Buffer, the batch size, and the discount factor are specified.

3.2.2. Principles of Policy Optimization

The DDPG contains a total of four neural networks. The Target Actor network predicts the next action a′ = π(s_{t+1}|θ_μ′), and the Target Critic network computes the target Q-value Q′(s_{t+1}, π(s_{t+1}|θ_μ′)|θ_ω′), which in turn gives the target value:
y_i = r_i + \gamma Q' \left( s_{t+1}, \pi(s_{t+1}|\theta_{\mu}') \mid \theta_{\omega}' \right)
The Critic network outputs the Q-value of the state-action pair at the current moment, Q(s_t, a_t|θ_ω), which is used to evaluate how good the current strategy is. The parameters of the Critic network are updated by minimizing the mean-square-error loss, with the loss function for the Critic update defined as follows:
L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta_{\omega}) \right)^2
where a_i = π(s_i|θ_μ). Using the Q-value from the Critic network, the Actor network updates its parameters by gradient ascent to maximize the expected reward. The policy gradient used in the update can be expressed as follows:
\nabla_{\theta_{\mu}} J(\theta_{\mu}) = \frac{1}{N} \sum_{i} \nabla_{\theta_{\mu}} \pi(s_i \mid \theta_{\mu}) \, \nabla_{a} Q(s_i, a \mid \theta_{\omega}) \Big|_{a = \pi(s_i)}
For the update of the target network parameters θ_μ′ and θ_ω′, the DDPG slowly tracks the parameters of the current networks through a soft update mechanism, thus following the changes of the training networks in a stable manner:
\theta_{\omega}' \leftarrow \tau \theta_{\omega} + (1 - \tau) \theta_{\omega}'
\theta_{\mu}' \leftarrow \tau \theta_{\mu} + (1 - \tau) \theta_{\mu}'
The DDPG algorithm combines the features of value-function-based and policy-based approaches, which allows deep reinforcement learning to deal with continuous action spaces while retaining some exploration capability. Therefore, the DDPG optimization process is shown in Algorithm 1.
Algorithm 1: DDPG pseudocode
Aerospace 12 00424 i001
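As a complement to the pseudocode, the sketch below spells out one DDPG update step following the equations above (target value, Critic loss, policy gradient, and soft update), using the learning rates, discount factor, and soft update rate of Table 2. The small stand-in networks and the dummy mini-batch are assumptions for illustration; this is not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 3
gamma, tau = 0.99, 0.001          # Table 2: discount factor and soft update rate
actor  = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-4)   # Table 2 learning rates
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(net, s, a):
    return net(torch.cat([s, a], dim=-1))

def ddpg_update(s, a, r, s_next):
    # Target value: y_i = r_i + gamma * Q'(s_{t+1}, pi'(s_{t+1}))
    with torch.no_grad():
        y = r + gamma * q(target_critic, s_next, target_actor(s_next))
    # Critic update: minimize the mean-square error between Q(s_i, a_i) and y_i
    critic_loss = F.mse_loss(q(critic, s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor update: gradient ascent on Q(s, pi(s)), i.e. minimize its negative
    actor_loss = -q(critic, s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

# Illustrative call on a dummy mini-batch of size 4:
N = 4
ddpg_update(torch.zeros(N, obs_dim), torch.zeros(N, act_dim),
            torch.zeros(N, 1), torch.zeros(N, obs_dim))
```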

3.3. Algorithm Setup

(1) Observations
Considering the aerodynamic stability conditions as well as the strength limitations during engine operation, eight quantities are selected as the engine state observation: y = [n_RH, n_RL, SM_C, SM_F, T_t41, δ_nRL, δ_π, δ_πB]^T.
(2) Action space construction
The DRL controller replaces the PID control loops for w_f, A_8, and A_8,outer, so the action space is represented as μ_t = [w_f, A_8, A_8,outer]^T.
The Action Space Pruning (ASP) method described in the literature [29] is adopted, and a customized pruning strategy is designed. There are various ways to implement an action space pruning strategy, such as empirical formulas, polynomial fitting, and neural network fitting. In this paper, the neural network fitting method is chosen for the design of the action space pruning strategy.
PID controller data acquisition is carried out first. Large-scale simulation experiments are performed under a variety of operating conditions, recording the altitude, Mach number, throttle stick angle, the target commands of the engine model, and the corresponding PID controller outputs of the key controllable parameters, such as the fuel flow and the nozzle areas. These data are organized into a structured dataset, which serves as the basis for subsequent neural network training.
These data were then fitted using a Backpropagation (BP) neural network to obtain a neural network that outputs controllable parameters based on the altitude, Mach number, and thrust commands, with specific neural network inputs and outputs as shown in Table 3.
Taking the output of the trained BP neural network as a benchmark, the action outputs obtained by the agent through the network computation are limited to ±30% of the benchmark value, thus reducing the range of exploration, increasing the convergence speed, and focusing more on effective strategies. As training progresses, the agent will make careful adjustments around the baseline values, thus improving the quality of the training data, reducing invalid or abnormal training data, and ensuring that the network better fits the target control strategy.
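A minimal sketch of this pruning step: the benchmark action from the fitted BP network defines a ±30% band, and the agent's raw action is clipped into that band. The benchmark and raw action values in the example are placeholders, not engine data.

```python
import numpy as np

def prune_action(raw_action, benchmark, band=0.30):
    """Clip the agent's action to within +/-30% of the BP-network benchmark.

    raw_action and benchmark are arrays of [w_f, A_8, A_8_outer].
    """
    lower = benchmark * (1.0 - band)
    upper = benchmark * (1.0 + band)
    return np.clip(raw_action, lower, upper)

# Illustrative values only: benchmark from the BP network, raw action from the
# DDPG actor after rescaling from [-1, 1] to physical units.
benchmark = np.array([0.80, 0.30, 0.20])   # kg/s, m^2, m^2
raw       = np.array([1.20, 0.28, 0.10])
print(prune_action(raw, benchmark))        # -> [1.04, 0.28, 0.14]
```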
(3) Reward function setting
In the whole control system, setting the reward function is equivalent to formulating the learning objectives for the agent. For the closed-loop control of the main combustion chamber fuel flow and the nozzle throat areas of the ITMA engine, the control objective is to minimize the error between the controlled variables and the target commands by adjusting the input μ_t = [w_f, A_8, A_8,outer, w_f,after, α_C, A_mix]^T of the aero-engine under different flight environments, while ensuring that the key performance parameters of the engine do not exceed their limiting values in extreme operating environments. Therefore, the reward function is set as follows.
r_1 = \begin{cases} 5, & \delta_{n_{RL}} < 0.0001 \\ 0, & 0 \le \delta_{n_{RL}} < 0.0003 \\ -1, & 0.0003 \le \delta_{n_{RL}} < 0.0005 \\ -50, & 0.0005 \le \delta_{n_{RL}} < 0.001 \\ -100, & \delta_{n_{RL}} \ge 0.001 \end{cases}
The first part is the speed error reward function, which is designed as a stepwise reward function to achieve the target speed. The smaller the error, the larger the reward value.
r_2 = \begin{cases} 2, & \delta_{\pi} < 0.001 \\ 0, & 0 \le \delta_{\pi} < 0.003 \\ -1, & 0.003 \le \delta_{\pi} < 0.005 \\ -50, & 0.005 \le \delta_{\pi} < 0.01 \\ -100, & \delta_{\pi} \ge 0.01 \end{cases}
r_3 = \begin{cases} 2, & \delta_{\pi_B} < 0.001 \\ 0, & 0 \le \delta_{\pi_B} < 0.003 \\ -1, & 0.003 \le \delta_{\pi_B} < 0.005 \\ -50, & 0.005 \le \delta_{\pi_B} < 0.01 \\ -100, & \delta_{\pi_B} \ge 0.01 \end{cases}
Similarly, the second and third parts are the error reward functions for the core pressure drop ratio and the outer bypass boost ratio, respectively, for which the tracking accuracy requirements are lower.
r_4 = - \sum_{i} \omega_i \max \left( c_i(t) - c_{i,l}(t),\ 0 \right)
The total reward is given by the following:
r = r_1 + r_2 + r_3 + r_4
In the last term, c_i(t) denotes a key engine performance parameter, including the engine speeds, the surge margins, and the pre-turbine temperature, and c_{i,l}(t) is the corresponding limit value. ω_i is a constant weight associated with each critical performance parameter in the reward function; it is used to amplify the negative reward when a critical performance parameter (such as the engine speed, surge margin, or pre-turbine temperature) exceeds its limit, and ω_i = 500. When the critical performance parameters do not exceed their limit values, this term is zero and the control strategy is not affected. When one of the key performance parameters exceeds its limit, the environment generates a large negative reward, which drives the agent to avoid actions that violate the limits.
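A sketch of the reward computation described above. The bracket edges and step values follow the equations; the dictionary-based interface and the handling of minimum-type limits (the surge margins, where a violation means falling below the limit) are assumptions of this sketch consistent with the surrounding text.

```python
def step_reward(err, bounds, values):
    """Stepwise reward: values[i] applies when err < bounds[i]; the last value
    applies once err exceeds all bounds. len(values) == len(bounds) + 1."""
    for bound, value in zip(bounds, values):
        if err < bound:
            return value
    return values[-1]

# Bracket edges and step values follow r1, r2, r3 above (speed error, core
# pressure drop ratio error, outer bypass boost ratio error).
def r1(err): return step_reward(err, [0.0001, 0.0003, 0.0005, 0.001], [5, 0, -1, -50, -100])
def r2(err): return step_reward(err, [0.001, 0.003, 0.005, 0.01], [2, 0, -1, -50, -100])
def r3(err): return step_reward(err, [0.001, 0.003, 0.005, 0.01], [2, 0, -1, -50, -100])

def r4(params, limits, weight=500.0, senses=None):
    """Limit-violation penalty r4 = -sum_i w_i * max(c_i - c_i_limit, 0).

    params/limits are dicts of the key performance parameters; senses marks
    minimum-type limits (surge margins), where a violation is limit - value.
    """
    senses = senses or {}
    penalty = 0.0
    for name, value in params.items():
        excess = (limits[name] - value) if senses.get(name) == "min" else (value - limits[name])
        penalty += weight * max(excess, 0.0)
    return -penalty

def total_reward(d_nrl, d_pi, d_pib, params, limits, senses):
    return r1(d_nrl) + r2(d_pi) + r3(d_pib) + r4(params, limits, senses=senses)

# Illustrative call (made-up operating point):
params = {"n_RH": 0.99, "n_RL": 0.98, "SM_C": 0.12, "SM_F": 0.10, "T_t41": 1600.0}
limits = {"n_RH": 1.05, "n_RL": 1.05, "SM_C": 0.05, "SM_F": 0.05, "T_t41": 1750.0}
senses = {"SM_C": "min", "SM_F": "min"}
print(total_reward(0.0002, 0.002, 0.004, params, limits, senses))
```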

3.4. Network Training

3.4.1. Overall Training Framework and Data Interaction

The network training architecture is shown in Figure 8, which presents the overall training framework of the DRL control system and its key components. The operation of the engine control system during network training is divided into two phases. The first phase is the engine startup phase, designated as the first 30 s in the mathematical simulation model; it is taken over by the conventional PID controller during training in order to eliminate the interference of the startup process with the DRL training. After successful startup, control is handed over to the DRL system, and training is performed based on the DDPG algorithm as follows: the Actor network outputs the action, the Critic network estimates the Q-value of the action, the Target Critic network calculates the Q-value in the target value, and the Target Actor network outputs the next desired action. In this process, the sample (s_t, a_t, r_t, s_{t+1}) obtained in the current step is put into the Replay Buffer. When the parameters are updated, N samples are randomly selected from the Replay Buffer to form a mini-batch, and the network parameters are updated strictly following the DDPG algorithm. In addition, in order to increase the robustness of the network, a certain amount of random noise is added to the output action.
The engine model and the agent interact with each other through data to realize the single-step update of the agent network parameters as well as the training optimization mentioned in Section 3.2.2, and the specific interaction process is as follows:
Firstly, we initialize the network parameters of the agent, determine the flight environment of the engine, and obtain the target command by calculation. The engine is started, and control is switched to the DRL controller after 30 s. From the target command and the engine output, the state input of the agent s_t is calculated; the benchmark value is obtained through the ASP strategy and the action is limited to within ±30% of the benchmark value; the action output obtained from training is fed into the ITMA engine model; the state input of the agent at the next step s_{t+1} is calculated and the reward value r_t is obtained from the pre-set reward function calculation module; and the experience sample (s_t, a_t, r_t, s_{t+1}) obtained in the current step is put into the Replay Buffer. Finally, the network parameters are updated: N samples are randomly drawn in mini-batches from the Replay Buffer, and the network parameters are updated according to the DDPG algorithm. The specific interaction is summarized in Table 4, and the training-related hyper-parameters are listed in Table 5.
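A compact sketch of this interaction loop: the PID controller handles the first 30 s of startup, after which the DRL controller acts with ASP clipping, exploration noise, replay storage, and mini-batch updates at the 0.02 s step of Table 5. The engine model, PID controller, and helper functions are placeholder interfaces assumed for illustration, not the authors' software.

```python
import random
from collections import deque

import numpy as np

# Assumed placeholder interfaces (not the authors' software): engine.measurements(),
# engine.step(u, dt), pid.step(meas, dt), actor_action(s), benchmark_net(s),
# compute_state(meas), compute_reward(meas), ddpg_update(batch).

DT, START_TIME, BATCH = 0.02, 30.0, 512   # Table 5 step size; 30 s PID startup; Table 2 batch
replay = deque(maxlen=1_000_000)          # Table 2 Replay Buffer size

def train_episode(engine, pid, actor_action, benchmark_net, compute_state,
                  compute_reward, ddpg_update, episode_len=60.0, noise_std=0.05):
    t, s = 0.0, None
    while t < episode_len:
        if t < START_TIME:
            u = pid.step(engine.measurements(), DT)              # PID handles startup
        else:
            s = compute_state(engine.measurements())             # 8-dim observation
            a = actor_action(s) + np.random.normal(0.0, noise_std, 3)  # exploration noise
            bench = benchmark_net(s)                             # ASP benchmark (BP network)
            u = np.clip(a, bench * 0.7, bench * 1.3)             # limit to +/-30% of benchmark
        engine.step(u, DT)
        if t >= START_TIME:
            s_next = compute_state(engine.measurements())
            r = compute_reward(engine.measurements())
            replay.append((s, u, r, s_next))                     # store experience sample
            if len(replay) >= BATCH:
                ddpg_update(random.sample(list(replay), BATCH))  # mini-batch DDPG update
        t += DT
```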

3.4.2. Training Results and Analysis

Figure 9 shows the reward value versus the number of training epochs. The horizontal axis represents the training epochs, and the vertical axis shows the reward in units of 10^5. It can be seen that training with the ASP technique (red solid line) converges faster than the base method (blue solid line). The following key conclusions can be drawn from the analysis: the pure DRL curve shows a slow upward trend and requires more training cycles to reach a steady state, whereas the combined DRL + ASP curve exhibits a steeper upward slope in the early training phase; that is, the ASP technique allows the system to reach the final performance level of pure DRL in about one-third of the training cycles.

4. Simulation Results and Analysis

Two typical working conditions—H = 0 km, Ma = 0; and H = 10 km, Ma = 0.9—are selected for the control system simulation test, and the results are as follows.

4.1. Simulation Results and Analysis for H = 0 km, Ma = 0

Figure 10 and Figure 11 show the PID control and DRL control simulations for H = 0 km, Ma = 0, respectively. In each figure, panel (a) shows the variation of the throttle stick; panels (b), (c), and (d) show the tracking of the low-pressure relative rotational speed, the core pressure drop ratio, and the outer bypass boost ratio, respectively; and panels (e) and (f) show the variation of the engine pre-turbine temperature, the high-pressure compressor surge margin, and the fan surge margin during the throttle stick changes in (a).
Table 6 shows the response time and overshoot under the PID and DRL methods for the PLA changes shown in Figure 10a. Based on the data in Table 6, the following conclusions can be drawn. The dynamic tracking of n_RL is shown in Figure 10b and Figure 11b, respectively; in each acceleration and deceleration phase, the DRL controller reduces the response time by an average of 43.65% while maintaining a level of overshoot comparable to PID for most operating conditions. Figure 10c and Figure 11c show the control of the core pressure drop ratio under the two controllers; overall, the difference in response time between the DRL controller and the PID controller is small, but the DRL controller has lower overshoot, especially during the rapid deceleration of the PLA from 30 to 20 deg, where it shows higher stability. The control of the outer bypass boost ratio is shown in Figure 10d and Figure 11d; overall, the response time of the DRL controller is generally shorter than that of the PID controller. In addition, panels (e) and (f) of Figure 10 and Figure 11 show the variation of T_t41 and the surge margins, respectively, where the red dashed lines indicate the maximum or minimum limit values; both controllers are able to keep the engine in a safe operating condition during acceleration and deceleration.

4.2. Simulation Results and Analysis for H = 10 km, Ma = 0.9

Figure 12 and Figure 13, respectively, show the control results of PID control and DRL control under the H = 10 km, Ma = 0.9 condition.
From Figure 12c,d, it can be seen that under the H = 10 km, Ma = 0.9 condition, the target pressure ratios remain unchanged for PLA above 65 deg, and, as shown in Figure 13a,b, the target relative speed of the engine remains constant once the PLA exceeds 90 deg. According to the data in Table 7, the actual values of the low-pressure relative speed and the pressure ratios under the PID controller fluctuate significantly, with several large excursions that deviate from the target values. The DRL controller reduces the response time by an average of 45.34%, with relatively smaller fluctuations.

5. Conclusions

In this paper, a control method based on deep reinforcement learning (DRL) is proposed for a wide-range variable cycle aero-engine with an interstage turbine mixed architecture (ITMA). By constructing a deep reinforcement learning control framework, it focuses on solving the limitations of PID control under multi-variable coupling conditions. Two typical flight conditions (H = 0 km, Ma = 0 and H = 10 km, Ma = 0.9) are selected for digital simulation comparison experiments, and the results are discussed.
The proposed DRL control method breaks through the single-input single-output (SISO) limitation of the traditional ITMA controller architecture and effectively solves the strong coupling problem among multiple control variables through the feature extraction capability of deep neural networks. In terms of action space design, an action space pruning method is adopted, which significantly reduces the ineffective exploration range of the agent and thereby markedly increases the convergence speed while ensuring the effectiveness of the strategy search.

The simulation data show that, compared with PID control, the DRL controller reduces the dynamic response time by an average of 44.5% while maintaining a similar level of overshoot, and it exhibits better dynamic tracking performance, especially in transition-state conditions.

The fast response and stable control accuracy of this control method are particularly suitable for the control needs of high-maneuverability military aero-engines. This study provides a new technical route for the DRL control of aero-engines, and subsequent studies will focus on algorithm optimization, including (1) introducing transfer learning to improve the control accuracy, (2) developing a lightweight network structure to improve the computational efficiency, and (3) expanding to more complex multi-task collaborative control scenarios in order to meet the needs of future engineering applications of adaptive cycle engines.

Author Contributions

Conceptualization, Y.D. and H.S.; methodology, Y.D., F.W. and Y.M.; software, Y.D.; validation, Y.D. and H.S.; formal analysis, Y.D.; investigation, H.S. and Y.D.; resources, H.S. and Y.D.; data curation, Y.D.; writing—original draft preparation, Y.D.; writing—review and editing, H.S. and Y.D.; visualization, Y.D.; supervision, H.S., F.W. and Y.M.; project administration, F.W., Y.M. and H.S.; funding acquisition, F.W., Y.M. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors. The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VCE: Variable Cycle Engine
ITMA: Interstage Turbine Mixed Architecture
DDPG: Deep Deterministic Policy Gradient
DRL: Deep Reinforcement Learning

Appendix A

Table A1. Notation.
Symbol | Adjustable Variable Name | Unit
A_8 | Nozzle throat area (inner) | m^2
A_8,outer | Nozzle throat area (outer) | m^2
A_mix | Internal and external bleed air intake area | m^2
n_L | Fan physical speed | r/min
n_CH | High-pressure corrected speed | r/min
n_RL | Low-pressure relative speed | \
n_RH | High-pressure relative speed | \
P_0 | Atmospheric static pressure | Pa
P_t2 | Engine intake pressure | Pa
P_s31 | Pressure after high-pressure compressor | Pa
P_t6 | Post-turbine pressure | Pa
P_t16 | Pre-combustor pressure (outer bypass) | Pa
PLA | Throttle lever angle | deg
SM_C | Surge margin of high-pressure compressor | \
SM_F | Surge margin of fan | \
T_t2 | Atmospheric static temperature | K
T_t25 | Temperature after fan | K
T_t41 | Pre-turbine temperature | K
w_f | Main combustion chamber fuel flow | kg/s
w_f,after | Afterburner fuel flow | kg/s
α_C | Guide vane angle of the high-pressure compressor | deg
π | Pressure ratio | \
B | Bypass ratio | \
cmd (subscript) | Target value | \

References

  1. Huang, X.; Chen, Y.; Zhou, H. Analysis on development trend of global hypersonic technology. Bull. Chin. Acad. Sci. 2024, 39, 1106–1120. [Google Scholar]
  2. Zhong, S.; Kang, Y.; Li, X. Technology Development of Wide-Range Gas Turbine Engine. Aerosp. Power 2023, 4, 19–23. [Google Scholar]
  3. Johnson, L. Variable Cycle Engine Developments at General Electric-1955-1995; AIAA: Reston, VA, USA, 1995; pp. 105–143. [Google Scholar]
  4. Mu, Y.; Wang, F.; Zhu, D. Simulation of variable geometry characteristics of single bypass variable cycle engine. Aeroengine 2024, 50, 52–57. [Google Scholar]
  5. Brown, R. Integration of a variable cycle engine concept in a supersonic cruise aircraft. In Proceedings of the AIAA/SAE/ASME 14th Joint Propulsion Conference, Las Vegas, NV, USA, 18–20 June 1978. [Google Scholar]
  6. Allan, R. General Electric Company variable cycle engine technology demonstrator program. In Proceedings of the AIAA/SAE/ASME 15th Joint Propulsion Conference, Las Vegas, NV, USA, 18–20 June 1979. [Google Scholar]
  7. Feng, Z.; Mao, J.; Hu, D. Review on the development of adjusting mechanism in variable cycle engine and key technologies. Aeroengine 2023, 49, 18–26. [Google Scholar]
  8. Zhang, Y.; Yuan, W.; Zou, T. Modeling technology of high-flow triple-bypass variable cycle engine. J. Propuls. Technol. 2024, 45, 35–43. [Google Scholar]
  9. Liu, B.; Nie, L.; Liao, Z. Overall performance of interstage turbine mixed architecture variable cycle engine. J. Propuls. Technol. 2023, 44, 27–37. [Google Scholar]
  10. Zeng, X.; Gou, L.; Shen, Y. Analysis and modeling of variable cycle engine control system. In Proceedings of the 11th International Conference on Mechanical and Aerospace Engineering (ICMAE), Athens, Greece, 18–21 July 2020. [Google Scholar]
  11. Wu, Y.; Yu, Z.; Li, C.; He, M. Reinforcement learning in dual-arm trajectory planning for a free-floating space robot. Aerosp. Sci. Technol. 2020, 98, 105657. [Google Scholar] [CrossRef]
  12. Gu, S.; Holly, E.; Lillicrap, T. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of the IEEE International Conference on Robotics & Automation, Singapore, 29 May–3 June 2017. [Google Scholar]
  13. Wada, D.; Araujo-Estrada, S.A.; Windsor, S. Unmanned aerial vehicle pitch control under delay using deep reinforcement learning with continuous action in wind tunnel test. Aerospace 2021, 8, 258. [Google Scholar] [CrossRef]
  14. Liu, Y.; Liu, H.; Tian, Y. Reinforcement learning based two-level control framework of UAV swarm for cooperative persistent surveillance in an unknown urban area. Aerosp. Sci. Technol. 2020, 98, 261–281. [Google Scholar] [CrossRef]
  15. Sallab, A.; Abdou, M.; Perot, E. End-to-End deep reinforcement learning for lane keeping assist. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  16. Kiran, B.; Sobh, I.; Talpaert, V. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 28, 4909–4926. [Google Scholar] [CrossRef]
  17. Mehryar, M.; Afshin, R.; Talwalkar, A. Reinforcement learning. In Foundations of Machine Learning; MIT Press: Cambridge, MA, USA, 2018; pp. 379–405. [Google Scholar]
  18. Mnih, V.; Kavukcuoglu, K.; Silver, D. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  19. Qiu, X. Deep Reinforcement Learning. In Foundations of Machine Learning; China Machine Press: Beijing, China, 2020; pp. 339–360. [Google Scholar]
  20. Francois-Lavet, V.; Henderson, P.; Islam, R. An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 2018, 11, 219–354. [Google Scholar] [CrossRef]
  21. Zheng, Q.; Jin, C.; Hu, Z. A study of aero-engine control method based on deep reinforcement learning. IEEE Access 2018, 6, 67884–67893. [Google Scholar] [CrossRef]
  22. Liu, C.; Dong, C.; Zhou, Z. Barrier Lyapunov function based reinforcement learning control for air-breathing hypersonic vehicle with variable geometry inlet. Aerosp. Sci. Technol. 2020, 96, 105537–105557. [Google Scholar] [CrossRef]
  23. Fang, J.; Zheng, Q.; Cai, C. Deep reinforcement learning method for turbofan engine acceleration optimization problem within full flight envelope. Aerosp. Sci. Technol. 2023, 136, 108228–108242. [Google Scholar] [CrossRef]
  24. Tao, B.; Yang, L.; Wu, D. Deep reinforcement learning-based optimal control of variable cycle engine performance. In Proceedings of the 2022 International Conference on Advanced Robotics and Mechatronics (ICARM), Guilin, China, 7–9 November 2022. [Google Scholar]
  25. Gao, W.; Zhou, X.; Pan, M. Acceleration control strategy for aero-engines based on model-free deep reinforcement learning method. Aerosp. Sci. Technol. 2022, 120, 107248–107260. [Google Scholar] [CrossRef]
  26. Silver, D.; Lever, G.; Heess, N. Deterministic Policy Gradient Algorithms. Proc. Mach. Learn. Res. 2014, 32, 387–395. [Google Scholar]
  27. Hahnloser, R.; Sarpeshkar, R.; Mahowald, M. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 2000, 405, 947–951. [Google Scholar] [CrossRef] [PubMed]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  29. Kanervisto, A.; Scheller, C.; Hautamäki, V. Action space shaping in deep reinforcement learning. In Proceedings of the 2020 IEEE Conference on Games (CoG), Osaka, Japan, 24–27 August 2020. [Google Scholar]
Figure 1. Schematic diagram of the engine structure.
Figure 2. Reinforcement learning schematic.
Figure 3. Structure of the ITMA controller.
Figure 4. Control law diagrams of α_C and A_mix: (a) α_C as a function of n_CH; (b) A_mix as a function of Ma.
Figure 5. Structure of the DDPG algorithm.
Figure 6. Actor network structure.
Figure 7. Critic network structure.
Figure 8. Network training structure diagram.
Figure 9. Training progress curve.
Figure 10. PID control simulation for H = 0 km, Ma = 0: (a) variation of the throttle stick; (b) tracking of the low-pressure relative rotational speed; (c) tracking of the core pressure drop ratio; (d) tracking of the outer bypass boost ratio; (e,f) variation of the engine pre-turbine temperature, high-pressure compressor surge margin, and fan surge margin during the throttle stick changes in (a).
Figure 11. DRL control simulation for H = 0 km, Ma = 0: (a) variation of the throttle stick; (b) tracking of the low-pressure relative rotational speed; (c) tracking of the core pressure drop ratio; (d) tracking of the outer bypass boost ratio; (e,f) variation of the engine pre-turbine temperature, high-pressure compressor surge margin, and fan surge margin during the throttle stick changes in (a).
Figure 12. PID control simulation results at H = 10 km and Ma = 0.9: (a) throttle lever angle variation; (b) low-pressure relative rotational speed tracking; (c) core pressure drop ratio tracking; (d) outer bypass boost ratio tracking; (e,f) variations of the engine pre-turbine temperature, high-pressure compressor surge margin, and fan surge margin during the throttle lever angle changes in (a).
Figure 13. DRL control simulation for H = 10 km and Ma = 0.9: (a) variation of the throttle stick; (b) tracking of the low-pressure relative rotational speed; (c) tracking of the core pressure drop ratio; (d) tracking of the outer bypass boost ratio; (e,f) variation of the engine pre-turbine temperature, high-pressure compressor surge margin, and fan surge margin during the throttle stick changes in (a).
Table 1. Range of variable components.
Notation | Definition | Range
w_f | Main combustion chamber fuel flow | 0–1.4 kg/s
w_f,after | Afterburner fuel flow | 0–6 kg/s
α_C | Guide vane angle of the high-pressure compressor | 0–40 deg
A_8 | Nozzle throat area | 0–0.45 m^2
A_8,outer | Bypass nozzle throat area | 0–0.3 m^2
A_mix | Internal and external bleed air intake area | 0–0.05 m^2
Table 2. Hyper-parameters.
Parameter | Value
Number of hidden layers in the Actor network | 4
Number of hidden layers in the Critic network | State: 2, Action: 3
Number of nodes in the Actor network | 30, 30, 20, 20
Number of nodes in the Critic network | State: 30, 20; Action: 20, 20, 20
Learning rate of the Actor | 0.0001
Learning rate of the Critic | 0.001
Soft update rate | 0.001
Replay Buffer size | 1,000,000
Number of samples N drawn from the Replay Buffer | 512
Discount factor γ | 0.99
Table 3. BP network input and output.
Notation | Parameter Name | Unit | Input/Output
H | Flight altitude | km | Input
Ma | Flight Mach number | \ | Input
n_RL,cmd | Low-pressure relative rotational speed command | \ | Input
π_cmd | Core pressure drop ratio command | \ | Input
π_B,cmd | Outer bypass boost ratio command | \ | Input
w_f | Main combustion chamber fuel flow | kg/s | Output
A_8 | Nozzle throat area | m^2 | Output
A_8,outer | Bypass nozzle throat area | m^2 | Output
Table 4. Data interaction.
Notation | Instructions | Data Flow
α_C | Open-loop control strategy | agent → model
w_f | DRL control strategy | agent → model
w_f,after | Open-loop control strategy | agent → model
A_8 | DRL control strategy | agent → model
A_8,outer | DRL control strategy | agent → model
A_mix | Open-loop control strategy | agent → model
P_t6 | Used to calculate the pressure ratio | model → agent
P_s31 | Used to calculate the pressure ratio and for safety constraints | model → agent
n_RL | Used for safety constraints | model → agent
n_RH | Used for safety constraints | model → agent
T_t41 | Used for safety constraints | model → agent
Table 5. Training-related parameters.
Parameter Name | Parameter Value
Dimensionality of observations | 8
Dimensionality of actions | 3
Training step | 0.02 s
Number of training epochs | 1000
Table 6. Comparative data for n_RL, π, and π_B at H = 0 km and Ma = 0.
Controlled Variable | PLA (deg) | t/s (PID) | Overshoot (PID) | t/s (DRL) | Overshoot (DRL) | Δt (%)
n_RL | 20–30 | 3.164 | 1.422 | 1.374 | 1.74 | 56.57
n_RL | 30–40 | 3.260 | 0.532 | 1.439 | 0 | 55.86
n_RL | 40–50 | 2.589 | 0 | 1.145 | 0 | 55.77
n_RL | 50–60 | 2.205 | 0 | 1.068 | 0 | 51.56
n_RL | 60–50 | 2.109 | 0 | 1.314 | 0.8 | 37.70
n_RL | 50–40 | 2.205 | 0 | 1.232 | 0.2 | 44.13
n_RL | 40–30 | 1.918 | 0 | 1.232 | 0 | 35.77
n_RL | 30–20 | 8.245 | 4.25 | 5.139 | 2.25 | 37.67
π | 20–30 | 3.147 | 9.19 | 1.746 | 9.76 | 44.64
π | 30–40 | 2.493 | 0 | 1.234 | 0 | 50.54
π | 40–50 | 2.301 | 0 | 1.204 | 0 | 47.87
π | 50–60 | 2.205 | 0 | 1.001 | 0 | 54.58
π | 60–50 | 2.589 | 0 | 1.356 | 0 | 47.72
π | 50–40 | 2.685 | 0 | 1.247 | 0 | 53.66
π | 40–30 | 2.876 | 0 | 1.332 | 0 | 53.75
π | 30–20 | 3.875 | 0 | 1.562 | 0 | 59.67
π_B | 20–30 | 3.931 | 0.2623 | 2.336 | 0.3016 | 40.56
π_B | 30–40 | 4.027 | 0.614 | 2.598 | 0.678 | 35.43
π_B | 40–50 | 3.452 | 0 | 2.356 | 0 | 31.8
π_B | 50–60 | 2.589 | 0 | 2.331 | 0 | 9.97
π_B | 60–50 | 3.356 | 0 | 2.547 | 0 | 24.1
π_B | 50–40 | 3.260 | 0 | 2.368 | 0 | 27.39
π_B | 40–30 | 5.465 | 0 | 2.896 | 0 | 47.13
π_B | 30–20 | 6.136 | 0.348 | 3.019 | 0.256 | 50.76
Table 7. Comparative data for n_RL, π, and π_B at H = 10 km and Ma = 0.9.
Controlled Variable | PLA (deg) | t/s (PID) | Overshoot (PID) | t/s (DRL) | Overshoot (DRL) | Δt (%)
n_RL | 60–75 | 3.164 | 0.344 | 1.515 | 0.2225 | 52.04
n_RL | 75–60 | 4.044 | 0.207 | 1.693 | 0.1563 | 58.19
n_RL | 60–90 | 2.869 | 1.429 | 1.837 | 0 | 35.97
n_RL | 90–115 | 3.45 | 1.225 | 2.872 | 0 | 16.75
π | 60–75 | 2.876 | 4.801 | 0.842 | 2.205 | 35.96
π | 75–60 | 2.780 | 4.845 | 1.304 | 2.106 | 53.03
π | 60–90 | 3.931 | 8.252 | 1.868 | 5.874 | 52.43
π | 90–115 | 3.547 | 5.667 | 1.658 | 3.265 | 53.22
π_B | 60–75 | 2.301 | 4.6709 | 1.056 | 3.356 | 54.06
π_B | 75–60 | 3.132 | 5.2516 | 1.754 | 3.386 | 43.99
π_B | 60–90 | 3.068 | 11.77 | 1.398 | 8.93 | 54.41
π_B | 90–115 | 3.356 | 4.15 | 2.209 | 2.25 | 34.18