Article

Deep Reinforcement Learning Based Energy Management Strategy for Vertical Take-Off and Landing Aircraft with Turbo-Electric Hybrid Propulsion System

Feifan Yu, Wang Tang, Jiajie Chen, Jiqiang Wang, Xiaokang Sun and Xinmin Chen
1 Research Center for Special Aircraft Systems Engineering Technology, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, China
2 Ningbo College of Materials Technology and Engineering, University of Chinese Academy of Sciences, Ningbo 315201, China
3 School of Computer Science, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, UK
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Aerospace 2025, 12(4), 355; https://doi.org/10.3390/aerospace12040355
Submission received: 24 March 2025 / Revised: 14 April 2025 / Accepted: 15 April 2025 / Published: 17 April 2025
(This article belongs to the Section Aeronautics)

Abstract
Due to the limited endurance of pure electric power, turbo-electric hybrid power systems, which offer a high power-to-weight ratio, present a reliable solution for medium- and large-sized vertical take-off and landing (VTOL) aircraft. Traditional energy management strategies often fail to minimize fuel consumption across the entire flight profile while meeting power demands under varying flight conditions. To address this issue, this paper proposes a deep reinforcement learning (DRL)-based energy management strategy (EMS) specifically designed for turbo-electric hybrid propulsion systems. The proposed strategy employs a Prior Knowledge-Guided Deep Reinforcement Learning (PKGDRL) method, which integrates domain-specific knowledge into the Deep Deterministic Policy Gradient (DDPG) algorithm to improve learning efficiency and enhance fuel economy. By narrowing the exploration space, the PKGDRL method accelerates convergence and achieves superior fuel and energy efficiency. Simulation results show that PKGDRL generalizes well across all operating conditions, with a fuel economy gap of only 1.6% from the offline optimization benchmark; in addition, the PKG module gives the DRL method a substantial improvement in both fuel economy and convergence rate. In particular, the prospect theory (PT) component of the PKG module improves fuel economy by 0.81%. Future research will extend PKGDRL to real-time total power prediction and adaptive energy management under complex operating conditions to further enhance the generalization capability of the EMS.

1. Introduction

Compared to traditional multi-rotor unmanned aerial vehicles (UAVs) and fixed-wing UAVs, electric vertical take-off and landing (eVTOL) [1] aircraft exhibit flexible aerodynamic configurations and higher propulsion efficiency. These characteristics open up broad application prospects in fields such as low-altitude general aviation and logistics transportation, making eVTOL a key research focus in the global aerospace industry. However, due to the limited energy density of aviation batteries, eVTOL aircraft currently face challenges such as short flight duration and limited payload capacity, making it difficult to meet the requirements of high-load and long-duration missions. In contrast, hybrid power systems combine the advantages of electric propulsion with the maturity of traditional aeroengines. This not only enables distributed propulsion configurations but also effectively compensates for the endurance limitations of pure electric systems. As a result, turbo-electric hybrid propulsion systems [2] have emerged as an ideal propulsion solution for vertical take-off and landing aircraft in application scenarios such as regional logistics transportation and extended rescue missions in complex environments (e.g., mountainous or maritime areas). Internationally, NASA has developed a strategic roadmap to integrate hybrid power systems into the aviation sector by 2050 [3]. General Electric [4] has successfully tested megawatt-class hybrid propulsion systems, and other organizations such as Raytheon, Electra [5], Safran, and VerdeGo have made significant progress in the development of hybrid propulsion systems. These technological advancements are laying the foundation for future air mobility solutions, with hybrid power systems expected to gradually achieve commercial application.
Energy management strategies (EMSs) are essential for optimizing the operation of turbo-electric hybrid propulsion systems. By employing advanced strategies to coordinate the use of aeroengine and battery power, an EMS ensures efficient energy use, minimizes fuel consumption, and reduces greenhouse gas emissions. Energy management strategies can generally be classified into three categories: (1) rule-based methods, which rely on predefined rules and heuristics; (2) optimization-based methods, which utilize mathematical models to find optimal solutions; (3) learning-based methods, which employ machine learning techniques to adaptively improve performance based on operational data. Rule-based energy management strategies are divided into deterministic and fuzzy rule approaches, which control the output power of hybrid power sources based on fixed, predefined rules without dynamically optimizing for specific objectives like efficiency or fuel economy. While simple and efficient, these strategies require human expertise and generally lack optimality [6], especially in complex environments where they may exhibit poor stability, adaptability, and flexibility [7,8]. Consequently, they are often inadequate for real-time operations in such contexts. Optimization-based energy management strategies replace human intuition with computer algorithms, focusing on optimizing objectives like system efficiency and fuel economy. These strategies can be categorized into offline optimization (global optimization), which processes data and finds optimal solutions without time constraints, and online optimization, which adapts and responds in real-time to changing conditions. Dynamic programming (DP), a global optimization method, achieves optimal solutions under deterministic conditions but requires significant computational resources, making it an offline benchmark for energy management systems [9]. In contrast, the Equivalent Consumption Minimization Strategy (ECMS) [10] provides more practical, real-time optimization capabilities for control systems. ECMS has the advantage of obtaining an instantaneous optimal power allocation by equating the electric power consumption to the fuel consumption and has a low computational complexity. The disadvantages are that the global optimum cannot be achieved because only the current state is considered, the effect of the current control action on the future state of the system is neglected, and it can only be applied to a single optimization objective and cannot optimize multiple objectives at the same time. Traditional rule-based EMS systems suffer from poor optimality and adaptability to complex conditions, while optimization-based EMS systems face challenges due to high computational costs and limited real-time performance. Recently, learning-based EMS systems have emerged as viable solutions to these issues. Techniques such as convolutional neural networks (CNNs) [11] and long short-term memory networks (LSTM) [12] have shown promising results in EMS applications. In highly interactive real-time contexts, deep reinforcement learning (DRL) offers distinct advantages. Mnih et al. [13] introduced the Deep Q Network (DQN), combining deep learning with Q learning to master Atari games directly from pixel inputs without manual feature engineering. Lillicrap et al. [14] proposed the Deep Deterministic Policy Gradient (DDPG) algorithm, addressing continuous control problems using an actor–critic framework with deterministic policy gradients. Oh et al. 
[15] developed a method integrating memory mechanisms and active perception in Minecraft to tackle complex tasks in partially observable environments. Schulman et al. [16] introduced Trust Region Policy Optimization (TRPO), stabilizing policy updates via constrained optimization to improve training efficiency. Lastly, Kendall et al. [17] demonstrated real-world end-to-end autonomous driving using deep reinforcement learning, achieving control with just 20 min of real-world training data. Zhang et al. [18] reviewed the development of energy management strategies for hybrid vehicles, laying the foundation for the subsequent application of deep reinforcement learning to energy management strategies. Liu et al. [19,20] developed an EMS based on Q learning, which, compared to stochastic dynamic programming, showed notable improvements in computational efficiency, optimality, and adaptability. However, its reliance on discrete action and state spaces limits its applicability in more complex EMS scenarios that require continuous decision-making. Wu et al. [21] extended this work by integrating the Deep Q Network (DQN) into EMSs, addressing the limitations of discrete actions and enhancing fuel economy and operational flexibility in real-world applications. Subsequently, Wu [22] and Tan [23] applied the Deep Deterministic Policy Gradient (DDPG) algorithm to EMS systems, using prioritized experience replay to approximate the effectiveness of dynamic programming and demonstrating superior performance in handling continuous actions and states.
In the aviation field, including hybrid electric vertical take-off and landing (hVTOL), the intricate nature of aircraft power systems, with numerous operational constraints and safety requirements, has presented significant challenges for the widespread adoption of DRL in EMS systems. Conventional reinforcement learning methods often struggle to identify satisfactory or even reasonable strategies during the learning process due to the complexity of these environments. Reinforcement learning (RL) relies heavily on trial and error to maximize cumulative rewards through a policy [24]. However, RL has limitations, particularly its dependence on large volumes of real samples during training, which can lead to inefficient sampling and local optima traps. Incorporating prior human knowledge into the RL training process can guide the exploration phase, imitating expert decision-making behaviors and effectively narrowing the search space for optimal solutions. In hVTOL EMS systems, prior knowledge, such as guidelines for battery charging and discharging power, recommendations for remaining battery energy, and advice on battery charge–discharge cycles, can significantly enhance sampling efficiency and reduce the risk of suboptimal solutions.
Building on these challenges and the potential of leveraging prior knowledge, this paper introduces a novel Prior Knowledge-Guided Deep Reinforcement Learning (PKGDRL) framework designed for energy management systems. The PKGDRL framework aims to enhance learning efficiency and fuel economy by integrating domain-specific insights with advanced DRL techniques.
This paper makes the following contributions:
  • It introduces the Prior Knowledge-Guided Deep Reinforcement Learning (PKGDRL) framework for energy management systems, utilizing the advanced DDPG algorithm to integrate prior knowledge effectively.
  • It addresses the challenges of learning efficiency and energy economy in DRL-based EMS by leveraging prior knowledge to streamline the exploration and learning process, significantly improving efficiency and ensuring optimal policy discovery.
  • It provides a detailed modeling of hVTOL systems based on turbo-electric hybrid propulsion systems, showcasing the practical applicability and benefits of the proposed DRL-based hVTOL-EMS.
The rest of the paper is structured as follows: Section 2 describes the modeling of the distributed turbo-electric hybrid propulsion system. Section 3 covers the integration of prior knowledge and the development of the DRL-based hVTOL-EMS. Section 4 evaluates the EMS performance of PKGDRL, verifies the effectiveness of the PKG module, and demonstrates the robustness of PKGDRL. Finally, Section 5 concludes the paper.

2. Configuration and Modeling

2.1. Mission Profile

The aircraft described in this paper is a large hybrid electric vertical take-off and landing (VTOL) vehicle. The cruise speed and altitude of the aircraft are Mach 0.2 (244.8 km/h) and 500 m, respectively. The main parameters of the aircraft and its power system are listed in Table 1.
This paper focuses on a rescue flight mission at sea as the subject of study, examining its flight profiles (Figure 1). Each flight profile encapsulates three key dimensions of information: required power, flight altitude, and Mach number. These dimensions are essential for evaluating the model’s ability to manage power distribution, maintain optimal altitude, and adjust the speed (Mach number) under varying operational scenarios. By analyzing these parameters, we can assess the model’s effectiveness in real-world flight conditions. The detailed description of each stage of the flight profiles used in this paper is shown in Table 2.
In addition, the algorithms and models proposed in this paper currently apply to vertical take-off and landing aircraft powered by turbo-electric hybrid propulsion systems, with an aircraft mass of 500 kg and a maximum engine power of 100 kW; the specific parameters are given in Table 1. Extending the approach to other missions or to aircraft with different configurations and parameters requires changing the flight profile and retraining the model.
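To make the mission-profile structure concrete, the following Python sketch expands stage durations of the kind listed in Table 2 into a per-second time series of (required power, altitude, Mach). The per-stage power, altitude, and Mach levels, and the truncated stage list, are illustrative placeholders rather than values from the paper.

```python
# Illustrative construction of a per-second mission profile from stage durations
# of the kind in Table 2. Each tuple is (stage, duration s, required power kW,
# altitude m, Mach); the numeric levels are placeholders, not the paper's data.
STAGES = [
    ("vertical take-off", 50, 95, 50, 0.05),
    ("climb", 250, 90, 500, 0.15),
    ("cruise", 800, 75, 500, 0.20),
    ("descent and deceleration", 200, 60, 200, 0.10),
    ("low-altitude cruise", 600, 70, 200, 0.15),
]

def build_profile(stages):
    """Expand stage definitions into 1 s samples of (power, altitude, Mach)."""
    profile = []
    for _stage, duration, power, altitude, mach in stages:
        profile.extend([(power, altitude, mach)] * duration)
    return profile

profile = build_profile(STAGES)
print(len(profile), profile[0])   # 1900 samples; first sample of the take-off stage
```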

2.2. Hybrid Power System

In a turbo-electric hybrid propulsion system, the turboshaft engine is kept at its rated speed, where it is most efficient, to drive the generator to produce electrical energy, which is then used to power the electric propulsion units, rather than directly powering the aircraft. This setup is well suited to the series configuration as it reduces mechanical complexity, increases redundancy, and enhances control over the propulsion system, making it a preferred choice for hybrid aircraft. Given these advantages, this study focuses on the design and analysis of a turbo-electric hybrid propulsion system using a series configuration, as illustrated in Figure 2.
Efficiency considerations, such as AC/DC and DC/AC conversion efficiencies, are crucial for the integration of turbo-electric systems with batteries. These efficiencies directly impact the effective utilization and transmission of power between components, thereby affecting the overall performance and energy economy of the propulsion system [25].
The energy management controller (EMC) dynamically adjusts the power distribution between the electric propulsion units and onboard systems based on the total power demand of the aircraft, optimizing overall energy efficiency and performance. When the power output of the aviation turbo-electric system exceeds the total power demand of the aircraft, the excess power is directed to charge the battery. This function ensures that energy is not wasted and keeps the battery ready for peak power demands or emergency situations [26].

2.3. Component Modeling

2.3.1. Turbo-Electric Model

The turbo-electric model consists of a turboshaft engine and a generator. The relationship between these two components can be described by the following equation:
$T_{\text{eng}} = 9550 \times P_{\text{eng}} / n \quad (1)$
$P_{\text{gen}} = P_{\text{eng}} \times \eta_{\text{gen}} \quad (2)$
T_eng (Nm) represents the engine torque, P_eng (kW) denotes the engine output power, and n (rpm) signifies the engine speed. η_gen denotes the generator efficiency, which can be obtained from the interpolation table reflecting the performance characteristics of the generator, as shown in Figure 3.
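As a minimal illustration of Equations (1) and (2), the Python sketch below computes the engine torque and generator power; the generator efficiency map of Figure 3 is not published here, so the interpolation table is a placeholder assumption.

```python
import numpy as np

# Placeholder stand-in for the generator efficiency map of Figure 3 (assumed values).
GEN_POWER_GRID = np.array([20.0, 40.0, 60.0, 80.0, 100.0])   # engine output power, kW
GEN_EFF_GRID = np.array([0.90, 0.93, 0.95, 0.96, 0.96])      # assumed generator efficiency

def turbo_electric_model(p_eng_kw: float, n_rpm: float) -> tuple[float, float]:
    """Return (engine torque in Nm, generator electrical power in kW)."""
    t_eng = 9550.0 * p_eng_kw / n_rpm                              # Equation (1)
    eta_gen = float(np.interp(p_eng_kw, GEN_POWER_GRID, GEN_EFF_GRID))
    p_gen = p_eng_kw * eta_gen                                     # Equation (2)
    return t_eng, p_gen

# Example: 80 kW output at the rated speed of 60,000 rpm (Table 1)
print(turbo_electric_model(80.0, 60000.0))
```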
Fuel consumption is the primary metric for evaluating the performance of an energy management system, as it directly affects the operational efficiency of an aircraft and its impact on the environment. Fuel economy reflects the ability of a system to minimize fuel consumption while maintaining the desired performance [26]. Calculating the fuel consumption of an internal combustion engine is relatively simple because of its relatively simple mechanical structure [27]. A turboshaft engine, however, does not come with the kind of fuel consumption map available for automobile engines; only design-point fuel consumption data are provided, and fuel consumption at other operating points is generally obtained from a high-precision component-level model by solving the engine's common working equations. This approach is clearly unsuitable for real-time estimation of the hybrid system's fuel consumption, so the full-envelope model calculations must be carried out offline and used to train a real-time fuel consumption module.
To calculate the fuel consumption in real time with high accuracy, this paper uses neural networks and machine learning algorithms to build a real-time fuel consumption prediction module. The LightGBM, XGBoost, CatBoost, and torch libraries under Python 3.11 were mainly used to build the machine learning models. The module predicts the real-time fuel consumption of the engine from the target speed, target power, and flight altitude, as shown in Figure 4.
To ensure robust predictions, we evaluated several machine learning algorithms, including CatBoost, LightGBM, XGBoost, the Multilayer Perceptron (MLP), and Deep Neural Networks (DNNs, an extension of the MLP with additional hidden layers). Among them, CatBoost, LightGBM, and XGBoost are decision tree-based machine learning algorithms; their trees have a maximum depth of 6 and a learning rate of 0.05. MLP and DNN are neural networks with 3 and 5 hidden layers, respectively, and a learning rate of 0.001. These algorithms were selected for their ability to handle non-linear relationships and high-dimensional data effectively. We assessed their performance using the Root Mean Square Error (RMSE) as the evaluation metric, focusing on the model's accuracy in predicting fuel consumption.
Based on the results shown in Table 3, LightGBM demonstrates the highest accuracy in predicting fuel consumption, as indicated by the lowest Root Mean Square Error (RMSE) among the tested algorithms. Due to its superior performance in handling complex data patterns and providing reliable predictions, we have selected LightGBM as the real-time prediction module for fuel consumption in our energy management system.
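For readers who want to reproduce the workflow, the following sketch trains a LightGBM regressor with the hyperparameters stated above (maximum depth 6, learning rate 0.05) and reports the RMSE. The training data here are synthetic stand-ins; the real offline component-level engine dataset is not public.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the offline (speed, power, altitude) -> fuel-flow dataset.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(50000, 60000, 5000),   # target speed (rpm)
    rng.uniform(60, 100, 5000),        # target power (kW)
    rng.uniform(0, 1000, 5000),        # flight altitude (m)
])
y = 2e-4 * X[:, 1] * (1 + 1e-5 * X[:, 2]) + rng.normal(0, 1e-4, 5000)  # fuel flow (kg/s)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LGBMRegressor(max_depth=6, learning_rate=0.05, n_estimators=500)
model.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"test RMSE: {rmse:.6f}")
```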

2.3.2. Li-Ion Battery Model

The working principle of the battery subsystem is to update the State of Charge (SoC) based on the power P batt allocated by the energy management controller, as given in Equation (3):
$I(t) = \dfrac{V_{\text{oc}}(t) - \sqrt{V_{\text{oc}}^2(t) - 4 \cdot R_0 \cdot P_{\text{batt}}(t)}}{2 \cdot R_0}, \qquad SoC(t) = \dfrac{Q_0 - \int_0^t I(t)\,dt}{Q} \quad (3)$
where V_oc(t) is the open-circuit voltage, R_0 is the internal resistance, and the State of Charge (SoC) is the percentage of the remaining battery capacity relative to the rated capacity. Q_0 is the initial battery capacity, Q is the nominal total battery capacity, I(t) is the current, and P_batt is the battery power.
P_batt is essentially the demanded power minus the power delivered through the engine path; in other words, it is the power that the battery must output at that moment, as given in Equations (4) and (5).
$P_{\text{mot}} = P_{\text{req}} / \eta_{\text{fan}} \quad (4)$
$P_{\text{batt}} = \left( P_{\text{mot}} / \eta_{DC/AC} - P_{\text{gen}} \times \eta_{AC/DC} \right) / \eta_{DC/DC} \quad (5)$
In addition, the open-circuit voltage V_oc(t) and the internal resistance of the battery are not constant but vary with the State of Charge (SoC) level. As the SoC changes, the battery's electrochemical properties alter, leading to variations in these parameters. Figure 5 illustrates how the open-circuit voltage and internal resistance change with varying SoC levels.
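A minimal sketch of Equations (3)-(5) follows. The open-circuit voltage and internal resistance tables that stand in for Figure 5 are assumed values, the efficiencies come from Table 1, and discharge is taken as positive battery power.

```python
import numpy as np

# Assumed SoC breakpoints for the open-circuit voltage and internal resistance
# curves of Figure 5; the true curves are not tabulated in the paper.
SOC_GRID = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
VOC_GRID = np.array([320.0, 340.0, 350.0, 355.0, 360.0, 370.0])   # V (assumed)
R0_GRID = np.array([0.12, 0.10, 0.09, 0.08, 0.09, 0.10])          # Ohm (assumed)

def battery_step(soc: float, p_batt_kw: float, q_ah: float = 40.0, dt_s: float = 1.0) -> float:
    """Advance the SoC by one time step using Equation (3) (discharge positive)."""
    v_oc = float(np.interp(soc, SOC_GRID, VOC_GRID))
    r0 = float(np.interp(soc, SOC_GRID, R0_GRID))
    p_w = p_batt_kw * 1e3
    current = (v_oc - np.sqrt(v_oc**2 - 4.0 * r0 * p_w)) / (2.0 * r0)   # A
    return soc - current * dt_s / (q_ah * 3600.0)                        # Ah converted to As

def power_split(p_req_kw: float, p_gen_kw: float,
                eta_fan=0.87, eta_dcac=0.98, eta_acdc=0.98, eta_dcdc=0.98) -> float:
    """Equations (4)-(5): battery power needed to cover the residual demand."""
    p_mot = p_req_kw / eta_fan
    return (p_mot / eta_dcac - p_gen_kw * eta_acdc) / eta_dcdc

# Example: 88 kW total demand, 76 kW of generator power, battery covers the rest.
print(battery_step(0.9, power_split(88.0, 76.0)))
```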

2.4. Operation Mode and Constraints

This section outlines the operation mode of the system and details the constraints that must be managed throughout its operation.
Limit 1: The battery output power P_batt must not exceed the maximum battery power P_batt^max of 30 kW.
Limit 2: The battery output power P_batt must not exceed the limit imposed by the battery's own characteristics. This limit, determined jointly by the open-circuit voltage V_oc(t) and the internal resistance R_0, varies over time; the instantaneous maximum set by the battery itself is V_oc^2(t)/(4·R_0), and the constraint is written as:
$V_{\text{oc}}^2(t) - 4 \cdot R_0 \cdot P_{\text{batt}}(t) \geq 0$
Limit 3: The engine output power P_eng must not exceed its maximum power of 100 kW.
Limit 4: When the SoC is 0, the battery is depleted, so the final calculated P_batt must be less than or equal to 0 (charging). Similarly, when the SoC is 1, the battery is fully charged, so the final calculated P_batt must be greater than or equal to 0 (discharging). In particular, we must consider the situation where, at an SoC of 0, the battery cannot provide power and the engine alone cannot meet the total power demand of the aircraft. This situation results from an unreasonable cumulative strategy over multiple time steps, indicating that the overall strategy is flawed and should also be penalized; this phenomenon must therefore be avoided as well. The entire operation process, as illustrated in Figure 6, can be divided into the following steps:
  • Using Equation (4), the total power demand can directly determine the total power of the electric propulsion unit ( P mot ). Jump to step 2.
  • Using the engine output power (as the decision variable), the generator output power ( P gen ) can be obtained by using Equations (1) and (2). Jump to step 3.
  • After obtaining P mot and P gen , use Equation (5) to determine the battery output power ( P batt ). Jump to step 4.
  • Perform the Limit 1 and Limit 2 checks on the resulting P_batt. If it exceeds the maximum value, clamp it to the maximum feasible value and go to step 5; otherwise, go to step 6.
  • Since P_batt has been changed, use Equations (1), (2) and (5) to compute the changed P_eng and check it against Limit 3. If it exceeds the maximum, clamp it to the feasible maximum and repeat step 3. If at this point P_batt still does not satisfy the Limit 1 and Limit 2 checks, apply a penalty so that the RL algorithm learns that this strategy is undesirable. Jump to step 6.
  • Perform the Limit 4 check; if it is not satisfied, apply a penalty so that the RL algorithm learns that this strategy is undesirable. Jump to step 7.
  • Using the current P_eng together with the externally supplied flight altitude and Mach number as inputs, obtain the current fuel consumption from the real-time fuel prediction module.
  • Use P batt in Equation (3) to update the SoC value.
For the energy management system of a distributed turbo-electric hybrid propulsion system, the optimal strategy generally lies near the boundary conditions, so clearly defining the boundary conditions and constraints is a prerequisite for the model to find a more optimal strategy. The essence of reinforcement learning is trial and error, and this study uses the four boundary conditions defined above to keep the explored actions within the limits of the boundary conditions as much as possible, which reduces the tendency of the actions to explore outside these boundaries and avoids most irrational strategies. As shown in Figure 6, the selected actions are checked against and restricted by multiple boundary conditions: actions that can be corrected are clamped onto the boundary, and actions that cannot be corrected receive a large penalty.
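The following Python sketch implements the limit-checking flow of Figure 6 in simplified form. The efficiencies are taken from Table 1, while the constant generator efficiency, the fixed open-circuit voltage and resistance, and the penalty magnitudes are illustrative assumptions rather than the paper's values.

```python
import numpy as np

# Simplified limit-checking step of Figure 6 (discharge positive).
P_ENG_MIN, P_ENG_MAX, P_BATT_MAX = 60.0, 100.0, 30.0                # kW (Table 1)
ETA_FAN, ETA_DCAC, ETA_ACDC, ETA_DCDC, ETA_GEN = 0.87, 0.98, 0.98, 0.98, 0.95
V_OC, R0 = 355.0, 0.08            # assumed constant here; in the model they vary with SoC

def split(p_req, p_eng):
    """Equations (4)-(5): battery power implied by a given engine setting."""
    p_mot = p_req / ETA_FAN
    p_gen = p_eng * ETA_GEN
    return (p_mot / ETA_DCAC - p_gen * ETA_ACDC) / ETA_DCDC

def apply_limits(p_req, p_eng, soc):
    """Return (p_eng, p_batt, penalty) after the Limit 1-4 checks."""
    penalty = 0.0
    p_eng = float(np.clip(p_eng, P_ENG_MIN, P_ENG_MAX))             # Limit 3
    p_batt = split(p_req, p_eng)

    p_limit = min(P_BATT_MAX, V_OC**2 / (4.0 * R0) / 1e3)           # Limits 1 and 2 (kW)
    if abs(p_batt) > p_limit:
        p_batt = float(np.clip(p_batt, -p_limit, p_limit))
        # The engine must then absorb the residual demand (step 5); re-check Limit 3.
        p_gen_needed = (p_req / ETA_FAN / ETA_DCAC - p_batt * ETA_DCDC) / ETA_ACDC
        p_eng = p_gen_needed / ETA_GEN
        if p_eng > P_ENG_MAX:
            p_eng = P_ENG_MAX
            penalty -= 100.0          # infeasible action: demand cannot be met

    if (soc <= 0.0 and p_batt > 0.0) or (soc >= 1.0 and p_batt < 0.0):   # Limit 4
        penalty -= 100.0
    return p_eng, p_batt, penalty

print(apply_limits(p_req=120.0, p_eng=70.0, soc=0.5))
```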

3. Utilization of Prior Knowledge and Deep Reinforcement Learning

In this study, prior knowledge includes the battery characteristics, the maximum energy required during aircraft operation, the number of battery charge and discharge cycles, and the constraints derived in Section 2.4. These serve as constraints that guide the EMS towards global optimality, and they reflect general properties of an optimal EMS. Compared with DQN, this study uses DDPG to further explore the potential of energy management.

3.1. Prior Knowledge

The charge and discharge characteristics of the battery are also important factors influencing EMS performance. From the variation in internal resistance during the charge and discharge processes in Figure 5, it can be seen that when the SoC is 0.6, the internal resistance is minimal, and the battery efficiency is the highest [28]. Similarly, we need to avoid operating in regions of low battery efficiency, such as when SoC is less than 0.3, where the charge and discharge internal resistances are high, leading to low efficiency. From another perspective, when the battery is discharged to only 30% of its total capacity, the battery voltage will drop sharply, also affecting battery efficiency.
It is known that the aircraft will operate at a high power level for a period during operation. Relying solely on engine output power, often even at maximum, may not meet the power requirements during this period. Therefore, the battery must also output a certain power to complement the engine to meet the power requirements. As prior knowledge, we require the battery pack to maintain a sufficient amount of energy for the next high-power operation in most cases.
According to experience, an excellent EMS should not change the charge and discharge state frequently. Therefore, we consider minimizing the number of changes in the charge and discharge state as part of the prior knowledge. Additionally, we incorporate the constraints from Section 2.4 as the final part of prior knowledge, aiming to guide EMS to avoid penalties and irrational phenomena.

3.2. DRL-Based Energy Management Strategy

DRL agents operate in environments with the Markov property. The agent and the environment interact continuously: the agent selects actions, and the environment provides rewards for these actions and presents new states to the agent. This study combines the DDPG algorithm with prior knowledge of the hVTOL to learn the optimal EMS actions.
Figure 7 shows how hVTOL interacts with the environment and agents. The states, action variables, and rewards are set as follows:
$s = \{SoC, H, Ma, P_{\text{req}}\}, \quad a = \{\text{continuous } P_{\text{eng}}\}, \quad r = f_{PT}\left(-\alpha \cdot E_{\text{fuel}}(t) + \beta \cdot F_{PK}(t)\right)$
where α is the energy consumption weight and β is the prior knowledge guidance weight. f_PT is the prospect theory value function, which helps the DRL agent explore better values faster; applying prospect theory, a psychological theory that models the decision-making of human decision-makers, to DRL is one of the innovations of this paper. The prospect theory value function is as follows:
$f_{PT}(x) = \begin{cases} x^{0.88}, & x \geq 0 \\ -2.25 \cdot (-x)^{0.88}, & x < 0 \end{cases}$
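As a small illustration, the prospect theory value function above can be written directly in Python; the function name f_pt is just a label for this sketch.

```python
def f_pt(x: float) -> float:
    """Prospect theory value function: concave for gains, loss-averse for losses."""
    return x ** 0.88 if x >= 0 else -2.25 * ((-x) ** 0.88)

# A gain of 1 is valued at 1.0, while a loss of 1 is felt as -2.25, so the agent
# is pushed harder to avoid penalties than to chase rewards of equal size.
print(f_pt(1.0), f_pt(-1.0))
```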
E_fuel(t) is the output of the real-time fuel prediction module described in Section 2.3.1, and F_PK(t) is the prior knowledge guidance term, which consists of four parts: the SoC guidance range, the energy reserve, the charge/discharge state change, and the system constraints, denoted K_1, K_2, K_3, and K_4, respectively. It is calculated as follows:
$F_{PK}(t) = K_1 + K_2 + K_3 + K_4$
where
$K_1 = \begin{cases} -P_{\text{batt}} / P_{\text{batt}}^{\max}, & SoC < 0.3 \\ 1, & 0.3 \leq SoC \leq 0.6 \\ P_{\text{batt}} / P_{\text{batt}}^{\max}, & SoC > 0.6 \end{cases}$
The specific meaning is that a reward is given when the battery is charging while the SoC is below 0.3, or discharging while the SoC is above 0.6, and a penalty is given otherwise; in addition, when the SoC lies between 0.3 and 0.6, a reward is always given.
K_2 = −(SoC_ref − SoC), where SoC_ref is the reference SoC required to satisfy the next high-power operation; when the SoC falls below this reference, the term becomes a penalty.
K_3 penalizes charge/discharge state changes: the more frequent the changes, the greater the penalty. It is expressed as K_3 = −i, where i is the number of charge/discharge state changes.
K_4 represents the system constraints: if the model violates a constraint, it receives a huge penalty, which drives the model during training to stay within the constraints. It is expressed as K_4 = −λ·j, where j is the number of times the bounds are exceeded and λ is the penalty factor, generally set to a large value.
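Putting the four terms together, the sketch below computes F_PK for one time step. The penalty factor and the specific numeric values in the example are assumptions for illustration; the term forms follow the descriptions above.

```python
def prior_knowledge_reward(soc, soc_ref, p_batt, p_batt_max,
                           n_switches, n_violations, penalty_factor=100.0):
    """F_PK = K1 + K2 + K3 + K4 for one step (discharge positive)."""
    # K1: reward charging below SoC 0.3, discharging above 0.6, neutral reward in between.
    if soc < 0.3:
        k1 = -p_batt / p_batt_max
    elif soc <= 0.6:
        k1 = 1.0
    else:
        k1 = p_batt / p_batt_max
    k2 = soc - soc_ref                    # penalty when below the energy-reserve reference
    k3 = -float(n_switches)               # each charge/discharge state change is penalized
    k4 = -penalty_factor * n_violations   # large penalty per constraint violation
    return k1 + k2 + k3 + k4

# Example: mid-range SoC slightly below the reserve reference, one state switch, no violations.
print(prior_knowledge_reward(soc=0.5, soc_ref=0.6, p_batt=10.0, p_batt_max=30.0,
                             n_switches=1, n_violations=0))
```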
Among the various DRL algorithms, this study chooses DDPG (Deep Deterministic Policy Gradient) for its outstanding performance in handling continuous state and action spaces, thanks to its independent action network. This characteristic allows DDPG to map states to deterministic sequences of continuous actions. Table 4 provides the framework for the baseline algorithm in this paper. The algorithm employs deep neural networks to implement the critic and actor, using multilayer perceptrons to manage large and complex state and action spaces. The critic network is trained using the Bellman equation, while the actor network is updated using the sampled policy gradient method. Table 5 shows the hyperparameters of DDPG training.
The architecture of the actor–critic network designed specifically for the EMS task is illustrated in Figure 8. This architecture features a pyramid-shaped topology, with each layer gradually decreasing in size. Extensive comparative studies have confirmed the effectiveness of this design [28,29]. To enhance learning efficiency and accelerate convergence, the DDPG framework introduces prioritized experience replay (PER). PER prioritizes the replay of experiences based on the magnitude of the TD (temporal difference) error, as shown in reference [30]. Unlike random experience replay, PER more frequently replays important observation data, significantly improving the efficiency of reinforcement learning.
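To make the pyramid-shaped actor-critic topology of Figure 8 concrete, the following torch sketch builds the two networks. The exact layer widths (128, 64, 32) are assumptions, since only the topology is described; the actor's tanh output is then scaled onto the 60-100 kW engine range.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Pyramid-shaped actor: state (SoC, H, Ma, P_req) -> normalized engine power."""
    def __init__(self, state_dim: int = 4, action_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, action_dim), nn.Tanh(),   # output in [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Pyramid-shaped critic: (state, action) -> Q value."""
    def __init__(self, state_dim: int = 4, action_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def to_engine_power(a: torch.Tensor) -> torch.Tensor:
    """Map the actor's [-1, 1] output onto the 60-100 kW engine range."""
    return 80.0 + 20.0 * a

# Quick shape check with a zero state.
actor, critic = Actor(), Critic()
s = torch.zeros(1, 4)
a = actor(s)
print(to_engine_power(a).item(), critic(s, a).shape)
```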
In this study, DQN is used as a comparison method. However, DQN is only applicable to scenarios with discrete action spaces, which makes it unsuitable for high-dimensional and continuous operational spaces. Therefore, in this research, the state variables and reward functions for DQN are set the same as those for DDPG. The action space for DQN is defined by dividing the engine's power range (60 kW to 100 kW) into 11 discrete levels, which serve as the action choice space.
$a = P_{\text{eng}} \in \{60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100\}\ \text{kW}$
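The discrete action set is simple to express in code; the sketch below builds the 11 engine-power levels listed above and maps a DQN action index to a power command.

```python
import numpy as np

# The 11 discrete engine-power actions used by the DQN comparison method.
DQN_ACTIONS_KW = np.arange(60.0, 101.0, 4.0)   # 60, 64, ..., 100 kW
assert len(DQN_ACTIONS_KW) == 11

def action_to_power(action_index: int) -> float:
    """Map a DQN action index (0-10) to an engine power command in kW."""
    return float(DQN_ACTIONS_KW[action_index])

print(action_to_power(0), action_to_power(10))   # 60.0 100.0
```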

4. Results and Discussion

In this section, we present a comprehensive analysis of the EMS based on PKGDRL, evaluating its performance through extensive comparative experiments and assessment metrics. The model is tested using standard flight profiles as training data, with the initial battery State of Charge (SoC) set to 0.9.

4.1. Feasibility of PKGDRL in EMS

To validate the effectiveness of the methodology in this paper, the PKGDRL (PKG-DDPG) method, the rule-based method, and the offline benchmark of the optimization-based method (DP) are analyzed and compared along several dimensions. The rule-based approach used in this paper can be summarized simply: the engine always outputs its maximum power of 100 kW, and when the engine power exceeds the demanded power, the excess power charges the battery.
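For reference, the rule-based baseline described above reduces to a few lines; conversion efficiencies are neglected in this sketch, so it only illustrates the rule, not the exact simulated baseline.

```python
def rule_based_ems(p_req_kw: float) -> tuple[float, float]:
    """Rule-based baseline: engine fixed at 100 kW, surplus power charges the battery.
    Returns (engine power, battery power); negative battery power means charging."""
    p_eng = 100.0
    p_batt = p_req_kw - p_eng
    return p_eng, p_batt

print(rule_based_ems(80.0))   # (100.0, -20.0): 20 kW of surplus goes into the battery
```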
For energy management strategies, an important metric for evaluating the merit of a strategy is the State of Charge (SoC) variation curve. Generally speaking, the closer the shape of the SoC curve is to that of the offline benchmark (DP) of the optimization-based method, the better the method. The SoC curves of the three methods for a specific flight profile are shown in Figure 9.
As Figure 9 shows, the PKG-DDPG method is far superior to the rule-based method and closely follows the SoC curve of DP in the first half of the mission. The difference between PKG-DDPG and DP grows slightly from 2000 s onwards, but the curves remain similar in shape, suggesting that the chosen strategies are broadly consistent. This difference is also explainable: because the prior knowledge includes a battery energy retention term (ensuring sufficient energy for the next high-power ascent or descent), the strategy chosen by PKG-DDPG avoids letting the SoC drop too low, similar to the behavior of the DP method after 2000 s. The fuel consumption of the three methods is shown in Table 6.
From Table 6, we can see that, under the same real-time setting, the fuel economy of the PKG-DDPG method is much better than that of the rule-based method, with a fuel saving of 5.9%, and is only 1.6% away from DP. Next, we show the power curves of the three methods in Figure 10 (PKG-DDPG), Figure 11 (Rule), and Figure 12 (DP).

4.2. Validity of the PKG Module

In Section 3, we noted that the PKG module consists of four prior knowledge terms together with prospect theory, and that its advantage is helping deep reinforcement learning algorithms converge more easily to effective strategies and ultimately reach better ones. In this section, the effectiveness of the PKG module is illustrated through an ablation experiment, whose results are shown in Table 7.
In this paper, two methods, DDPG and DQN, are selected to illustrate the effectiveness of prospect theory (PT), and the results are shown in Figure 13, Figure 14, Figure 15 and Figure 16. From Figure 13 we can see that DDPG without prospect theory exhibits two periods of frequent and drastic charge/discharge changes. Taking the earlier period as an example, because the plain DDPG is not guided by prospect theory, it is insensitive to the loss associated with charge/discharge switching, which leads to a "hesitant", repeatedly switching charge/discharge strategy during that period.
In addition, Table 7 shows that the reward of DDPG with the prospect theory reward is much higher than that of DDPG without prospect theory. The advantage of prospect theory is also reflected in the fuel economy. From Figure 13 and Figure 14, we can see that adding prospect theory to DDPG brings a large improvement in both convergence speed and effectiveness (finding a better strategy).
To verify that prospect theory is not only effective for DDPG, we also test it on the DQN method. From Figure 15 and Figure 16, after adding prospect theory to DQN, not only are the reward and fuel economy greatly improved, but the SoC curve is also much closer to that of DP, and the convergence speed and final result are improved as well. It is worth noting that the oscillations of the SoC curve in Figure 15 are more frequent because DQN, unlike DDPG, selects actions from a discrete set; this leads to larger jumps in the selected action and larger variations in engine power, which ultimately appear as more frequent oscillations in the SoC curve. There are also several large oscillations in the reward in Figure 16: as stated in Section 2.4, the optimal strategy generally lies near the boundary conditions, but once the agent explores outside the boundary conditions it receives a huge penalty, reflected as a negative reward, which is the root cause of the several cliff-like drops in the reward in this figure.
Next, we analyzed the effect of the four prior knowledge terms on the convergence of deep reinforcement learning; the experimental results are shown in Figure 17. This figure shows the average reward curves for deep reinforcement learning in the application setting of this paper, with the original four prior knowledge terms and with each one removed in turn, reflecting the convergence of the reinforcement learning training. We numbered the four experiments X1, X2, X3, and X4. As an example, X1 represents an experiment that used prior knowledge PK2, PK3, and PK4 but was missing PK1; similarly, X2, X3, and X4 represent experiments missing PK2, PK3, and PK4, respectively. The experimental results show that removing any one of these four types of prior knowledge prevents the final result from converging and does not yield a usable policy; that is, all the resulting policies violate the boundaries mentioned in Section 2 and are ineffective.
The X2 and X4 experiments are worth noting. Although X2 reaches its peak reward at the 110th epoch, the strategy still fails to produce a valid solution during model inference. The X4 curve appears to keep the reward at a higher value throughout, but what is missing is PK4 (the penalty term for exceeding the bounds), so the X4 reward only looks high because it ignores the huge penalty for boundary violations; accounting for that penalty, the average reward level of X4 would be around −34.
To summarize this section, we experimentally demonstrate the effectiveness of the PKG module: prospect theory both accelerates convergence and leads to convergence to a better strategy, whereas removing any one of the four prior knowledge terms prevents convergence to a valid strategy.

4.3. Robustness Verification

The robustness of the algorithm is initially verified by simulating the flight profiles. In this subsection, we first verify the robustness of the EMS under different initial SoC cases. Since the initial SoC is always near the reference value in practical applications, the initial SoC is set as 0.9, 0.93, 0.95, 0.87, and 0.85. Figure 18 shows the trend of the SoC profiles with different initial SoCs.
The curves converge from different initial values until they intersect at 500 s. After 500 s, the five curves basically overlap, resulting in the same terminal SoC. In conclusion, the EMS obtained in this paper is able to work properly at various initial SoCs.
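A hedged sketch of this robustness check is given below: roll the trained policy out on the same flight profile from several initial SoC values and record the SoC trajectories. The `env` and `policy` objects stand in for the hybrid-propulsion environment and the trained PKG-DDPG actor described above and are not part of the paper's published code.

```python
# Roll out a trained policy from several initial SoCs and collect the SoC trajectories.
# `env` and `policy` are hypothetical stand-ins for the environment and trained actor.
def soc_trajectory(env, policy, soc0: float) -> list[float]:
    state = env.reset(initial_soc=soc0)
    trajectory = [soc0]
    done = False
    while not done:
        action = policy(state)
        state, _reward, done = env.step(action)
        trajectory.append(state[0])          # SoC is the first state variable
    return trajectory

initial_socs = [0.85, 0.87, 0.90, 0.93, 0.95]
# trajectories = {s0: soc_trajectory(env, policy, s0) for s0 in initial_socs}
```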
Secondly, the robustness of the EMS under different cruise times is verified. Since different missions travel to different locations in practical applications, the total cruise time is set to 1500, 2000, and 2500 s, and we denote these three flight profiles with different cruise times as flight profile 1, flight profile 2, and flight profile 3, respectively. Figure 19 shows the trend of the SoC profiles for different total cruise times.
It can be seen that the shapes of the three curves are essentially the same, with the SoC variation simply extended during the level cruise phase; in short, the EMS obtained in this paper works normally under different total cruise durations.

5. Conclusions

A Prior Knowledge-Guided Deep Reinforcement Learning (PKGDRL)-based energy management strategy (EMS) for hybrid vertical take-off and landing (hVTOL) vehicles was investigated, leveraging the state-of-the-art Deep Deterministic Policy Gradient (DDPG) algorithm. The incorporation of prior knowledge into the DRL-based EMS significantly enhances learning efficiency and fuel economy by narrowing the exploration and learning scope of optimal solutions. This method demonstrates substantial improvements in convergence efficiency and policy effectiveness compared to traditional DRL approaches. The PKG module proposed in this paper is central to these advantages: the prior knowledge guides the convergence of the results, and the prospect theory component improves fuel economy by 0.81%.
The experimental results indicate that the proposed PKGDRL method not only maintains high fuel economy but also optimizes energy use, achieving results very close to the dynamic programming (DP) benchmark, with a difference of only 1.6%. Additionally, the model is highly generalizable and robust, with stable and consistent SoC curves over different flight profiles and initial SoCs. These results suggest that PKGDRL excels in energy efficiency and convergence speed, demonstrating its robustness and suitability for real-time applications in hVTOL-EMS.
The algorithms and models presented in this paper are currently applicable to vertical take-off and landing aircraft powered by turbo-electric hybrid propulsion systems. Future work will explore the transferability of the PKGDRL model to other types of hybrid electric propulsion systems, with the aim of generalizing the approach and further reducing the development cycle for different EMS applications.

Author Contributions

Conceptualization, F.Y., X.S. and J.C.; methodology, F.Y. and W.T.; software, F.Y. and X.S.; validation, F.Y., W.T. and J.C.; data curation, F.Y. and J.C.; writing—original draft preparation, F.Y.; writing—review and editing, F.Y., W.T. and J.C.; visualization, F.Y.; supervision, J.W., and X.C.; project administration, J.W. and X.C.; funding acquisition, J.W. and X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Yongjiang Talent Project of Ningbo (No. 2022A-012-G); Ningbo major innovation project 2025 (2022Z040); and Ningbo major innovation project 2035 (2024Z063).

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
hVTOL: Hybrid electric vertical take-off and landing
DRL: Deep reinforcement learning
EMS: Energy management strategy
PKGDRL: Prior Knowledge-Guided Deep Reinforcement Learning
DDPG: Deep Deterministic Policy Gradient
DP: Dynamic programming
SoC: State of Charge
eVTOL: Electric vertical take-off and landing vehicle
ECMS: Equivalent Consumption Minimization Strategy
DQN: Deep Q Learning
RL: Reinforcement learning
PT: Prospect theory
PKG-DDPG: Prior knowledge-guided Deep Deterministic Policy Gradient
PKGDQN: Prior knowledge-guided Deep Q Learning
PER: Prioritized experience replay
TD: Temporal difference

References

  1. Xiang, S.; Xie, A.; Ye, M.; Yan, X.; Han, X.; Niu, H.; Huang, H. Autonomous eVTOL: A summary of researches and challenges. Green Energy Intell. Transp. 2023, 2023, 100140. [Google Scholar] [CrossRef]
  2. Bravo, G.M.; Praliyev, N.; Veress, Á. Performance analysis of hybrid electric and distributed propulsion system applied on a light aircraft. Energy 2021, 214, 118823. [Google Scholar] [CrossRef]
  3. Jansen, R.; Bowman, C.; Jankovsky, A.; Dyson, R.; Felder, J. Overview of NASA Electrified Aircraft Propulsion (EAP) Research for Large Subsonic Transports. In Proceedings of the 53rd AIAA/SAE/ASEE Joint Propulsion Conference, Atlanta, GA, USA, 10–12 July 2017; American Institute of Aeronautics and Astronautics: Reston, VA, USA, 2017. [Google Scholar] [CrossRef]
  4. Nerone, T. Hybrid thermally efficient core (HyTEC) project overview. In Proceedings of the 8th International Workshop on Aviation and Climate Change, Cambridge, UK, 15–17 May 2023. [Google Scholar]
  5. Uwase, D. An overview on zero-emission tugs (or ships) in the market. Mar. Sustain. Technol. 2024; in press. [Google Scholar]
  6. Ali, A.M.; Söffker, D. Towards optimal power management of hybrid electric vehicles in real-time: A review on methods, challenges, and state-of-the-art solutions. Energies 2018, 11, 476. [Google Scholar] [CrossRef]
  7. Li, S.G.; Sharkh, S.M.; Walsh, F.C.; Zhang, C.N. Energy and battery management of a plug-in series hybrid electric vehicle using fuzzy logic. IEEE Trans. Veh. Technol. 2011, 60, 3571–3585. [Google Scholar] [CrossRef]
  8. Peng, J.; He, H.; Xiong, R. Rule based energy management strategy for a series–parallel plug-in hybrid electric bus optimized by dynamic programming. Appl. Energy 2017, 185, 1633–1643. [Google Scholar] [CrossRef]
  9. Hofman, T.; Steinbuch, M.; van Druten, R.M.; Serrarens, A.F.A. Rule-based energy management strategies for hybrid vehicle drivetrains: A fundamental approach in reducing computation time. IFAC Proc. Vol. 2006, 39, 740–745. [Google Scholar] [CrossRef]
  10. Onori, S.; Serrao, L.; Rizzoni, G. Adaptive equivalent consumption minimization strategy for hybrid electric vehicles. In Proceedings of the ASME 2010 Dynamic Systems and Control Conference, Cambridge, MA, USA, 12–15 September 2010; ASMEDC: New York, NY, USA, 2010; pp. 499–505. [Google Scholar] [CrossRef]
  11. Maroto Estrada, P.; de Lima, D.; Bauer, P.H.; Mammetti, M.; Bruno, J.C. Deep learning in the development of energy management strategies of hybrid electric vehicles: A hybrid modeling approach. Appl. Energy 2023, 329, 120231. [Google Scholar] [CrossRef]
  12. Sun, X.; Fu, J.; Yang, H.; Xie, M.; Liu, J. An energy management strategy for plug-in hybrid electric vehicles based on deep learning and improved model predictive control. Energy 2023, 269, 126772. [Google Scholar] [CrossRef]
  13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  14. Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  15. Oh, J.; Chockalingam, V.; Singh, S.; Lee, H. Control of memory, active perception, and action in minecraft. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; JMLR.org: Cambridge, MA, USA, 2016; pp. 2790–2799. [Google Scholar]
  16. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; JMLR.org: Cambridge, MA, USA, 2015; pp. 1889–1897. [Google Scholar]
  17. Kendall, A.; Grimes, M.; Cipolla, R. Learning to Drive in a Day. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8248–8254. [Google Scholar] [CrossRef]
  18. Zhang, F.; Hu, X.; Langari, R.; Cao, D. Energy management strategies of connected HEVs and PHEVs: Recent progress and outlook. Prog. Energy Combust. Sci. 2019, 73, 235–256. [Google Scholar] [CrossRef]
  19. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement Learning of Adaptive Energy Management With Transition Probability for a Hybrid Electric Tracked Vehicle. IEEE Trans. Ind. Electron. 2015, 62, 7839–7849. [Google Scholar] [CrossRef]
  20. Zou, Y.; Liu, T.; Liu, D.; Sun, F. Reinforcement learning-based real-time energy management for a hybrid tracked vehicle. Appl. Energy 2016, 171, 372–382. [Google Scholar] [CrossRef]
  21. Wu, J.; He, H.; Peng, J.; Li, Y.; Li, Z. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus. Appl. Energy 2018, 222, 799–811. [Google Scholar] [CrossRef]
  22. Wu, Y.; Tan, H.; Peng, J.; Zhang, H.; He, H. Deep reinforcement learning of energy management with continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus. Appl. Energy 2019, 247, 454–466. [Google Scholar] [CrossRef]
  23. Tan, H.; Zhang, H.; Peng, J.; Jiang, Z.; Wu, Y. Energy management of hybrid electric bus based on deep reinforcement learning in continuous state and action space. Energy Convers. Manag. 2019, 195, 548–560. [Google Scholar] [CrossRef]
  24. Gläscher, J.; Daw, N.; Dayan, P.; O’Doherty, J.P. States versus Rewards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning. Neuron 2010, 66, 585–595. [Google Scholar] [CrossRef]
  25. Zong, J.; Zhu, B.; Hou, Z.; Yang, X.; Zhai, J. Evaluation and comparison of hybrid wing VTOL UAV with four different electric propulsion systems. Aerospace 2021, 8, 256. [Google Scholar] [CrossRef]
  26. Bai, M.; Yang, W.; Li, J.; Kosuda, M.; Fozo, L.; Kelemen, M. Sizing methodology and energy management of an air–ground aircraft with turbo-electric hybrid propulsion system. Aerospace 2022, 9, 764. [Google Scholar] [CrossRef]
  27. Buvarp, D.; Leijon, J. Comparison of energy use, efficiency and carbon emissions of an electric aircraft and an internal combustion engine aircraft. In Proceedings of the AIAA AVIATION Forum and ASCEND 2024, San Diego, CA, USA, 12–16 August 2024; AIAA: Reston, VA, USA, 2024; p. 4183. [Google Scholar]
  28. Lian, R.; Peng, J.; Wu, Y.; Tan, H.; Zhang, H. Rule-interposing deep reinforcement learning based energy management strategy for power-split hybrid electric vehicle. Energy 2020, 197, 117297. [Google Scholar] [CrossRef]
  29. Larochelle, H.; Bengio, Y.; Louradour, P.; Lamblin, P. Exploring Strategies for Training Deep Neural Networks. J. Mach. Learn. Res. 2009, 10, 1–40. [Google Scholar]
  30. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
Figure 1. Flight envelope.
Figure 2. Ideal power source for distributed electric aircraft: turbo-electric hybrid propulsion system.
Figure 3. Generator efficiency diagram.
Figure 4. Fuel consumption prediction module.
Figure 5. Variation of open-circuit voltage and internal resistance with SoC.
Figure 6. Flowchart of model operation.
Figure 7. Agent–environment interaction for hVTOL energy management.
Figure 8. The architecture of the actor–critic network.
Figure 9. Comparison chart of SoC of the three methods.
Figure 10. PKG-DDPG power curve.
Figure 11. Rule power curve.
Figure 12. DP power curve.
Figure 13. The impact of the presence or absence of prospect theory on the SoC curve of the DDPG approach.
Figure 14. The effect of the presence or absence of prospect theory on the convergence of the DDPG method.
Figure 15. The effect of the presence or absence of prospect theory on the SoC curve of the DQN method.
Figure 16. Convergence effect of DQN methods with and without prospect theory.
Figure 17. Convergence effect diagram with one prior knowledge term missing.
Figure 18. Trend of SoC curves with different initial SoCs.
Figure 19. Trends in SoC curves for different total cruise times.
Table 1. Performance parameters of turbo-electric hybrid propulsion system.
Whole Aircraft Mass (M): 500 kg
Turbine Engine Output Power (P_eng): 60–100 kW
Turbine Engine Speed (n): 60,000 rpm
Maximum Battery Output Power (P_batt^max): 30 kW
Number of Batteries Carried: 20
Total Battery Capacity (Q): 40 Ah
Initial State of Charge (SoC): 0.9
Propeller Efficiency (η_fan): 0.87
AC/DC Efficiency (η_AC/DC): 0.98
DC/DC Efficiency (η_DC/DC): 0.98
DC/AC Efficiency (η_DC/AC): 0.98
Table 2. Flight envelope description.
Vertical Take-off (50 s): The aircraft initiates a vertical ascent from the ground.
Climb (250 s): The aircraft gains altitude steadily to reach its cruising level.
Cruise (800 s): The aircraft maintains a constant altitude and speed, optimizing fuel efficiency.
Descent and Deceleration (200 s): The aircraft reduces altitude and speed in preparation for lower-altitude operations or landing.
Low-Altitude Cruise (600 s): The aircraft flies at a lower altitude, typically for closer observation or specific operational tasks.
Level Deceleration (200 s): The aircraft slows down while maintaining a level flight path.
Hover (Work) (300 s): The aircraft remains stationary in the air for tasks requiring stable positioning.
Climb (200 s): The aircraft ascends again, usually to resume cruising or transition to another phase.
Cruise (600 s): Repeated to cover more distance or reach a new operational area.
Descent (250 s): The aircraft reduces altitude as it approaches its final destination.
Vertical Landing (50 s): The aircraft completes the mission by landing vertically back on the ground.
Table 3. RMSE of each method.
CatBoost: 0.028648
LightGBM: 0.027041
XGBoost: 0.028507
MLP: 0.088775
DNN: 0.138209
Table 4. Baseline algorithm.
1: Initialization: critic network $Q(s, a|\theta^Q)$ and actor network $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$; target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$; memory pool $R$; a random process $\mathcal{N}$ for action exploration
2: for episode = 1 : M do
3:   Get initial states: $SoC_1, H_1, Ma_1, P_{\text{req},1}$
4:   for t = 1 : T do
5:     Select action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
6:     Execute action $a_t$, observe reward $r_t$ and the new state $s_{t+1}$
7:     Pass the obtained reward through the prospect theory value function to obtain the new reward $r_t$
8:     Store the transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
9:     Sample a minibatch of transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$ with prioritized experience replay
10:    Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
11:    Update the critic by minimizing the loss: $L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i|\theta^Q))^2$
12:    Update the actor policy using the sampled policy gradient:
       $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a|\theta^Q)|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s_i}$
13:    Update the target networks:
       $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
14:  end for
15: end for
Table 5. Hyperparameters of DDPG agents.
Learning rate of actor network: 0.00001
Learning rate of critic network: 0.00001
Batch size: 512
Experience buffer size: 10,000
Discount factor: 0.69
Smoothing factor: 0.01
Step time (s): 1
Terminal time (s): 3500
Table 6. Comparison chart of fuel consumption for three types of methods.
PKG-DDPG: 50.662 kg
Rule: 53.651 kg
DP: 49.832 kg
Table 7. PKG module ablation experiment.
Method | PT | PK1 | PK2 | PK3 | PK4 | Effective Solution | Fuel (kg) | Reward
DDPG | √ | √ | √ | √ | √ | Effective | 50.662 | 3414.943
DDPG | × | √ | √ | √ | √ | Effective | 51.072 | 1724.988
DDPG | √ | × | √ | √ | √ | Ineffective | - | -
DDPG | √ | √ | × | √ | √ | Ineffective | - | -
DDPG | √ | √ | √ | × | √ | Ineffective | - | -
DDPG | √ | √ | √ | √ | × | Ineffective | - | -
DQN | √ | √ | √ | √ | √ | Effective | 51.518 | 51.239
DQN | × | √ | √ | √ | √ | Effective | 53.563 | −657.764
√ means that the module (prospect theory or a prior knowledge term) was used in this experiment and × means that it was not.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
