4.1. Training Results and Comparative Analysis
The numerical experiments were conducted on a workstation equipped with an AMD Ryzen 9 7945HX processor, 64 GB of RAM, and an NVIDIA GeForce RTX 4060 GPU within a Windows 11 environment. To evaluate the performance of the proposed Energy-Aware Twin Delayed Deep Deterministic Policy Gradient (EA-TD3) trajectory planning algorithm, a comparative analysis was performed against three benchmark DRL frameworks: Deep Deterministic Policy Gradient (DDPG) [35], Soft Actor-Critic (SAC) [36], and the standard Twin Delayed Deep Deterministic Policy Gradient (TD3) [37].
To ensure a fair validation of operational robustness and energy efficiency, all algorithms were trained and evaluated under identical environmental conditions, including synchronized wind field topologies and calibrated energy intensity models. The hyperparameter configurations for each algorithm are summarized in Table 12. The EA-TD3 algorithm utilizes the architecture and refined target policy noise levels identified in Section 3.8 to manage the nonlinearities and stochastic perturbations inherent in urban trajectory planning.
Comparative experiments were conducted across the four algorithms. To mitigate stochastic training fluctuations, a smoothing technique was applied to the raw training curves. Figure 10 illustrates the training progress: Figure 10a shows the success rate, Figure 10b the average reward, Figure 10c the average number of steps per episode, and Figure 10d the energy consumption.
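The text does not state which smoothing technique was applied to the curves. As a minimal sketch, assuming a simple trailing moving average (the function name `moving_average` and the window size are illustrative choices, not the authors' settings), the operation could look as follows:

```python
def moving_average(values, window=5):
    """Trailing moving average.

    Early points are averaged over the samples seen so far, so the output
    has the same length as the input. The window size is an assumption.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Hypothetical per-episode success flags (1 = success, 0 = failure).
raw_success = [0, 0, 1, 0, 1, 1, 1]
smoothed_success = moving_average(raw_success, window=3)
```

A larger window suppresses more of the episode-to-episode noise but also delays the point at which improvements become visible in the curve.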
In the initial training phase, the agents undergo stochastic exploration, leading to frequent collisions with urban obstacles or environmental boundaries. These early failures result in a success rate near 0% and negative cumulative rewards. During this stage, the absence of an optimized trajectory planning policy keeps the average episode length between 150 and 180 steps. Such inefficient maneuvers lead to rapid energy depletion, nearly exhausting the eVTOL UAV battery budget.
As training progresses, the agents internalize trajectory planning behaviors through experience replay and policy gradient updates. Performance improvements emerge between 10,000 and 20,000 steps, characterized by rising success rates and rewards transitioning into positive territory, while average episode lengths decrease to 80–100 steps. By the conclusion of the training, EA-TD3 achieves a success rate of approximately 90–95%, an average reward of 100–130, and a reduced temporal footprint of 30–40 steps per episode.
Comparatively, SAC achieves the second-best performance with a success rate between 75% and 85%, while TD3 and DDPG exhibit success rates in the range of 60–75% and 50–65%, respectively. The average energy consumption of EA-TD3 is 10–20 J lower than that of the baseline algorithms. This efficiency validates the energy-aware optimization strategy, which is required for battery-constrained missions in complex urban airspaces.
4.2. Analysis of Energy-Aware Trajectory Planning Mechanisms
The training results validate the feasibility of the EA-TD3 algorithm for trajectory planning. To facilitate a qualitative comparison, Figure 11 illustrates the 3D trajectories generated by the four evaluated algorithms within a standardized mission scenario. From a geometric perspective, all algorithms successfully execute a complete flight profile encompassing the phases of climb, cruise, and descent. However, the distinct topological variations in the trajectories, as shown in Figure 11, highlight the varying trajectory characteristics of the agents within the low-altitude urban airspace.
For eVTOL UAV platforms such as the EH216-S, the requirement for trajectory planning lies in the rational utilization of limited energy resources amidst environmental uncertainties. To evaluate the impact of physical constraints on performance, we conduct three progressive analyses within a unified experimental framework. These scenarios represent distinct analytical dimensions of the same mission environment to evaluate the EA-TD3 framework.
1. Maneuvering Behavior Analysis Based on Real Battery Dynamics (Analysis 1). This analysis evaluates how the data-driven energy consumption model shapes the fundamental flight behaviors of the eVTOL UAV. Crucially, since the eVTOL UAV must operate without human intervention, the system must independently evaluate the cost disparities between various maneuvers to ensure mission viability. This phase specifically tests the ability of the EA-TD3 algorithm to internalize nonlinear power costs and identify the optimal balance between climb, cruise, and descent. It should be noted that the dynamic wind field is consistently applied across all analyses to ensure environmental realism and generalizability of the findings.
2. Robustness Testing Under Safe Flight Envelope Constraints and Resource Scarcity (Analysis 2). To evaluate the eVTOL UAV sensitivity to the Safe Flight Envelope (SFE), we progressively compressed the battery energy budget within the established dynamic wind field. This stress test is designed to identify the critical thresholds at which traditional energy-agnostic algorithms suffer functional failure. By highlighting the limitations of conventional methods under coupled environmental and resource pressures, this analysis demonstrates how the EA-TD3 autonomous agent ensures the reliability of missions through proactive trajectory planning reconfiguration. The inclusion of wind effects throughout this process validates that the energy-sensing mechanism remains effective under stochastic perturbations.
3. Analysis of Training Convergence Stability and Constraint Satisfaction (Analysis 3). This analysis is designed to dissect the underlying mechanisms responsible for the performance of the proposed algorithm. By analyzing the fluctuation rate of energy consumption throughout the training process, we distinguish the decision-making stability between soft penalties and hard constraints. The results reveal that incorporating explicit SOC observations enables the agent to internalize a risk-averse trajectory planning strategy. This mechanism ensures that the eVTOL UAV treats energy limits as a non-negotiable safety boundary. Such an intrinsic commitment to constraint satisfaction is the reason why the EA-TD3 autonomous agent maintains higher mission reliability in complex urban environments compared to traditional energy-agnostic frameworks.
4.2.1. Analysis 1: Maneuvering Behavior Analysis Based on Real Battery Dynamics
Under the baseline condition of sufficient energy reserves, this section investigates how the nonlinear power dynamics derived from the CMU battery dataset dictate the decision-making logic of the eVTOL UAV agent. Specifically, we examine the emergent ability of the agent to distinguish between energy-intensive climb/descent maneuvers and energy-efficient cruise flight within the unified experimental framework.
To mitigate the influence of stochastic policy exploration and ensure a rigorous validation of algorithmic robustness, we conducted 50 independent Monte Carlo (MC) simulations for each of the four trained algorithms.
Table 13 presents the statistical averages and performance metrics derived from these trials. To evaluate the geometric quality of the generated trajectories, two directional metrics are introduced: the heading metric, representing the cumulative absolute changes in the horizontal yaw angle to quantify total steering effort, and the smoothness metric, defined as the integrated angular deviation between successive 3D velocity vectors to reflect trajectory fluidity and the suppression of abrupt maneuvers. This comparative analysis verifies the stability of the autonomous trajectory planning policy when interacting with high-fidelity physical constraints.
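The two directional metrics can be illustrated with a short sketch. It assumes that the trajectory is available as a list of (x, y, z) waypoints sampled at uniform time steps, so that successive displacement vectors stand in for velocity vectors. The function names and the handling of purely vertical segments are assumptions made for illustration and may differ from the exact definitions used to produce Table 13.

```python
import math


def _wrap(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def heading_metric(points):
    """Cumulative absolute change of horizontal yaw (rad).

    Purely vertical segments have no defined yaw and are skipped.
    """
    yaws = [
        math.atan2(q[1] - p[1], q[0] - p[0])
        for p, q in zip(points, points[1:])
        if (q[0] - p[0], q[1] - p[1]) != (0.0, 0.0)
    ]
    return sum(abs(_wrap(b - a)) for a, b in zip(yaws, yaws[1:]))


def smoothness_metric(points):
    """Sum of angles (rad) between successive 3D displacement vectors."""
    vels = [tuple(qc - pc for pc, qc in zip(p, q))
            for p, q in zip(points, points[1:])]
    total = 0.0
    for u, v in zip(vels, vels[1:]):
        norm_u = math.sqrt(sum(c * c for c in u))
        norm_v = math.sqrt(sum(c * c for c in v))
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # hovering step: direction undefined
        cos_angle = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
        # Clamp to guard against rounding slightly outside [-1, 1].
        total += math.acos(max(-1.0, min(1.0, cos_angle)))
    return total


# A single right-angle turn in the horizontal plane.
corner = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
```

For the `corner` path both metrics equal π/2, while a straight segment yields zero for both. The heading metric only sees horizontal steering, whereas the smoothness metric also penalizes abrupt changes in climb angle.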
The experimental results demonstrate that EA-TD3 achieves an 11.6% reduction in average energy consumption compared to the baseline, exhibiting the highest energy efficiency and trajectory quality among the evaluated algorithms. This energy-saving effect is not solely a consequence of trajectory length reduction, but originates from the internalization of climb and descent cost balancing through extensive experience sampling. The specific analytical findings are as follows.
First, the optimization of maneuver frequency is governed by power cost awareness. Analysis of the behavioral characteristics reveals that EA-TD3 recorded the lowest frequency of vertical maneuvers, with only 8.82 climbs and 7.72 descents on average. Incorporating the high-fidelity battery dynamics model confirms that eVTOL UAV power consumption is highly nonlinear. In autonomous flight, the vertical climb phase requires substantial power to counteract gravity, resulting in instantaneous demands that significantly exceed those of the horizontal cruise phase.
Traditional algorithms, such as DDPG and TD3, lack an explicit energy perception mechanism and tend to exhibit greedy obstacle-avoidance behavior by frequently resorting to drastic altitude changes. In contrast, the EA-TD3 agent accounts for the physical cost associated with high-rate battery discharge. As illustrated in the lateral profile in Figure 12, EA-TD3 maintains a more stable altitude layer and opts to bypass obstacles through horizontal trajectory adjustments rather than energy-intensive vertical shifts. The trajectories of DDPG and EA-TD3 appear nearly coincident during the initial climb and mid-course phases in Figure 12, which stems from their shared algorithmic lineage. Since EA-TD3 is developed as an energy-aware extension of the TD3 framework, which itself evolves from DDPG, both algorithms utilize a deterministic actor structure that prioritizes the most direct geometric path to satisfy the primary mission completion reward in the early training stages. Within the rigid constraints of a narrow urban canyon, this deterministic gradient leads to a convergence toward a similar initial climb path. However, significant topological differences emerge when comparing EA-TD3 with SAC and the standard TD3. For SAC, the inclusion of a maximum entropy objective encourages continuous action space exploration, resulting in stochastic and curved climb paths. For the standard TD3, the absence of an integrated energy penalty causes the agent to ignore the high battery discharge rate during aggressive climbs. In contrast, EA-TD3 diverges markedly during the descent phase, using its twin-delayed value estimation to prioritize a gradual glideslope. This strategic shift avoids the high power loss regimes of the battery system and ensures the structural load stability of the eVTOL UAV throughout the trajectory planning mission.
Despite its superior energy performance, the proposed EA-TD3 method exhibits noticeable right-angle maneuvers during the final landing phase, as Figure 13 reveals. This behavior originates from the mission success-oriented logic in the terminal state: the agent executes aggressive heading corrections to eliminate residual positional errors so that the eVTOL UAV reaches the landing pad precisely and secures the success reward. Since the current reward structure does not impose heavy penalties on rapid heading changes, these sharp turns emerge as the most effective strategy for the agent to guarantee mission completion. However, these results also highlight a limitation of the current model, specifically the omission of strict kinematic smoothness constraints such as angular acceleration limits. Although the agent achieves high energy efficiency, it does so through a trade-off that sacrifices trajectory fluidity in the final seconds. In real-world scenarios, such abrupt maneuvers could impose severe structural stress on the airframe or even lead to aerodynamic stall. Future work will focus on incorporating curvature-constrained action spaces or refined smoothness terms to ensure that the generated trajectory planning solutions are more aligned with physical flight dynamics.
Secondly, to intuitively reveal decision-making disparities under physical constraints, Figure 13 illustrates the correlation between flight altitude profiles and instantaneous energy consumption rates for each algorithm. As observed in the side plot of Figure 13, the energy consumption trajectories of the baseline algorithms and EA-TD3 diverge significantly during the initial 10 to 15 steps. The main plot further demonstrates that the instantaneous energy consumption rates of TD3, DDPG, and SAC are approximately 40% higher than that of EA-TD3 in the initial phase.
From a trajectory perspective, the baseline algorithms, driven by energy-agnostic logic, tend to execute extremely steep climb maneuvers to rapidly establish a vertical safety margin. While this strategy is valid in purely geometric trajectory planning, these aggressive climbs consume 40% to 50% of the total energy budget within the first third of the mission when physical dynamics are considered. In contrast, the EA-TD3 energy consumption rate remains stable between 0.8 and 2.5 J per step throughout the flight.
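The two quantities discussed above, the share of the budget spent in the first third of the mission and the per-step consumption band, can be checked on any logged energy trace. The following sketch uses entirely hypothetical traces and a 200 J budget. The function names and the trace values are illustrative assumptions and are not taken from the experiments.

```python
def early_budget_share(step_energy_j, budget_j, parts=3):
    """Fraction of the energy budget spent in the first 1/parts of the steps."""
    n_early = max(1, len(step_energy_j) // parts)
    return sum(step_energy_j[:n_early]) / budget_j


def within_band(step_energy_j, low_j=0.8, high_j=2.5):
    """True if every per-step consumption lies inside [low_j, high_j]."""
    return all(low_j <= e <= high_j for e in step_energy_j)


# Hypothetical 30-step traces (J per step), for illustration only.
steep_climb = [9.0] * 10 + [1.0] * 20   # aggressive initial climb
steady = [2.0] * 30                     # stabilized consumption profile

share_steep = early_budget_share(steep_climb, budget_j=200.0)   # 0.45
share_steady = early_budget_share(steady, budget_j=200.0)       # 0.10
```

In this toy example the steep-climb trace spends 45% of the budget in the first third of the episode and leaves the 0.8 to 2.5 J band, whereas the steady trace stays inside it throughout.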
This stabilized flight profile offers significant engineering value for autonomous operations. Consistent power output effectively extends battery cycle life and alleviates thermal management pressure on motors and electronic controllers during high-power discharges. Furthermore, for eVTOL platforms such as the EH216-S, smooth altitude transitions ensure structural load stability and minimize maneuvering stress on the airframe, adhering to rigorous operational standards for civil UAVs. In the figures, the green circles and red crosses represent the starting positions and targets, respectively, while stars denote the waypoints. Most importantly, the EA-TD3 results demonstrate that the energy perception mechanism enables the agent to identify the path with the lowest physical energy cost in complex urban canyons, achieving a dual optimization of energy and spatial efficiency.
4.2.2. Analysis 2: Robustness Testing Under Safe Flight Envelope Constraints and Resource Scarcity
In the UAM environment, an eVTOL UAV must handle both limited energy and wind turbulence in urban canyons. This section presents a stress test within the experimental framework by reducing the battery energy budget from 200 J to 120 J in a non-uniform dynamic wind field. To ensure statistical significance, we conducted 50 independent MC autonomous trajectory planning trials for each algorithm at every energy level. These experiments verify the algorithms’ sensitivity to the flight envelope and mission delivery capability under resource scarcity.
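The stress-test protocol can be sketched as a sweep over energy budgets with a fixed number of Monte Carlo trials per budget. The harness below is a minimal illustration. The `run_episode` callable, the toy episode model, and the intermediate budget levels are hypothetical, since only the 200 J and 120 J endpoints are stated in the text.

```python
import random


def success_rate(run_episode, budget_j, n_trials=50, seed=0):
    """Percentage of successful trials for one energy budget.

    run_episode(budget_j, rng) must return True on mission success.
    """
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_trials) if run_episode(budget_j, rng))
    return 100.0 * successes / n_trials


def sweep(run_episode, budgets_j, n_trials=50):
    """Success rate (%) for each budget in budgets_j."""
    return {b: success_rate(run_episode, b, n_trials) for b in budgets_j}


def toy_episode(budget_j, rng):
    """Toy stand-in for a trained policy: succeeds if a randomly drawn
    energy demand (hypothetical range) fits within the budget."""
    demand_j = rng.uniform(100.0, 140.0)
    return demand_j <= budget_j


# Assumed budget grid between the two endpoints given in the text.
results = sweep(toy_episode, [200.0, 180.0, 160.0, 140.0, 120.0])
```

In practice `toy_episode` would be replaced by a rollout of each trained agent in the dynamic wind field, with the same seeds reused across algorithms so that all of them face identical disturbances.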
The results in Table 14 show that the reliability of the algorithms diverges as the energy budget tightens. As a pilotless platform, the EH216-S requires its trajectory planning agent to identify paths that satisfy the flight envelope without real-time human correction, making the robustness of the autonomous logic the primary factor in mission success.
First, we analyze the performance drop observed at the energy threshold. For this scenario, 120 J is defined as the survival threshold. At this limit, energy-agnostic algorithms such as DDPG and SAC showed high sensitivity with success rates falling to 0% and 6.1% respectively. This occurs because these baseline frameworks lack real-time SOC awareness and cannot reconfigure their trajectories based on remaining energy reserves.
Figure 14 shows the flight status of different algorithms under energy limits. The thicker lines represent failed missions, and the circles of the same color represent the corresponding endpoints. The EA-TD3 algorithm results in the fewest failed missions, followed by TD3 and SAC, while DDPG shows the lowest reliability. As illustrated in Figure 14, when baseline algorithms encounter wind perturbations requiring additional thrust, the eVTOL UAV frequently suffers from functional failure in the final stages, typically between steps 25 and 35. This occurs because their early maneuvers are energy-intensive, leaving no power buffer to counteract late-stage environmental stochasticity. Without managing energy–safety trade-offs, these agents lead the platform to power depletion before mission completion.
Secondly, the autonomous adaptive mechanism under aerodynamic energy coupling distinguishes EA-TD3 from other algorithms. EA-TD3 maintains a success rate of 87.8% even under the 120 J constraint. This resilience stems from its trajectory reconfiguration capability where the algorithm couples the wind field correction factor with SOC observations. Upon perceiving that energy levels are approaching critical boundaries, the agent identifies regions with lower aerodynamic resistance within the wind field. This decision-making logic moves beyond geometric obstacle avoidance and functions as a physics-driven resource-preserving strategy.
At the 120 J threshold, the performance lead of EA-TD3 over the standard TD3 reaches 61.3 percentage points. Analysis of the successful trajectories in Figure 14 reveals that when facing an energy deficit, the algorithm increases flight range elasticity by suppressing maneuver gradients and optimizing cruise altitudes. These experiments verify that energy boundary sensitivity is a pivotal metric for evaluating trajectory planning frameworks. EA-TD3 demonstrates that explicit energy perception mechanisms serve as a safety guarantee for eVTOL UAV platforms confronting meteorological disturbances, ensuring autonomous airworthiness and mission delivery capability during energy-critical phases.
4.2.3. Analysis 3: Training Convergence Stability and Constraint Satisfaction
This analysis is designed to examine the relationship between decision-making stability and constraint mechanisms by quantifying energy efficiency fluctuations throughout the training evolution. To validate the reliability of the algorithm during the learning process, this section evaluates its statistical performance across multiple independent training sessions. Through the comparative visualization in Figure 15, the distinction between EA-TD3 and the benchmark frameworks regarding energy consumption stability becomes evident.
First, we analyze the impact of soft penalties and hard constraints on training stability. Figure 15a shows that during the training evolution, the energy consumption of DDPG, TD3, and SAC oscillated between 57 J and 70 J. This instability stems from their reliance on an energy efficiency penalty within the reward function, which lacks a mandatory environmental termination condition. Under this mechanism, the agent frequently attempts high-energy maneuvers during exploration without bearing the immediate consequence of mission failure, which causes the strategy to oscillate between energy conservation and aggressive execution. In contrast, EA-TD3 maintained energy consumption within a narrow range of 61 to 63 J. By establishing energy depletion as a hard constraint termination condition and introducing explicit SOC observations, the agent was forced to internalize a risk-averse decision-making logic from the early stages of training. This mechanism ensures that EA-TD3 exhibits policy consistency across independent stochastic trajectory planning trials.
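The difference between the two mechanisms can be made concrete with a minimal sketch of a single environment step. The class and function names, the penalty weight, and the failure reward below are hypothetical placeholders and do not reproduce the reward function of this study. The sketch only contrasts an energy term that lowers the reward with an energy limit that terminates the episode and exposes the SOC to the agent.

```python
from dataclasses import dataclass


@dataclass
class EnergyState:
    budget_j: float
    remaining_j: float

    @property
    def soc(self):
        """State of charge in [0, 1], clipped at zero."""
        return max(0.0, self.remaining_j / self.budget_j)


def soft_penalty_step(state, step_energy_j, base_reward, penalty_weight=0.1):
    """Soft penalty: energy use lowers the reward, but depletion never
    ends the episode and the SOC is not part of the observation."""
    state.remaining_j -= step_energy_j
    reward = base_reward - penalty_weight * step_energy_j
    return reward, False, None


def hard_constraint_step(state, step_energy_j, base_reward,
                         failure_reward=-100.0):
    """Hard constraint: depletion terminates the episode as a failure,
    and the current SOC is returned for inclusion in the observation."""
    state.remaining_j -= step_energy_j
    if state.remaining_j <= 0.0:
        return failure_reward, True, 0.0
    return base_reward, False, state.soc


# One aggressive 130 J maneuver against a 120 J budget.
soft = soft_penalty_step(EnergyState(120.0, 120.0), 130.0, base_reward=1.0)
hard = hard_constraint_step(EnergyState(120.0, 120.0), 130.0, base_reward=1.0)
```

Under the soft penalty the over-budget maneuver only reduces the reward and the episode continues, whereas under the hard constraint the same maneuver ends the episode with a failure signal.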
Secondly, we conducted statistical consistency verification through 50 independent simulations. To further demonstrate this reliability, the box plot in Figure 15b illustrates the statistical distribution characteristics of the trials. EA-TD3 not only achieved the lowest average energy consumption of 62.46 J but also maintained the smallest interquartile range (IQR) among all evaluated algorithms. This demonstrates that its learned energy allocation strategy possesses high repeatability when encountering varying starting coordinates and environmental noise. In contrast, the baseline algorithms exhibit wider bandwidths and outliers, which indicate that in the absence of hard constraints, they are prone to unpredictable high-energy behaviors. Such a level of uncertainty is a challenge for the autonomous operation of eVTOL UAV platforms.
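For reference, the summary statistics behind such a box plot can be computed with the Python standard library. The sketch below uses hypothetical energy samples, and the choice of the inclusive quantile method is an assumption. Plotting libraries may use a slightly different quartile convention, so the values need not match Figure 15b exactly.

```python
import statistics


def energy_summary(samples_j):
    """Mean, median, and interquartile range of per-trial energy use (J)."""
    q1, median, q3 = statistics.quantiles(samples_j, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(samples_j),
        "median": median,
        "iqr": q3 - q1,
    }


# Hypothetical per-trial energies (J): a tight and a dispersed distribution.
tight = [61.8, 62.1, 62.4, 62.6, 63.0]
dispersed = [57.0, 60.5, 63.0, 66.5, 70.0]

tight_stats = energy_summary(tight)
dispersed_stats = energy_summary(dispersed)
```

A smaller IQR, as in the `tight` sample, indicates that the middle half of the trials consumed nearly the same amount of energy, which is the repeatability property discussed above.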
The experimental results demonstrate that relying solely on reward-based penalties is insufficient to cultivate strategies with consistent reliability. Through a hard constraint-driven training paradigm, EA-TD3 induces flight behaviors characterized by high consistency and low volatility, which provides technical support for the safe execution of unmanned missions in confined urban airspaces.
4.3. Ablation Study
To verify the necessity and effectiveness of the energy perception mechanism in DRL-based eVTOL UAV autonomous trajectory planning, this study conducts ablation experiments by increasing environmental fidelity. Unlike the previous stress tests focusing on energy boundaries, these experiments are performed with a sufficient energy budget of 200 J to isolate the impact of three physical factors on trajectory planning performance, including dynamic wind fields, high-fidelity energy consumption models, and complex urban obstacle layouts.
As summarized in Table 15, we designed four simulation tiers progressing from idealized environments to those with physical realism. By introducing key variables of the operational environment, we established a performance benchmark for eVTOL UAV autonomous trajectory planning. This tiered configuration is intended to deconstruct the influence of environmental fidelity on the algorithm decision-making logic to ensure that the agent possesses robustness for transition from simulation to real-world application.
In the Level 1 configuration, the environment consists of a constant wind field, a uniform energy consumption model, and three simplified building obstacles. The mission objective is to reach a single target landmark. This level serves as an idealized trajectory planning environment, which provides a performance baseline for the agent.
In the Level 2 configuration, while maintaining constant wind and simplified obstacles, a realistic energy consumption model derived from the CMU dataset is introduced. This model accounts for the nonlinear power distribution of the eVTOL UAV during maneuvers such as climbing, cruising, and descending. This stage subjects the trajectory planning algorithm to physical performance constraints to test its ability to internalize aerodynamic costs.
In the Level 3 configuration, retaining the high-fidelity energy model, the constant wind field is replaced with a dynamic sinusoidal wind field that fluctuates across spatial and temporal dimensions. This enhancement introduces non-stationary aerodynamic drag, which evaluates the autonomous trajectory planning system trajectory correction capabilities under dynamic environmental uncertainties.
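To illustrate what a wind field that varies sinusoidally in space and time may look like, the following sketch superimposes a sinusoidal gust on a constant base wind. The functional form and all parameter values (base wind, amplitude, wavelength, period) are assumptions chosen for illustration and are not the wind model of this study.

```python
import math


def sinusoidal_wind(x, y, t, base=(1.0, 0.0), amplitude=0.5,
                    wavelength=50.0, period=20.0):
    """Horizontal wind vector (wx, wy) at position (x, y) and time t.

    A constant base wind plus gust components that oscillate with
    position (spatial wavelength) and time (temporal period).
    """
    phase_x = 2.0 * math.pi * (x / wavelength + t / period)
    phase_y = 2.0 * math.pi * (y / wavelength + t / period)
    wx = base[0] + amplitude * math.sin(phase_x)
    wy = base[1] + amplitude * math.cos(phase_y)
    return wx, wy


# Wind at the origin at t = 0, and at the same place a quarter period later.
w0 = sinusoidal_wind(0.0, 0.0, 0.0)
w1 = sinusoidal_wind(0.0, 0.0, 5.0)
```

Because the wind at a fixed location changes between `w0` and `w1`, an agent cannot treat the disturbance as a constant offset, which is the non-stationarity that Level 3 introduces.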
In the Level 4 configuration, representing the peak of environmental fidelity, this level integrates dynamic wind fields, CMU-based power models, six building clusters, and multiple traversal landmarks. This scenario requires the algorithm to execute energy management across multiple task phases while navigating spatial complexity. It serves as an assessment of the eVTOL UAV autonomous operational envelope in urban canyons.
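The four tiers described above can be summarized as a small configuration table. In the sketch below, the field names are illustrative. The obstacle count for Levels 2 and 3 and their single-target mission are assumptions carried over from Level 1, since the text only states that the simplified obstacles are retained.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FidelityLevel:
    level: int
    wind: str             # "constant" or "dynamic_sinusoidal"
    energy_model: str     # "uniform" or "cmu"
    n_obstacles: int      # 3 simplified buildings or 6 building clusters
    multi_waypoint: bool  # single target vs. multiple traversal landmarks


LEVELS = (
    FidelityLevel(1, "constant", "uniform", 3, False),
    FidelityLevel(2, "constant", "cmu", 3, False),            # count assumed
    FidelityLevel(3, "dynamic_sinusoidal", "cmu", 3, False),  # count assumed
    FidelityLevel(4, "dynamic_sinusoidal", "cmu", 6, True),
)


def changed_factors(a, b):
    """Names of the environment factors that differ between two levels."""
    return [f.name for f in fields(a)
            if f.name != "level" and getattr(a, f.name) != getattr(b, f.name)]


# Which factor does each step up in fidelity introduce?
steps = [changed_factors(a, b) for a, b in zip(LEVELS, LEVELS[1:])]
```

Under these assumptions, the step from Level 1 to Level 2 changes only the energy model and the step from Level 2 to Level 3 changes only the wind field, while Level 4 changes the obstacle layout and the mission structure together.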
Four algorithms, DDPG, TD3, SAC, and EA-TD3, were trained independently at each environment level under identical hyperparameter configurations. Each agent underwent training steps per level with an initial energy budget of 200 J. The architecture employed a three-layer neural network with 256 units per layer, a batch size of 256, a learning rate of , and a discount factor . Upon convergence, 50 independent MC simulations were performed at each level for evaluation. Success was defined as reaching the final waypoint without collisions or power depletion.
The quantitative results of the ablation study in Table 16 show that as environmental fidelity increases, EA-TD3 maintains operational resilience. The performance gap between EA-TD3 and the baseline frameworks exhibits a nonlinear expansion as physical constraints tighten.
During Level 1 and Level 2, characterized by simplistic configurations, all algorithms achieved success rates between 96% and 100% under idealized spatial or static energy models. This suggests that in predictable environments without dynamic disturbances, reward penalty mechanisms are sufficient for basic trajectory planning. Consequently, the necessity of an explicit energy perception mechanism is less prominent under these low-fidelity conditions.
However, a performance bifurcation point emerges at Level 3 with the introduction of non-stationary aerodynamic disturbances. Without the coupled perception between wind field correction factors and real-time SOC, the success rates of SAC, TD3, and DDPG decline to a range between 82% and 92%. In contrast, EA-TD3 maintains a success rate of 94% while reducing energy consumption by 13% by leveraging its adaptability to dynamic uncertainties.
Under the Level 4 operational envelope, the results define the survival red line for eVTOL UAV autonomous systems. DDPG shows a performance drop with a success rate of 62% and energy consumption of 46.61 J. Conversely, EA-TD3 maintains a 98% success rate despite the constraints of dynamic wind fields and multi-waypoint transitions, requiring 38.40 J on average. Compared to TD3, EA-TD3 improves the success rate by 23.3 percentage points and achieves a 9.9% gain in energy efficiency. These findings demonstrate that as operational environments shift toward high fidelity, energy awareness is a core safety driver ensuring the mission survivability of unmanned eVTOL UAV platforms in urban canyons.
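Because success rates are compared in percentage points and energy in relative terms, it can help to make the two calculations explicit. The sketch below applies them to the Level 4 figures quoted above for DDPG and EA-TD3. The function names are illustrative, and the derived energy figure follows only from the two quoted averages.

```python
def percentage_point_gap(rate_a_pct, rate_b_pct):
    """Absolute difference between two rates, in percentage points."""
    return rate_a_pct - rate_b_pct


def relative_reduction_pct(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Level 4 values quoted in the text: DDPG 62% / 46.61 J, EA-TD3 98% / 38.40 J.
success_gap_pp = percentage_point_gap(98.0, 62.0)
energy_saving_pct = relative_reduction_pct(46.61, 38.40)
```

With these inputs the success-rate gap is 36 percentage points, and the average energy consumption of EA-TD3 is roughly 17.6% lower than that of DDPG.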
Ablation experiments demonstrate that energy perception mechanisms are necessary in complex operational environments. Figure 16 illustrates the percentage advantage of the EA-TD3 algorithm over the DDPG, SAC, and TD3 models regarding the trajectory planning success rate and energy consumption under different environmental levels. In simplified scenarios such as Level 1 and Level 2, energy optimization managed through reward penalty mechanisms is sufficient for basic mission success. As shown in Figure 16a, the success rate advantage of EA-TD3 remains marginal at these stages as all algorithms achieve performance levels near 100%.
However, as environmental factors introduce non-stationary wind disturbances and increased spatial complexity in Level 3 and Level 4, explicit energy state observation becomes critical for mission survivability. The success rate advantage of EA-TD3 widens at Level 3, where it maintains a 94% success rate while the performance of other algorithms declines to between 82% and 92%. At Level 4, this performance gap expands further. EA-TD3 sustains a 98% success rate while the success rate for TD3 drops to 90%, SAC to 72%, and DDPG to 62%. EA-TD3 outperforms the least effective algorithm by 36 percentage points. These results confirm that for unmanned eVTOL UAV systems operating in high-fidelity environments, energy awareness is a requirement for maintaining the SFE.
This improvement in success rate is accompanied by corresponding differences in energy consumption. Figure 16b illustrates the disparities in energy consumption across the complexity levels. At Level 1, the discrepancy in energy consumption is negligible, and this trend continues through Level 2 where the difference remains minimal. However, at Level 3 and Level 4, EA-TD3 achieves approximately 10% energy savings compared to the baseline algorithms. This margin represents the accumulated efficiency gains in complex multi-waypoint scenarios. The energy-saving advantage and the success rate advantage of EA-TD3 exhibit a synchronized monotonic trend, which expands as the realism of the simulation environment increases.
The ablation experiments provide a foundation for integrating energy-aware reinforcement learning into energy-constrained eVTOL UAV autonomous trajectory planning frameworks. The empirical results demonstrate that energy awareness is more than a supplementary optimization function. Instead, it serves as a core mission capability for eVTOL UAV platforms operating in complex environments, with its significance increasing alongside system complexity. For unmanned eVTOL UAV systems navigating under realistic constraints such as variable dynamic wind fields, urban building clusters, and multi-task landmark sequences, the physical awareness trajectory planning paradigm provided by EA-TD3 is necessary. It ensures mission execution and energy safety management, providing technical support for reliable autonomous flight within future UAM networks.