Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO

Su, Meimei; Chai, Haochen; Zhao, Chunhui; Lyu, Yang; Hu, Jinwen

doi:10.3390/drones9090598

Open AccessArticle

Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO

by

Meimei Su

^†,

Haochen Chai

^†

,

Chunhui Zhao

,

Yang Lyu

^* and

Jinwen Hu

School of Automation, Northwestern Polytechnical University, Xi’an 710129, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Drones 2025, 9(9), 598; https://doi.org/10.3390/drones9090598

Submission received: 4 July 2025 / Revised: 7 August 2025 / Accepted: 14 August 2025 / Published: 26 August 2025

Download

Browse Figures

Versions Notes

Abstract

Obstacle avoidance during high-speed, low-altitude flight remains a significant challenge for unmanned aerial vehicles (UAVs), particularly in unfamiliar environments where prior maps and heavy onboard sensors are unavailable. To address this, we present an entropy-aware deep reinforcement learning framework that enables fixed-wing UAVs to navigate safely using only monocular onboard cameras. Our system features a lightweight, single-frame depth estimation module optimized for real-time execution on edge computing platforms, followed by a reinforcement learning controller equipped with a novel reward function that balances goal-reaching performance with path smoothness under fixed-wing dynamic constraints. To enhance policy optimization, we incorporate high-quality experiences from the replay buffer into the gradient computation, introducing a soft imitation mechanism that encourages the agent to align its behavior with previously successful actions. To further balance exploration and exploitation, we integrate an adaptive entropy regularization mechanism into the Proximal Policy Optimization (PPO) algorithm. This module dynamically adjusts policy entropy during training, leading to improved stability, faster convergence, and better generalization to unseen scenarios. Extensive software-in-the-loop (SITL) and hardware-in-the-loop (HITL) experiments demonstrate that our approach outperforms baseline methods in obstacle avoidance success rate and path quality, while remaining lightweight and deployable on resource-constrained aerial platforms.

Keywords:

deep reinforcement learning; navigation; collision avoidance; depth estimation; monocular vision

1. Introduction

Fixed-wing Unmanned Aerial Vehicles (UAVs) have emerged as key platforms with the development of the Low-Altitude Economy (LAE) [1] and the rise of Urban Air Mobility (UAM) [2]. Due to the extended endurance and high-speed characteristics of fixed-wing UAVs compared to rotary UAVs [3], they are especially preferred in applications such as long-distance cargo delivery [4], wide-area inspection [5], and emergency operations [6]. When a UAV is flying at low altitudes, autonomously avoiding potential obstacles, such as human-made facilities and terrains, becomes a key capability to guarantee its flight safety. Achieving low-altitude obstacle avoidance on a fixed-wing UAV is especially challenging due to its higher speed and inescapably large turning radius compared to other aerial platforms, as shown in Figure 1. It requires not only extended environment perception capabilities to realize early collision warning but also more flexible avoidance path planning subject to more rigorous dynamic constraints, to generate feasible avoidance trajectories.

Classic path planning methods predominantly rely on sampling [7], or optimization techniques [8], which necessitate comprehensive environmental perception, or rely on high-precision prior maps. The high-speed flight property of fixed-wing UAVs often makes it impractical to produce comprehensive maps or collect accurate environmental data of unknown scenarios [9]. Specifically, some scholars utilize costly sensors, such as LiDAR [10], or RADAR [11] to enhance environmental perception. However, these approaches not only increase system complexity and cost but also limit their applicability in scenarios with limited onboard resources. Thus, a lightweight navigation framework that removes the dependency on complete environment information or expensive sensors is imperative.

The visual sensor is a most widely used sensor in robot applications with its unparalleled low Size, Weight, and Power consumption (SWaP) and high sensory resolution. Compared to classic intelligent heuristic algorithms [12,13,14], DRL demonstrates unique strengths in tackling the obstacle avoidance problem with instantaneous visual information. More specifically, the success of DRL in gaming applications [15] has spurred interest in exploring its potential for visual avoidance [16], a domain where DRL further exemplifies its strength as an end-to-end learning framework [17,18]. However, strategic problems still need to be resolved for the integration of DRL into the fixed-wing UAV obstacle problem. Studies [19,20,21] have employed reward functions based on distance metrics to produce favorable results. However, these reward functions only consider target arrival and fail to account for path smoothness, which is particularly crucial for flight platforms such as fixed-wing UAVs. When considering the coupling optimization of visual information and obstacle avoidance, some studies use multi-frame depth maps [22] as the input of deep reinforcement learning collision avoidance, which undoubtedly increases the systematic latency and jeopardizes real-time performance, especially on edge computing platforms, which is a critical index for the high-speed fixed-Wing UAV. Besides the specific strategic problem, a persistent common challenge in utilizing DRL algorithms in robot applications is the imbalance between exploration and exploitation during training. Excessive focus on exploration can hinder algorithm convergence, while insufficient exploration may overlook superior solutions [23].

Keen to address the above challenges in using DRL to achieve fixed-wing obstacle avoidance, we propose a lightweight deep reinforcement learning framework that utilizes single-frame images captured by low-cost vision sensors as input and exploits the advantages of Proximal Policy Optimization (PPO) [24] to effectively address obstacle avoidance tasks, offering the following contributions:

Considering the flight stability and dynamic constraints of fixed-wing UAVs, we propose an optimization framework formulated as an entropy-aware PPO learning model, which incorporates a reward function balancing target approach and path maintenance to ensure smooth and efficient collision avoidance flight trajectories.
We introduce a strategy updating mechanism based on entropy-aware adjustment to address the challenge of local optimization caused by PPO’s reliance on historical data during training. This mechanism ensures that our algorithm identifies obstacle-avoidance strategies with higher success rates.
We demonstrate that the proposed framework outperforms other methods in obstacle avoidance efficiency and flight path smoothness through software-in-the-loop and hardware-in-the-loop experiments, and we confirm the feasibility of running the algorithm on edge devices.

The remainder of this paper is organized as follows. Section 2 reviews related works, and Section 3 presents the problem definition and its mathematical formulation. Section 4 introduces the entropy-aware PPO. Section 5 discusses the computational experiments and their results. Section 6 concludes the paper.

2. Related Work

2.1. Fixed-Wing UAV Collision Avoidance

In contrast to obstacle avoidance algorithms for quadcopters, those for fixed-wing UAVs must account for complex dynamic constraints. For example, a fixed-wing UAV usually has narrower cruise velocity bounds, making it unable to change its velocity abruptly or hover in place like a quadrotor UAV.

Classic obstacle avoidance algorithms, such as Dijkstra [25] and A-star [26], are commonly used in static obstacle environments. However, these methods encounter significant challenges with local minima and often generate trajectories lacking smoothness, particularly in environments with closely spaced obstacles or narrow passages. Another class of algorithms, such as artificial potential field methods [27], RRT [28], and VFH [29], is more suitable for dynamic obstacle environments. These methods do not account for the dynamic constraints of fixed-wing UAVs, which often necessitate extensive post-processing to smooth the generated paths. To mitigate the reliance on post-processing, many researchers have adopted Dubins curves [30] for fixed-wing UAV path planning. Dubins curves employ a combination of straight-line and circular arc segments to generate paths that precisely satisfy the kinematic constraints of fixed-wing UAVs. In addition to Dubins-based planners, several recent works have explored the use of meta-heuristic optimization algorithms that incorporate kinematic constraints, such as maximum curvature, directly into their cost functions. These methods provide greater flexibility and solution diversity in complex and constrained environments, particularly when smooth path generation is crucial for fixed-wing UAVs. For example, ref. [31] proposed a constrained differential evolution algorithm tailored for UAV path planning in disaster scenarios, effectively embedding obstacle and curvature constraints into the planning process. Ref. [32] introduced a multi-strategy fusion differential evolution approach to enhance path feasibility in complex terrains. More recently, ref. [33] developed DE3D-NURBS, a 3D planner that integrates differential evolution with non-uniform rational B-splines (NURBS), enabling curvature-constrained and obstacle-aware trajectories in cluttered 3D spaces. These methods offer robust alternatives to traditional planners by accommodating both environmental complexity and vehicle-specific motion constraints.

Different from the above, which puts the obstacle avoidance idea in the planning layer, there are also a large number of studies that consider the obstacle avoidance module in the control layer, for example, approaches based on Model Predictive Control (MPC) [34] primarily use optimization theory or continuously update waypoints or routes to prevent collision. Nevertheless, this approach necessitates the creation of highly detailed models of aircraft, with the modeling process for aircraft with varying dynamics being inherently distinct. Consequently, the potential for generalization is limited.

The aforementioned methods face significant challenges when sensors provide only partial or incomplete environmental and obstacle information. To address this problem, numerous studies have focused on leveraging learning-based algorithms to solve obstacle avoidance under partially observed or unknown environmental conditions.

2.2. DRL for Visual Navigation

Deep reinforcement learning, a prominent subfield of machine learning, provides a unique advantage in facilitating interactive and adaptive learning within complex and uncertain environments. This capability makes it a preferred approach among researchers aiming to tackle intricate problems that traditional algorithms struggle to address effectively. From the initial deployment of table storage to address discrete state and action spaces, such as SARSA [35] and Q-learning [36], these techniques can effectively address problems with reduced complexity. Subsequently, the concept of value approximation in neural networks led to the development of Deep Q-learning (DQN) [37], Double DQN [38], and Dueling DQN [39]. These algorithms overcome the limitations of previous approaches, effectively enabling the handling of continuous state spaces. Dueling DQN, on the other hand, incorporates an advantageous function to assess the quality of an action within the dual network structure. In contrast to the aforementioned algorithms, which are based on the iterative updating of Bellman value functions, DDPG [40], SAC [41], TRPO [42], and PPO [24] are based on the theory of gradient descent. Among these options, PPO stands out for its stability, usability, and efficiency. The introduction of a clipping loss function serves to enhance the stability of the training process, limiting the magnitude of each policy update and thereby avoiding the potential for drastic policy changes. Compared to TRPO, PPO simplifies the training process by avoiding complex constrained optimization calculations while keeping policy updates within a safe range. In addition to excelling in benchmark tasks, PPO proves effective in addressing complex real-world problems. However, one limitation of PPO is its high data requirement for training and its heavy reliance on historical data, which may lead to excessive dependence on prior experience if early-stage learning data is insufficient.

The combination of DRL and visual information aims to optimize the navigation efficiency and performance of RL by improving the feature extraction and fusion of information from the perception side. Some researchers have utilized RGB images [43], area-segmented images [16], and depth images [22] to guide robots in optimizing navigation and obstacle avoidance strategies. Many works typically rely on comprehensive maps or collect accurate local maps; however, the high-speed flying nature of fixed-wing UAVs often makes it difficult to obtain comprehensive maps or gather accurate data in unknown environments where environmental information is impractical to acquire [16,44,45,46]. Other scholars have enhanced the generalization capabilities of DRL by incorporating human knowledge as prior information, enabling navigation in novel environments with sparse rewards [47]. While a substantial quantity of data can be augmented with human knowledge through supervised learning, the acquisition of strategic learning is frequently constrained by the methodologies employed for label generation. Conversely, the acquisition of human knowledge necessitates a considerable investment of effort.

In summary, integrating deep reinforcement learning with visual information presents a powerful approach to solving robot navigation and obstacle avoidance challenges. However, existing DRL-based navigation and obstacle avoidance algorithms often suffer from an imbalance between learning and utilization, resulting in prolonged convergence times, low efficiency, and suboptimal obstacle avoidance performance. To address this, our approach incorporates a self-regulating entropy mechanism to enhance reinforcement learning performance. Combined with a backpropagation reward mechanism, this approach significantly improves navigation efficiency in unknown obstacle environments for fixed-wing UAVs.

3. Methodology

3.1. Problem Formulation

The monocular vision-based obstacle avoidance problem can be modeled as a Markov Decision Process (MDP), which is characterized by a tuple.

{S, A, P, r}

. At time step t, the fixed-wing UAV collects environmental state variables

s_{t}

using its camera. To ensure realistic flight behavior, the UAV’s forward velocity is constrained within

[v_{\min}, v_{\max}]

, where

v_{\min} >

20 m/s, reflecting the physical limitations of fixed-wing platforms. The policy is trained under these bounds, preventing hovering or zero-velocity behavior that is infeasible in practice. Based on the state

s_{t} \in S

, the UAV selects an action

a_{t}

from the action space

A

. The action

a_{t} \in A

interacts with the environment, generating a reward signal

r_{t}

and resulting in a transition to the next state

s_{t + 1}

. Given a forward speed

v_{t}

and angular velocity

ω_{t}

, this constraint translates to

|ω_{t}| \leq \frac{v_{t}}{r_{\min}}

. This effectively limits the maximum allowable heading change per time step and ensures that the action commands result in curvature-continuous, dynamically feasible trajectories. The objective of the algorithm is to find a policy that maximizes the cumulative reward

\sum_{t = 0}^{\infty} γ^{t} \cdot r_{t}

by selecting actions

a_{t}

that yield the highest expected return at any given time step t.

3.1.1. State Space

The state space encompasses the environmental data collected by the camera and the information regarding the target. This can be represented as

S = \{S^{env}, S^{tar}\}

(1)

where

S^{env}

represents the environment captured by the camera, while

S^{tar}

denotes the features related to the target.

S^{env}

refers to a latent representation obtained from the encoder of an autoencoder network, designed to reduce redundant and adversarial information. In this context, the RGB image captured by the front-view monocamera

i_{RGB}

is processed to extract depth information, as illustrated below

D_{t} = Γ_{depth} (I_{RGB}, θ_{depth}),

(2)

where

D_{t} \subset D \in R^{H \times W}

denotes a depth map with dimensions H (height) and W (width), and

Γ_{depth}

is the depth estimation model with parameter

θ_{depth}

at time step t.

The latent representation is subsequently derived through the convolutional encoding of the current generated depth map. This process, at a given time step t, can be expressed as follows

f_{t} = Γ_{enc} (D_{t}, θ_{e}),

(3)

where

f_{t} \subseteq f \in R^{K}

denotes the latent variable of size K, while

Γ_{enc}

represents the encoding function parameterized by

θ_{e}

. Accordingly,

S^{env}

is derived as follows

S^{env} = [f],

(4)

S^{target}

represents a local goal, which can be expressed as follows

S^{target} = [d, α],

(5)

where d and

α

represent the normalized relative distance and angle to the goal position, respectively. In this context, we consider a 2-dimensional coordinate system, where d is computed as follows

d = \frac{∥ p_{target}, p_{ego} ∥_{2}}{d_{\max}},

(6)

Here,

∥ p_{target} - p_{ego} ∥_{2}

denotes the Euclidean distance between the UAV’s current position and the target. The term

d_{\max}

represents the maximum allowable travel distance in the mission area, acting both as a safety constraint and a normalization factor. By dividing by

d_{\max}

, we ensure that the distance measure

d \in [0, 1]

, which stabilizes reward scaling and improves training convergence.

α

is in radians and is calculated by

α = arctan 2 (\frac{p_{target, y} - p_{ego, y}}{p_{target, x} - p_{ego, x}}) / π,

(7)

where x and y correspond to the longitudinal and lateral axes of the coordinate system, respectively.

3.1.2. Action Space

To adapt to the flight characteristics of fixed-wing UAVs, the action space is composed of waypoints in various directions within the body-fixed coordinate system under a constant altitude system, as well as the continuation of the action from the previous time step. This can be formulated as

A = \{\begin{matrix} w_{t - 1} & if continue last action \\ w_{t} & otherwise \end{matrix},

(8)

where

w_{t}

represents the choosing waypoint and can be calculated as

(x_{t}^{b o d y}, y_{t}^{b o d y}) \overset{Δ}{=} λ \cdot (cos (Δ_{yaw}), sin (Δ_{yaw})),

(9)

where

Δ_{yaw} \in \{0, \pm \frac{π}{6}, \pm \frac{π}{4}, \pm \frac{π}{3}\}

represents the discrete desired change in yaw angle magnitude and

λ

represents the Euclidean distance between the calculated waypoint and the current position.

Remark 1.

In this study, the UAV control input is limited to the yaw angle, as the obstacle avoidance and navigation policy operate in a 2D horizontal plane. The UAV maintains a constant altitude throughout the mission, which is regulated by a separate altitude-hold controller. This design choice is informed by real-world deployment constraints of fixed-wing UAV autopilot systems, particularly PX4 and ArduPilot, both of which typically perform heading changes in discrete steps due to bounded turn rates and control limitations inherent in their autopilot design.

3.1.3. Reward Function

The design of the reward function remains one of the most significant challenges in DRL algorithms. A primary limitation of RL is that reward functions are typically hand-crafted and tailored to specific domains. There has been quite a bit of research on replay buffer, and most of the work provides a way to automatically obtain cost functions from expert demonstrations. However, these approaches are often computationally intensive, and the optimization required to identify a reward function that accurately represents expert trajectories is inherently complex. This paper focuses on designing a denser reward function to enhance the obstacle avoidance strategy, aiming not only to achieve high success rates in avoiding obstacles but also to enable smoother paths. Using the “knowledge” stored in the replay buffer for reward function design means that the agent’s reward is shaped not only by the immediate outcome of its current action but also by its contrast or divergence from past experiences. This encourages information gain, exploration of novel knowledge, and avoidance of redundant experiences or repetitive behaviors. Behavioral guidance toward aligning with more optimal past experiences. In the process of obstacle avoidance, this paper introduces a reward function that incorporates an inference mechanism to ensure robust learning under conditions of general applicability and rapid convergence. Specifically, we use the following four sub-reward terms: target-reaching reward

r_{target}

, collision penalty

r_{collision}

, distance incentive

r_{dis}

, and trajectory-following reward

r_{track}

. Each component is associated with a positive scalar weight

C_{i}

, where

\forall i \in {1, 2, 3, 4}

,

C_{i}

represents the weight of the i-th reward module.

When the drone reaches its designated target, it immediately receives a reward

r_{target}

defined as

r_{target} = \{\begin{matrix} C_{1} & if reaches target \\ 0 & otherwise \end{matrix} .

(10)

If the drone experiences a collision, it incurs a negative reward

r_{collision}

as a penalty defined as

r_{collision} = \{\begin{matrix} - C_{2} & if collision happens \\ 0 & otherwise \end{matrix} .

(11)

The drone should approach the target as quickly as possible, making it necessary to encourage the drone to be closer to the target at time t than at time

t - 1

. The corresponding reward

r_{dis}

is defined as

r_{dis} = Δ d \cdot C_{3},

(12)

Δ d = d_{t - 1} - d_{t},

(13)

where

d_{t}

represents the relative distance between the drone and the target point at time t.

To encourage the UAV to move consistently toward the goal, we introduce a directional alignment reward based on the cosine similarity between the agent’s instantaneous motion vector and the direct vector pointing from its current position to the target. This reward does not rely on a pre-planned global path but instead promotes locally goal-directed behavior at each timestep. Therefore, we designed a reward function

r_{track}

that encourages the drone to follow the planned path while learning to interpret depth information to avoid obstacles. The corresponding reward

r_{track}

is defined as

r_{track} = δ \cdot C_{4},

(14)

δ = \frac{p_{target} \cdot p_{ego}}{{∥p_{target}∥}_{2} \cdot {∥p_{ego}∥}_{2}},

(15)

where

\forall i \in \{1, \dots, 4\}, C_{i} > 0

represents the weight of each reward module.

p_{ego}

denotes the UAV’s current velocity vector, and

p_{target}

denotes the vector from the UAV’s current position to the target. A larger

δ

indicates that the UAV is better aligned with the target direction, thereby contributing to smoother and more efficient navigation. This formulation is particularly suitable for partially observable environments where global trajectories are unavailable.

To encourage the strategy to “move closer to high-quality experience” rather than merely maintaining the entropy of the action, we introduce the distribution of high-value experience in the replay buffer.

\begin{matrix} r_{buffer} (s) = - D_{KL} (π_{E} (\cdot | s) ∥ π_{θ} (\cdot | s)) \\ = - \sum_{a} π_{E} (a | s) log \frac{π_{E} (a | s)}{π_{θ} (a | s)} \end{matrix}

(16)

where

π_{E} (a | s) \propto exp (\frac{r (s, a)}{τ})

.

τ

is a temperature parameter controlling the sharpness of the softmax distribution over rewards with

τ

denoting a temperature parameter that controls the stochasticity of the expert policy. A smaller

τ

results in a more deterministic policy favoring high-reward actions, while a larger

τ

induces a flatter, more exploratory distribution. The reward is taken as the maximum weighted value over all experiences in the buffer. We construct a buffer with a priority queue structure, specifically to retain high-value and high-quality experiences.

The overall reward function is constructed by combining the four aforementioned sub-terms as follows,

r = r_{target} + r_{collision} + r_{dis} + r_{track} + r_{buffer} .

(17)

Remark 2.

The smoothness of the generated path is promoted not through explicit analytical constraints, but through training under physically realistic dynamics and reward components that discourage abrupt directional changes. Quantitative curvature analysis confirms the emergence of smooth and feasible paths appropriate for fixed-wing flight.

4. Lightweight Obstacle Avoidance Using Entropy-Aware PPO

In this section, we design a novel entropy-aware PPO-based lightweight model to solve the fixed-wing UAV obstacle avoidance problem. The framework contains a lightweight backbone, an efficient strategy selection mechanism, and a new optimization objective function.

4.1. Overview

The overall framework is illustrated in Figure 2. A depth map is first generated from a monocular RGB image using ZoeDepth [48]. The resulting depth is encoded by a lightweight backbone [49] to extract visual features, which are concatenated with goal features and input into the policy network to predict actions. The critic network simultaneously estimates state values. An adaptive entropy module regulates the exploration-exploitation balance during training, while the replay buffer is continuously updated based on a tailored reward function to facilitate stable policy optimization. Our method is motivated by previous studies that examined the encoding of depth maps from multiple consecutive frames as state variables in reinforcement learning models for navigation and obstacle avoidance tasks. However, the storage and encoding of multiple frames of depth maps lead to high memory consumption and degrade the real-time performance of the system, rendering this approach unsuitable for fast-moving fixed-wing UAVs equipped with edge computing devices.

To address the challenge of increased memory consumption caused by stacking multiple depth maps and to alleviate potential generalization issues that arise in depth inference, we incorporated a fine-tuned monocular depth estimation model proposed by [48], which is proven to be reliable across a wide range of environments. By fine-tuning this depth model for our specific application, we are able to generate reliable enough depth maps for the following deep reinforcement learning module and at the same time reduce the computational burden of processing multiple depth frames. Additionally, one of our primary objectives was to ensure that the proposed architecture sustained computational efficiency, particularly when deployed on edge devices with limited processing capabilities. To this end, we integrated [49], a model specifically designed for efficient feature extraction, as part of our system architecture. Specifically, it improves feature extraction by performing element-wise multiplication between two linear transformation features, an operation inherently optimized for execution on Neural Processing Unit (NPU) architectures. NPUs are specifically optimized for matrix operations and parallel processing [50], making them particularly suitable for operations involving intensive linear algebra computations. By leveraging the compatibility between the feature fusion mechanism and NPU hardware, we achieved both high performance and low power consumption, which are essential for edge computing environments.

4.2. Strategy Selection Mechanism

In this section, we undertake a detailed examination of the factors that must be taken into account when selecting strategy mechanisms from two distinct perspectives.

4.2.1. Balance Exploration and Exploitation

The challenge of reinforcement learning lies in striking a balance between exploration and exploitation. An excess of exploration can result in situations where the algorithm fails to converge or converges slowly, whereas an excess of exploitation can lead to the disadvantage of local optimality. In the traditional PPO algorithm framework, it uses an importance sampling mechanism to train the model. More importantly, its assumed that the distributions of the training and learning models are consistent. However, the approach can lead to over-dependence of the data of the learning model on the merits of the training data when the data trained by the intelligences are not picked for good strategies, making the learning success rate decrease. To solve the problem, we design a new strategy selection mechanism.

4.2.2. Lowering Sensitivity to Prior Knowledge

When viewed through the lens of prior knowledge, the efficacy of conventional PPO algorithms, along with other deep reinforcement learning techniques, is markedly influenced by the data accumulated in previous iterations. In an effort to mitigate this reliance on prior knowledge and drawing inspiration from maximum entropy methods, we have devised policy mechanisms that are not only robust and stable but also exhibit rapid convergence through the use of self-tuning.

Definition 1.

The strategy entropy in the Markov process affects the balance between exploration and utilization, where for each state and action, the constraint is given as follows.

H (a |s) = e^{\frac{\sum_{t} γ^{t} r_{t}}{r_{e}}} (- \sum_{a} π_{θ} (s, a) log π (s, a)),

(18)

where

r_{e}

denotes an expected value of reward. When the aforementioned entropy H is higher, there is a greater propensity for utilization. Conversely, when entropy is lower, there is a greater propensity for exploration.

Remark 3.

The previously mentioned strategy of entropy allows us to effectively address the challenge of balancing exploration and exploitation. However, in light of the necessity for simplified implementation and reduced computational complexity in engineering, there is a clear need for the development of more sophisticated entropy operators.

To make it learn under conditions that increase its success rate, we design a more generalized strategy entropy mechanism.

H (a |s) = \frac{\sum_{t} M_{s}}{B a t c h} (- \sum_{a} π_{θ} (s, a) log π_{θ} (s, a)),

(19)

where

M_{s}

is the total number of successes in a

B a t c h

, and

B a t c h

represents a set of data samples used when updating a policy or value function.

Lemma 1.

H (π (s, a))

is η-smooth, equipped with the Taylor’s theorem, such that

\begin{matrix} {∥\nabla H (π_{θ} (s, a)) - \nabla H (π_{θ} (s^{'}, a))∥}_{\infty} \\ \leq η {∥π_{θ} (s, a) - π_{θ} (s^{'}, a)∥}_{\infty}, \end{matrix}

(20)

where η is a coefficient.

Theorem 1.

For any

k < N, k \in N

, the entropy can be used as the sample mean as follows,

\begin{matrix} \begin{matrix} H (π_{θ} (s, a)) = \frac{1}{T} \sum_{i = 0}^{T - 1} [(T - 1) V^{m} C_{k} \\ {{∥π_{θ} (s, a) - π_{θ} (s^{'}, a)∥}^{m}]}^{λ}, \end{matrix} \end{matrix}

(21)

where

C_{k} = {[\frac{k!}{(k + 1 - [λ])!}]}^{λ}

and

V^{m} = π^{\frac{m}{2}} /(\frac{m}{2} + 1)!

is the volume of the unit ball

B (0, 1)

in

R^{m}

. Furthermore, it holds that

\begin{matrix} lim_{N \to \infty} H_{N}^{k} (π_{θ} (s, a)) = H (π_{θ}^{L} (s, a)), \end{matrix}

(22)

where

π^{L} (s, a)

denotes the Lebesgue measure.

Proof.

Equipped with Lemma 1, let

π^{*} = arg max_{π \in Π} H (π_{θ} (s))

\begin{matrix} H (π_{θ}^{*} (s, a)) - H (π_{θ} (s, a)) \\ \leq κ exp (- T η) + 2 β σ + ζ . \end{matrix}

(23)

Note that

H (s)

is finite if

π_{θ} (s)

is of bounded support. Indeed, consider the imposed smoothing on

H (π_{θ} (s, a))

, and we have

\begin{matrix} H (π_{θ} (s, a)) \geq max_{π_{θ} \in Π} H (π_{θ} (s, a)) - σ, \end{matrix}

(24)

and

\begin{matrix} {∥\nabla H (π_{θ} (s, a))∥}_{\infty} \leq κ = e^{\frac{\sum_{t} γ^{t} r_{t}}{E x p (R)}}, \end{matrix}

(25)

where

T \geq 10 ζ κ log 10 ζ

.

Hence,

H (π_{θ} (s, a))

tends to be the support of

π_{θ}^{L} (s, a)

. This concludes the proof. □

4.3. Entropy-Aware PPO Method

Based on the preceding analysis of entropy and its role in reinforcement learning, we developed an optimized Proximal Policy Optimization (PPO) framework that incorporates entropy-aware regulation. By dynamically adjusting the entropy coefficient during training, the proposed method effectively balances the trade-off between exploration and exploitation. This adaptive mechanism enhances policy robustness, improves convergence stability, and ensures safer decision-making, which is particularly critical for lightweight obstacle avoidance in fixed-wing UAVs. The loss function of the traditional PPO algorithm is mainly based on the advantage function. To improve the universality of the algorithm, an advantage function based on the reward function in Equation (17) is designed to make the algorithm a closed loop.

\begin{matrix} A_{θ} (s_{t}, a_{t}) \overset{Δ}{=} Q_{θ} (s_{t}, a_{t}) - V_{θ} (s_{t}) \\ = E_{s_{t}, a_{t}} (\sum_{l} r_{t + l}) - E_{s_{t}} (\sum_{l} r_{t + l}), \end{matrix}

(26)

where θ is the vector of policy parameters before the update, and

Q_{θ} (s_{t}, a_{t})

and

V_{θ} (s_{t})

are the inferring action-value function and inferring value function, which can be obtained with the reward function in Equation (17). Therefore,

A_{θ} (s_{t}, a_{t})

is an inferring advantage function at time step t.

In the PPO algorithm, the importance sampling mechanism is used to control the updating range of the policy while optimizing the policy so as to avoid the problem of drastic changes caused by excessive updating. However, during the training process, we usually use some old strategies that have been trained to collect samples, rather than using the latest strategies that are currently available. This leads to the problem of mismatch between the sample and the current policy, which is called “policy offset”. In order to ensure the exploration of better solutions when the distribution gap between the two datasets is large, we no longer assume that the distribution of the old and new strategies is similar, and we encourage the exploration of new strategies when the distribution difference between the old and new strategies is large. Therefore, this paper explores and utilizes data considering the distribution of old and new strategies.

\begin{matrix} \begin{matrix} E_{(s_{t}, a_{t}) \sim π_{θ}} [A_{θ} (s_{t}, a_{t}) \nabla log π_{θ} (a_{t}^{n} |s_{t}^{n})] \\ = E_{(s_{t}, a_{t}) \sim π_{θ^{'}}} [\frac{π_{θ} (a_{t} |s_{t})}{π_{θ^{'}} (a_{t} |s_{t})} \frac{π_{θ} (s_{t})}{π_{θ^{'}} (s_{t})} A_{θ} (s_{t}, a_{t}) \\ \nabla log π_{θ} (a_{t}^{n} |s_{t}^{n})], \end{matrix} \end{matrix}

(27)

where

θ^{'}

represents the vector of policy parameters after the update,

π_{θ}

and

π_{θ^{'}}

are the old and new strategies, respectively.

In traditional PPO implementations, the process of training is influenced by a fixed hyperparameter which determines the exploration magnitude. In this paper, a new PPO method with an inferring reward mechanism and adaptive entropy is introduced, which incorporates a dynamic scaling of the entropy coefficient based on the recent return obtained by the agent. Based on the above discussion, the final loss function can be written in the following form,

\begin{matrix} L_{θ}^{C L I P} = E_{(s_{t}, a_{t}) \sim π_{θ}^{'}} [\min (\frac{π_{θ} (a_{t} |s_{t})}{π_{θ^{'}} (a_{t} |s_{t})} \frac{π_{θ} (s_{t})}{π_{θ^{'}} (s_{t})} A_{θ} (s_{t}, a_{t}), \\ c l i p (\frac{π_{θ} (a_{t} ∣ s_{t})}{π_{θ^{'}} (a_{t} ∣ s_{t})}, 1 - ε, 1 + ε) A_{θ} (s_{t}, a_{t}))], \end{matrix}

(28)

\begin{matrix} L_{θ}^{V F} = {(V_{θ} (s_{t}) - V_{t}^{target})}^{2}, \end{matrix}

(29)

\begin{matrix} L_{θ}^{E N T} = E_{(s_{t}, a_{t}) \sim π_{θ}} [H (π_{θ} (a_{t} | s_{t}))], \end{matrix}

(30)

\begin{matrix} L^{E A} = L^{C L I P} - w_{1} L^{V F} + w_{2} L^{E N T}, \end{matrix}

(31)

L_{total} = L^{E A} + λ \cdot E_{s \sim B} [\sum_{a} π_{E} (a | s) log π_{θ} (a | s)]

(32)

where

B

is the replay buffer, i.e., a set of collected experiences;

(s, a)

represents a sampled experience from the buffer consisting of the state and action.

w_{1}

weights the policy entropy term, which encourages exploration by penalizing low-entropy (i.e., overly deterministic) policies;

w_{2}

weights the extrinsic reward term, which guides the agent toward task-specific goals such as obstacle avoidance and target reaching.

\nabla_{θ} L_{total} = \nabla_{θ} L^{E A} + λ \sum_{s \in B} \sum_{a} π_{E} (a | s) \nabla_{θ} log π_{θ} (a | s) .

(33)

It can be seen from the gradient that the guiding strategy approaches the frequent actions in the buffer. When the strategy and the replay buffer are very close, the gradient approaches 0 and will not interfere with the normal optimization of the PPO algorithm. When the current strategy moves away from the buffer, it will push the strategy parameter θ towards the buffer, that is, it will imitate the actions that frequently occur in experience. When the initial tendency of the strategy is to explore, buffer experience can accelerate convergence and avoid getting trapped in random or inefficient behaviors. The pseudo-code of the algorithm can be found in Algorithm 1.

Algorithm 1 Entropy-aware PPO for UAV obstacle avoidance

Initialization: Initialize actor network

π_{θ}

, critic network

V_{ϕ}

.

Initialize high-quality experience buffer

B_{high_quality}

.

Define environment and parameters: max_episodes, max_timesteps.

Define hyperparameters: learning rate α, discount factor γ, PPO clip ϵ, weights

w_{1}, w_{2}

, guidance coefficient λ.

Load pre-trained lightweight depth model

Γ_{depth}

and encoder

Γ_{enc}

.

for episode = 1 to max_episodes do

Reset environment and UAV position

p_{ego}

; clear episode buffer

B_{episode}

.

for timestep t = 1 to max_timesteps do

Capture RGB image

I_{RGB}

; compute depth map

D_{t} = Γ_{depth} (I_{RGB})

.

Encode visual features

f_{t} = Γ_{enc} (D_{t})

.

Get relative distance d and angle α to the target; form state

s_{t} = [f_{t}, d, α]

.

Get action probability and value:

a_{t} \sim π_{θ} (s_{t})

,

V_{ϕ} (s_{t})

.

Execute

a_{t}

; observe

s_{t + 1}

, collision status

c_{status}

, goal status

g_{status}

.

Compute rewards:

r_{target} = C_{1}

if target reached, else 0;

r_{collision} = C_{2}

if collision, else 0;

r_{distance} = C_{3} \cdot (d_{t - 1} - d_{t})

;

r_{track} = C_{4} \cdot cos (angle (p_{ego}, p_{target}))

Total reward:

r_{total} = r_{target} + r_{collision} + r_{distance} + r_{track}

Store

{s_{t}, a_{t}, r_{total}, log π_{θ} (a_{t} | s_{t})}

in

B_{episode}

if

r_{total}

is high or successful transition then

Add transition to

B_{high_quality}

end if

if collision, target reached, or timestep limit then

break

end if

end for

for

k = 1

to

K_{epochs}

do

Compute advantage estimates

A_{t}

using GAE

Compute adaptive entropy:

H (a | s) = \frac{M_{s}}{B} \cdot (- \sum π_{θ} (a | s) log π_{θ} (a | s))

PPO clipped loss:

L^{CLIP} = E [\min (ρ_{t} A_{t}, clip (ρ_{t}, 1 - ϵ, 1 + ϵ) A_{t})]

Value function loss:

L^{VF} = {(V_{ϕ} (s_{t}) - V_{target})}^{2}

Entropy loss:

L^{ENT} = E [H (a | s)]

Replay buffer imitation loss:

L^{BUFFER} = D_{KL} (π_{E} ∥ π_{θ})

Total loss:

L_{total} = L^{CLIP} - w_{1} L^{VF} + w_{2} L^{ENT} - λ L^{BUFFER}

Update θ, ϕ; using Adam on

L_{total}

Clear

B_{episode}

end for

Output: Trained actor policy

π_{θ}

5. Experimental Validation

To evaluate its effectiveness, three experimental setups are designed in this section. First, we design an ablation experiment to separately demonstrate the impact of the designed reward function and the update mechanism. Second, we demonstrate the superiority of the proposed method through comparisons with other deep reinforcement learning algorithms. Finally, a hardware-in-the-loop simulation experiment is conducted to verify the deployment capability of the proposed framework on edge devices, while also comparing it with classic sample-based methods.

5.1. Training Settings

We conduct the training on a machine equipped with an Intel Xeon E5-2678 V3 CPU and two NVIDIA RTX 3090 GPU. A high-fidelity simulator, AirSim (v 1.8.1) [51] build on Unreal Engine (v UE4.27), is employed to build the different environments and provide data including RGB images captured by its camera and fixed-wing UAV’s position. The fixed-wing UAV’s dynamics model is provided by JSBSim (v 1.2.0) [52], an open source platform widely regarded for its high accuracy in modeling aerodynamics and flight physics. The specific model used in this work is the Skywalker X8 which is from Skywalker Technology Co., Ltd., Shenzhen, China, a popular choice for its stability and versatility in various flight scenarios. The neural network models are established using the PyTorch (v 3.6) framework.

We conduct the training of the proposed method with the parameters shown in Table 1 within a 1000 m by 600 m rectangular urban environment constructed using UE. In the experiments, target points are randomly selected from three predefined flight paths. The parameters listed in Table 1 were selected based on a combination of UAV platform constraints, established practices in the reinforcement learning literature, and empirical tuning. Specifically, the air speed and maximum flight distance (

d_{\max}

) were set to reflect the operational characteristics of high-speed fixed-wing UAVs commonly reported in previous studies [1,53]. The input depth map resolution (

224 \times 224

) is consistent with widely used visual encoders, providing sufficient spatial information while maintaining computational efficiency. (

C_{1}, C_{2}, C_{3}, C_{4}

) were initially determined through heuristic design and then refined via ablation studies. These weights were tuned to ensure a balanced influence of goal-reaching incentives, collision penalties, heading smoothness, and directional alignment. The PPO hyperparameters, including learning rate, discount factor (γ), clip range (ϵ), batch size, and epoch number, were initialized using standard values reported in prior works [24]. A light grid search was conducted to improve convergence speed and policy stability in our environment. Coefficients for value loss (

w_{1}

) and entropy regularization (

w_{2}

) were selected to balance policy exploration and value fitting. Overall, this combination of physically realistic parameters and principled tuning contributes to the stable and effective training of the proposed navigation policy.

As shown in Figure 3, varying numbers of obstacles are distributed along the three flight paths. The variation in obstacle density across the routes simulates the challenge of avoidance faced by fixed-wing UAVs in environments with different levels of obstacle density. Image data are collected using a simulated camera provided by AirSim, generating color images with a resolution of 480 × 640 for the depth estimation module.

5.2. Ablation Studies

In this section, we study the importance of various design modules in our framework.

5.2.1. Inferred Reward Function

To evaluate the contribution of the reward structure to trajectory quality and overall performance, we conduct both visual and quantitative analyses of the designed reward modules. Beyond the commonly used distance-based term

r_{target}

,

r_{collision}

and

r_{dis}

, our framework introduces two additional components: the trajectory alignment reward

r_{track}

and the backtracking penalty

r_{buffer}

, which are critical for ensuring directional stability and forward motion efficiency, respectively. Table 2 presents the results of an ablation study where each of these modules is removed in turn. When the

r_{track}

term is excluded, the UAV shows a drastic decline in reach rate (from

81 %

to

21 %

) and a sharp increase in episode length, indicating frequent detours and unstable trajectories. This suggests that without directional guidance, the UAV struggles to maintain course alignment even if it avoids collisions. Similarly, removing the

r_{buffer}

term leads to degraded performance, as reflected in both reach rate (dropping to

16 %

) and longer flight durations. This highlights the importance of penalizing regressions or inefficient backtracking during navigation. Due to the large performance gap caused by different reward configurations, plotting the results as curves would result in poor visual comparability. Therefore, we adopt a table format to clearly illustrate the impact of each reward component on overall behavior. Collectively, these results validate the effectiveness of incorporating both

r_{track}

and

r_{buffer}

into the reward formulation, enabling the UAV to produce more stable, efficient, and dynamically feasible trajectories.

We evaluate the contribution of the designed reward function to the smoothness and stability of flight trajectories. First, we employ only the distance reward function, assigning it a weight of

C_{3}

= 1 and referring to this configuration as the distance model. Subsequently, we compare the smoothness of the flight trajectories generated by the distance model and the proposed model along three predefined flight paths. As illustrated in Figure 4b,d,f, the stability that is observed in the proposed model’s trajectories can be attributed to the carefully designed reward function, which balances multiple factors such as obstacle avoidance and path smoothness. In contrast, the trajectories generated by the model using only

r_{dis}

exhibit more abrupt directional changes, as highlighted by the jagged red solid lines in Figure 4a,c,e. Such rapid course corrections can impose additional strain on the fixed-wing UAV’s control system, which can lead to potential instability, particularly in complex urban environments. By incorporating a smoothness criterion into the reward structure, the proposed model effectively reduces the necessity for drastic course adjustments, enabling the UAV to follow a more fluid and consistent path.

This improvement in flight stability is critical in real-world scenarios, where maintaining smooth trajectories helps minimize energy consumption and ensures safer navigation through dynamic and uncertain environments. Thus, the results clearly demonstrate the superiority of the proposed reward function in producing smoother and more stable flight paths for fixed-wing UAVs, ultimately enhancing overall flight performance in obstacle-rich environments.

5.2.2. Adaptive Entropy

To evaluate the impact of the proposed Adaptive Entropy on the obstacle avoidance tasks, we design a strategy comparison experiment. We, respectively, train the PPO models with entropy weights of 0.01 and 0.001, and then compared their rewards per episode during training with the proposed method. The rewards obtained during training are shown in Figure 5. The lower weight model shows a gradual improvement in performance, starting with relatively low rewards and steadily increasing as the training progresses, indicates moderate variability, suggesting consistent learning behavior. The higher weight, while starting at a lower initial reward, exhibits a steady increase over time, indicating higher variability in performance, particularly in the earlier stages of training. Our proposed method demonstrates the fastest learning curve, with cumulative rewards increasing rapidly in the early episodes. By the end of the training, it converges at the highest cumulative reward. The shaded area around the orange curve is relatively narrow, indicating low variability and suggesting that the proposed method is more stable and consistent across different episodes.

These results indicate that the proposed method outperforms PPO model with different entrophy weights in terms of cumulative rewards, demonstrating its effectiveness in navigating the fixed-wing UAVs through environments with obstacles. Additionally, the faster convergence of the proposed method shows its potential for quicker policy learning, making it suitable for practical applications where rapid adaptation is critical.

5.3. Policy Comparison

To evaluate the effectiveness of our proposed fixed-wing UAV obstacle avoidance method, we conducted tests in three distinct scenarios named Scene I (City), Scene II (Line-cruising), and Scene III (Valley).

We compare our proposed method with several established reinforcement learning algorithms, including PPO, TRPO, A3C, DQN, and DDPG. All algorithms are tested in the same initial simulation environment, with repeated trials in each scenario to ensure robust results. In each scenario, the agent’s task is to fly from a starting position to a target without colliding with any obstacles. The performance of each algorithm is measured using the task completion rate (Success Rate, %), which represents the percentage of trials in which the agent successfully completed the task. Each test is repeated 100 times per scenario to minimize the impact of randomness on the results.

Table 3 presents the task completion rates for our proposed method and the other strategies across the three scenarios. The results demonstrate that our proposed approach outperformed the baseline algorithms in all scenarios. Specifically, in the City and Line-cruising scenarios, our method achieved task completion rates of 86.0% and 80.0%, respectively, which are higher than those of the other algorithms. In the more complex Valley scenario, our approach still maintain a strong performance with a 74.0% success rate. In comparison, the PPO algorithm achieves

82.0 %

,

76.0 %

, and

69.0 %

in the three scenarios, which is slightly lower than our method. Other algorithms such as TRPO, A3C, DQN, and DDPG perform relatively worse, with success rates below

80.0 %

in all scenarios. Notably, in the Valley scenario, DDPG exhibit the lowest success rate of

62.0 %

.The experimental results indicate that our proposed method is capable of effectively handling different levels of obstacle complexity across various environments, achieving higher task completion rates than existing reinforcement learning strategies. The superior performance of the proposed method, particularly in complex environments, may be the result of its adaptive strategy optimization and efficient exploration mechanism. While the performance of all algorithms is comparable in simpler scenarios, such as Line-cruising, our method shows a significant advantage in more complex scenarios like City and Valley.

In addition to comparing various strategies of deep reinforcement learning, we also compared them with the existing sample-based methods in the same scenario as described in Section 5.3. In Scene I, the fixed-wing UAV in an urban environment, where it have to fly through densely packed buildings and avoid structural obstacles as illustrated in Figure 6a. In Scene II, simulating a power line inspection in mountainous terrain, the fixed-wing UAV’s primary task is to avoid obstacles such as mountainous ridges and power poles while maintaining proximity to power lines as depicted in Figure 6b. In Scene III, the fixed-wing UAV face a desert canyon with dynamic terrain changes, where the algorithm has to account for steep ascents and descents while avoiding natural formations such as cliffs and ridges as shown in Figure 6c. However, it is important to note that our algorithm incorporates a field-of-view (FOV) constraint, which restricts the UAV’s perception and decision-making to only the local visible region. This mimics realistic onboard sensor limitations. As a result, in complex environments such as Scene III, where obstacles may occlude the target or create narrow corridors, the agent may temporarily lose access to the global goal location or choose a safer, reachable subgoal within its visible area. The comparison of flight trajectories is presented in Figure 6, showcasing the performance of both algorithms across the three environments. In Scenario I, the proposed algorithm generates a smoother and more compact trajectory, as shown in Figure 7a. In contrast, the sample-based method tends to select detour regions with fewer obstacles, resulting in a longer path. The corresponding angular velocity curve in Figure 7a indicates that, although the proposed method exhibits higher initial control effort, it converges more quickly to a stable state, demonstrating superior adaptability in cluttered environments. In Scenario II, where obstacles are sparsely distributed, the sample-based method produces a trajectory that is closer to the straight-line reference path, as depicted in Figure 7b. In Scenario III, although the trajectory generated by the sample-based method appears smoother in Figure 7c, its maneuverability is fundamentally constrained by local sampling under field-of-view (FOV) limitations. As a result, it fails to execute the sharp turns required to navigate through the terrain. The angular velocity curve in Figure 7c clearly reveals abrupt control variations in the sample-based method. In contrast, the proposed algorithm maintains consistent maneuverability and successfully identifies feasible turning paths despite the FOV constraints.

The corresponding quantitative results are summarized in Table 4. In Scene I, the proposed method achieves a higher success rate and a shorter average path length compared to the baseline, while maintaining similar angular smoothness, indicating better obstacle-aware planning in urban environments. In Scene II, where obstacles are more sparse and globally observable, the sample-based method performs competitively in terms of success rate and path efficiency. However, in Scene III that is characterized by complex terrain and frequent occlusions the sample-based method fails to reach the target reliably, as reflected by the missing entries (“–”) in Table 4. This failure is primarily due to the following two critical factors considered in our design: the dynamics constraints of fixed-wing UAVs and the field-of-view (FOV) limitation imposed during planning. Since the sample-based method relies on sampling only within the currently visible region and lacks global reasoning, it is prone to deviating from the goal and selecting unreachable paths, as evident from the blue trajectory in Figure 6c, which veers away from the canyon passage. In contrast, the proposed DRL-based algorithm effectively incorporates visibility and motion constraints, enabling robust navigation even in such challenging scenarios.

5.4. Hardware-in-the-Loop Simulation

A hardware-in-the-loop simulation experiment is conducted to demonstrate the deployability of the proposed algorithm. Additionally, a comparison with sample-based algorithms [54] is made to validate the performance of the proposed approach. The simulation platform consists of a computer equipped with an Intel i5-13600KF CPU and an NVIDIA RTX 4070Ti SUPER GPU, acting as the primary simulation unit. The onboard edge computing platform, the OrangePi 5B, equipped with a Rockchip RK3588s processor and a Neural Processing Unit (NPU), is used to execute real-time inference of the trained model. The model is initially trained and converted to the RKNN format using the RKNN-Toolkit2 (v2.1.0) and deployed via the Python (v3.10) API. The experimental platform is shown in Figure 8 and validation scenes are the same as the aforementioned Policy Comparison tests.

The quantitative analysis of the results is presented in Figure 9, where the X and Y coordinate distributions of the flight trajectories are plotted for each scene. The proposed algorithm’s path (solid line) is compared with the expected trend (dashed line) across all three environments. As shown in Figure 9a, the deviations between the proposed path and the reference straight-line path were minimal, indicating that the learned policy closely follows the intended direction in less challenging environments. In Scene II, the proposed method produced a noticeably smoother path with fewer oscillations, especially in the Y coordinate distribution, demonstrating its superiority in densely cluttered environments, as depicted in Figure 9b. This trend contrasts sharply with the results shown in Figure 9c, where the greatest divergence between the two methods is evident. The proposed algorithm’s ability to execute sharp turns and navigate narrow spaces enabled it to successfully complete the flight path, in contrast to the sample-based method, which struggled in this scenario.

Overall, the experimental results show that the proposed algorithm outperforms the sample-based method in more complex environments (Scene II and Scene III), particularly in terms of path smoothness and path length. While the sample-based method performs better in simpler environments with sparse obstacles (Scene I), the proposed algorithm demonstrates greater adaptability and robustness in challenging, real-world scenarios. These findings suggest that the proposed algorithm’s ability to make quick decisions in real-time, demonstrating its potential for deployment in real-world edge-based fast moving fixed-wing UAV systems where obstacle avoidance is critical.

6. Conclusions

In this paper, we present a lightweight DRL framework that leverages inferred single-frame depth maps as input and employs a compact network architecture to address the obstacle avoidance challenges of high-speed fixed-wing UAVs. Our framework incorporates an inferring reward function to account for the stability and dynamic constraints of fixed-wing UAVs, along with an adaptive entropy-based strategy update mechanism to balance exploration and exploitation during training. The proposed method is evaluated in various scenarios through hardware-in-the-loop simulations and compared with other reinforcement learning algorithms. Experimental results demonstrate that our framework significantly outperforms baseline methods in terms of obstacle avoidance effectiveness and path smoothness.

Despite these promising results, the current approach has certain limitations. The use of inferred depth maps may impact obstacle detection accuracy, particularly for small or fast-approaching obstacles. In future work, we plan to validate the algorithm on a real VTOL fixed-wing UAV platform to assess its performance in real-world environments. Furthermore, we aim to improve the framework’s adaptability and robustness by incorporating multi-agent cooperation for distributed obstacle avoidance, applying meta-learning techniques to enable rapid adaptation to unseen scenarios, and integrating uncertainty-aware decision-making to handle partial observability in complex environments. We also intend to embed energy-efficient navigation strategies to improve operational endurance and explore sim-to-real transfer methods to enhance the reliability of deployment in practical UAV missions.

Author Contributions

Conceptualization, M.S.; methodology, M.S.; software, M.S.; validation, M.S. and H.C.; formal analysis, M.S.; investigation, M.S.; resources, M.S.; data curation, M.S.; writing—original draft preparation, M.S.; writing—review and editing, M.S., H.C., C.Z., Y.L. and J.H.; visualization, H.C.; supervision, C.Z., Y.L., and J.H.; project administration, M.S.; funding acquisition, C.Z., H.C. contributed to algorithm discussion and visual effect implementation. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Open Fund of the National Natural Science Foundation of China (No. 62073264) and the Key Research and Development Project of Shaanxi Province (No. 2021ZDLGY01-01).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Huang, C.; Fang, S.; Wu, H.; Wang, Y.; Yang, Y. Low-Altitude Intelligent Transportation: System architecture, infrastructure, and key technologies. J. Ind. Inf. Integr. 2024, 42, 100694. [Google Scholar] [CrossRef]
Cohen, A.P.; Shaheen, S.A.; Farrar, E.M. Urban air mobility: History, ecosystem, market potential, and challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 6074–6087. [Google Scholar] [CrossRef]
Lyu, Y.; Wang, W.; Chen, P. Fixed-Wing UAV Based Air-to-Ground Channel Measurement and Modeling at 2.7 GHz in Rural Environment. IEEE Trans. Antennas Propag. 2024, 73, 2038–2052. [Google Scholar] [CrossRef]
Zhang, A.; Xu, H.; Bi, W.; Xu, S. Adaptive mutant particle swarm optimization based precise cargo airdrop of unmanned aerial vehicles. Appl. Soft Comput. 2022, 130, 109657. [Google Scholar] [CrossRef]
Lungu, M. Backstepping and dynamic inversion combined controller for auto-landing of fixed wing UAVs. Aerosp. Sci. Technol. 2020, 96, 105526. [Google Scholar] [CrossRef]
Wang, C.; Wei, Z.; Jiang, W.; Jiang, H.; Feng, Z. Cooperative Sensing Enhanced UAV Path-Following and Obstacle Avoidance with Variable Formation. IEEE Trans. Veh. Technol. 2024, 73, 7501–7516. [Google Scholar] [CrossRef]
Karaman, S.; Walter, M.R.; Perez, A.; Frazzoli, E.; Teller, S. Anytime motion planning using the RRT. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1478–1483. [Google Scholar]
Mercy, T.; Van Parys, R.; Pipeleers, G. Spline-based motion planning for autonomous guided vehicles in a dynamic environment. IEEE Trans. Control Syst. Technol. 2017, 26, 2182–2189. [Google Scholar] [CrossRef]
Wu, J.; Wang, H.; Liu, Y.; Zhang, M.; Wu, T. Learning-based fixed-wing UAV reactive maneuver control for obstacle avoidance. Aerosp. Sci. Technol. 2022, 126, 107623. [Google Scholar] [CrossRef]
Muñoz-Bañón, M.Á.; Velasco-Sanchez, E.; Candelas, F.A.; Torres, F. OpenStreetMap-based autonomous navigation with lidar naive-valley-path obstacle avoidance. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24428–24438. [Google Scholar] [CrossRef]
Popov, A.; Gebhardt, P.; Chen, K.; Oldja, R. Nvradarnet: Real-time radar obstacle and free space detection for autonomous driving. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 6958–6964. [Google Scholar]
Mandloi, D.; Arya, R.; Verma, A.K. Unmanned aerial vehicle path planning based on A* algorithm and its variants in 3D environment. Int. J. Syst. Assur. Eng. Manag. 2021, 12, 990–1000. [Google Scholar] [CrossRef]
Ma, H.; Meng, F.; Ye, C.; Wang, J.; Meng, M.Q.-H. Bi-Risk-RRT based efficient motion planning for autonomous ground vehicles. IEEE Trans. Intell. Veh. 2022, 7, 722–733. [Google Scholar] [CrossRef]
Wang, J.; Li, T.; Li, B.; Meng, M.Q.-H. GMR-RRT*: Sampling-based path planning using gaussian mixture regression. IEEE Trans. Intell. Veh. 2022, 7, 690–700. [Google Scholar] [CrossRef]
Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef]
Wu, K.; Wang, H.; Esfahani, M.A.; Yuan, S. Learn to navigate autonomously through deep reinforcement learning. IEEE Trans. Ind. Electron. 2021, 69, 5342–5352. [Google Scholar] [CrossRef]
Xue, Y.; Chen, W. A UAV navigation approach based on deep reinforcement learning in large cluttered 3D environments. IEEE Trans. Veh. Technol. 2022, 72, 3001–3014. [Google Scholar] [CrossRef]
Kulhánek, J.; Derner, E.; Babuška, R. Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning. IEEE Robot. Autom. Lett. 2021, 6, 4345–4352. [Google Scholar] [CrossRef]
Wu, J.; Zhou, Y.; Yang, H.; Huang, Z.; Lv, C. Human-guided reinforcement learning with sim-to-real transfer for autonomous navigation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14745–14759. [Google Scholar] [CrossRef] [PubMed]
Huang, H.; Zhu, G.; Fan, Z.; Zhai, H.; Cai, Y.; Shi, Z.; Dong, Z.; Hao, Z. Vision-based distributed multi-UAV collision avoidance via deep reinforcement learning for navigation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 13745–13752. [Google Scholar]
Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
de Jesus, J.C.; Kich, V.A.; Kolling, A.H.; Grando, R.B.; Guerra, R.S.; Drews, P.L.J. Depth-CUPRL: Depth-imaged contrastive unsupervised prioritized representations in reinforcement learning for mapless navigation of unmanned aerial vehicles. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
Yuan, M.; Pun, M.-O.; Wang, D. Rényi state entropy maximization for exploration acceleration in reinforcement learning. IEEE Trans. Artif. Intell. 2022, 4, 1154–1164. [Google Scholar] [CrossRef]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML)–Deep Reinforcement Learning Workshop, Sydney, Australia, 6–11 August 2017. [Google Scholar]
Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271. [Google Scholar] [CrossRef]
Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
Warren, C.W. Global path planning using artificial potential fields. In Proceedings of the 1989 IEEE International Conference on Robotics and Automation, Scottsdale, AZ, USA, 14–19 May 198; IEEE: Piscataway, NJ, USA, 1989; pp. 316–317. [Google Scholar]
Noreen, I.; Khan, A.; Habib, Z. Optimal path planning using RRT* based approaches: A survey and future directions. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 1–10. [Google Scholar] [CrossRef]
Babinec, A.; Duchoň, F.; Dekan, M.; Pásztó, P.; Kelemen, M. VFH* TDT (VFH* with Time Dependent Tree): A new laser rangefinder based obstacle avoidance method designed for environment with non-static obstacles. Robot. Auton. Syst. 2014, 62, 1098–1115. [Google Scholar] [CrossRef]
McLain, T.; Beard, R.W.; Owen, M. Implementing Dubins Airplane Paths on Fixed-Wing UAVs; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
Yu, C.; Li, J.; Zhou, J. A constrained differential evolution algorithm to solve UAV path planning in disaster scenarios. Knowl.-Based Syst. 2020, 204, 106209. [Google Scholar] [CrossRef]
Chai, Z.; Zheng, J.; Xiao, L.; Yan, B.; Qu, P.; Wen, H.; Wang, Y.; Zhou, H.; Sun, H. Multi-strategy fusion differential evolution algorithm for UAV path planning in complex environment. Aerosp. Sci. Technol. 2022, 121, 107287. [Google Scholar] [CrossRef]
Freitas, E.J.; Cohen, M.W.; Neto, A.A.; Guimarães, F.G.; Pimenta, L.C. DE3D-NURBS: A differential evolution-based 3D path-planner integrating kinematic constraints and obstacle avoidance. Knowl.-Based Syst. 2024, 300, 112084. [Google Scholar] [CrossRef]
Lindqvist, B.; Mansouri, S.S.; Agha-mohammadi, A.-A.; Nikolakopoulos, G. Nonlinear MPC for collision avoidance and control of UAVs with dynamic obstacles. IEEE Robot. Autom. Lett. 2020, 5, 6001–6008. [Google Scholar] [CrossRef]
Zhao, D.; Wang, H.; Shao, K.; Zhu, Y. Deep reinforcement learning with experience replay based on SARSA. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6. [Google Scholar]
Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
Hernandez-Garcia, J.F.; Sutton, R.S. Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. arXiv 2019, arXiv:1901.07510. [Google Scholar] [CrossRef]
Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Wang, Z.; Schaul, T.; Hessel, M.; Van Hasselt, H.; Lanctot, M.; De Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; PMLR: New York, NY, USA, 2016; pp. 1995–2003. [Google Scholar]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, PR, USA, 2–4 May 2016. [Google Scholar]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; pp. 1861–1870. [Google Scholar]
Lapan, M. Deep Reinforcement Learning Hands-On; Packt Publishing Ltd.: Birmingham, UK, 2018. [Google Scholar]
Padhy, R.P.; Sa, P.K.; Narducci, F.; Bisogni, C.; Bakshi, S. Monocular vision-aided depth measurement from RGB images for autonomous UAV navigation. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 1–22. [Google Scholar] [CrossRef]
Martini, M.; Cerrato, S.; Salvetti, F.; Angarano, S.; Chiaberge, M. Position-agnostic autonomous navigation in vineyards with deep reinforcement learning. In Proceedings of the 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), Mexico City, Mexico, 20–24 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 477–484. [Google Scholar]
Lu, Y.; Chen, Y.; Zhao, D.; Li, D. MGRL: Graph neural network based inference in a Markov network with reinforcement learning for visual navigation. Neurocomputing 2021, 421, 140–150. [Google Scholar] [CrossRef]
Chai, R.; Niu, H.; Carrasco, J.; Arvin, F.; Yin, H.; Lennox, B. Design and experimental validation of deep reinforcement learning-based fast trajectory planning and control for mobile robot in unknown environment. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5778–5792. [Google Scholar] [CrossRef] [PubMed]
Jiang, H.; Zhou, Y.; Lin, J.; Xu, Z.; Lv, C.; Liu, Y. Temporal knowledge-aware soft actor-critic for robot navigation. IEEE Trans. Ind. Inform. 2021, 17, 7431–7439. [Google Scholar]
Bhat, S.F.; Birkl, R.; Wofk, D.; Wonka, P.; Müller, M. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv 2023, arXiv:2302.12288. [Google Scholar]
Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5694–5703. [Google Scholar]
Tan, T.; Cao, G. Deep learning on mobile devices through neural processing units and edge computing. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications, Virtual, 2–5 May 2022; pp. 1209–1218. [Google Scholar]
Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
Berndt, J. JSBSim: An open source flight dynamics model in C++. In Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit, Providence, RI, USA, 16–19 August 2004; p. 4923. [Google Scholar]
Zhang, Z.; Wu, J.; Dai, J.; He, C. A novel real-time penetration path planning algorithm for stealth UAV in 3D complex dynamic environment. IEEE Access 2020, 8, 122757–122771. [Google Scholar] [CrossRef]
Paranjape, A.A.; Meier, K.C.; Shi, X.; Chung, S.-J.; Hutchinson, S. Motion primitives and 3D path planning for fast flight through a forest. Int. J. Robot. Res. 2015, 34, 357–377. [Google Scholar] [CrossRef]

Figure 1. Simulation scenarios and fixed-wing UAV model used for training and validating. Red lines indicate obstacle-avoidance paths. Full video link: https://youtu.be/DXP54UI2lbE (accessed on 23 November 2024).

Figure 2. The proposed obstacle avoidance framework for fixed-wing UAVs. Visual features extracted from monocular RGB images are fused with target information and used for action generation and value estimation.

Figure 3. Training flight paths. The yellow six-pointed stars represent the targets, the red star indicates the fixed-wing UAV’s take-off position, and the purple line represents the expected flight path.

Figure 4. Comparison of obstacle avoidance flight trajectories under different reward functions. Red lines indicate UAV trajectories generated by DRL policies; blue arrows represent desired trajectories toward targets; green dashed lines show inferred depth maps during navigation. (a,c,e) use only

r_{dis}

; (b,d,f) use the proposed reward function.

Figure 4. Comparison of obstacle avoidance flight trajectories under different reward functions. Red lines indicate UAV trajectories generated by DRL policies; blue arrows represent desired trajectories toward targets; green dashed lines show inferred depth maps during navigation. (a,c,e) use only

r_{dis}

; (b,d,f) use the proposed reward function.

Figure 5. Training cumulative rewards comparison. The solid lines represent the average rewards of our algorithms and baselines per episode, while the shaded areas indicate the variability in the reward accumulation for each method.

Figure 6. HIL comparison between proposed method and sample based method in different scenes. (a) Scene I. (b) Scene II. (c) Scene III. The red line represents the flight path generated by our proposed method, while the blue line represents the sample based method.

Figure 7. Comparison of angular velocity profiles in three test scenes. (a) Scene I. (b) Scene II. (c) Scene III. Each plot shows the angular velocity (rad/sample) of the sample-based method (orange) and the proposed method (blue-green). The proposed method achieves smoother convergence and better turn-handling performance across different environments.

Figure 8. Hardware-in-the-loop (HIL) platform structure. The platform consists of hardware components, where the Orange Pi 5B and PC communicate via an Ethernet connection. The software components include the offboard control module and simulation scene module, used for control and decision validation in a simulated environment.

Figure 9. HIL flight path and coordinate distributions across three distinct scenes. (a) Scene I. (b) Scene II. (c) Scene III. Each scene shows the fixed-wing UAV’s path in a virtual environment along with its X and Y coordinate distributions compared to the expected trend.

Table 1. Common parameter settings for PPO and proposed algorithm.

Parameter	Value
Air Speed (m/s)	30
Depth Map Size ( $H, W$ )	224, 224
Reward Term Weight ( $C_{1}, - C_{2}, C_{3}, C_{4}$ )	30, −30, 0.5, 1.0
Flying Distance Cap $d_{m a x}$ (m)	1300
Learning Rate	0.0003
Gamma ( $γ$ )	0.95
Clip Range ( $ϵ$ )	0.3
K Epochs	2
Batch Size	2048
Value Loss Coefficient ( $w_{1}$ )	0.5
Entropy Loss Coefficient ( $w_{2}$ )	0.1
Max Timesteps Per Episode	60
Max episodes	3000
State Dimension	256
Action Dimension	8

Table 2. Ablation study of

r_{t r a c k}

,

r_{b u f f e r}

, and entropy-aware loss. ✓ indicates the module is used; – indicates not used. Bold values indicate the best performance in each column.

Table 2. Ablation study of

r_{t r a c k}

,

r_{b u f f e r}

, and entropy-aware loss. ✓ indicates the module is used; – indicates not used. Bold values indicate the best performance in each column.

Scheme	$r_{track}$	$r_{buffer}$	Entropy-Aware Loss	Collision Rate	Fail Reach	Episode Length
A (baseline)	–	–	–	73%	16%	69.8
B	✓	✓	–	17%	15%	151.4
C	✓	–	–	13%	21%	173.7
D	–	✓	–	9%	11%	130.4
E (complete)	✓	✓	✓	5%	9%	121.6

Table 3. Task completion results of different obstacle avoidance strategies in different scenes. ↓ means that the arrows in the table headers indicate that the values are sorted in descending order. Bold values indicate the best performance in each column.

Success Rate (%, ↓)	Scene I: City	Scene II: Line-Cruising	Scene III: Valley
Proposed	86.0	80.0	74.0
PPO	82.0	76.0	69.0
TRPO	80.0	74.0	68.0
A3C	78.0	72.0	66.0
DQN	77.0	70.0	64.0
DDPG	76.0	68.0	62.0

Table 4. Quantitative comparison across three representative scenes. Bold values indicate the best performance in each column.

Indicator	Scene I	Scene II	Scene III
Success Rate (Proposed)	83%	80%	74%
Success Rate (Baseline)	72%	86%	–
Avg. Path Length (m) (Proposed)	952.3	1980.2	1727.8
Avg. Path Length (m) (Base)	1080.6	1806.3	–

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Su, M.; Chai, H.; Zhao, C.; Lyu, Y.; Hu, J. Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO. Drones 2025, 9, 598. https://doi.org/10.3390/drones9090598

AMA Style

Su M, Chai H, Zhao C, Lyu Y, Hu J. Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO. Drones. 2025; 9(9):598. https://doi.org/10.3390/drones9090598

Chicago/Turabian Style

Su, Meimei, Haochen Chai, Chunhui Zhao, Yang Lyu, and Jinwen Hu. 2025. "Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO" Drones 9, no. 9: 598. https://doi.org/10.3390/drones9090598

APA Style

Su, M., Chai, H., Zhao, C., Lyu, Y., & Hu, J. (2025). Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO. Drones, 9(9), 598. https://doi.org/10.3390/drones9090598

Article Menu

Lightweight Obstacle Avoidance for Fixed-Wing UAVs Using Entropy-Aware PPO

Abstract

1. Introduction

2. Related Work

2.1. Fixed-Wing UAV Collision Avoidance

2.2. DRL for Visual Navigation

3. Methodology

3.1. Problem Formulation

3.1.1. State Space

3.1.2. Action Space

3.1.3. Reward Function

4. Lightweight Obstacle Avoidance Using Entropy-Aware PPO

4.1. Overview

4.2. Strategy Selection Mechanism

4.2.1. Balance Exploration and Exploitation

4.2.2. Lowering Sensitivity to Prior Knowledge

4.3. Entropy-Aware PPO Method

5. Experimental Validation

5.1. Training Settings

5.2. Ablation Studies

5.2.1. Inferred Reward Function

5.2.2. Adaptive Entropy

5.3. Policy Comparison

5.4. Hardware-in-the-Loop Simulation

6. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI