1. Introduction
Unmanned aerial vehicle (UAV) guidance is a fundamental technology in the field of aerospace. Since the inception of guided UAV systems, considerable research effort has focused on improving both guidance precision and control effectiveness [1]. Among existing approaches, proportional navigation (PN) generates acceleration commands for tracking by reducing the line-of-sight (LOS) angular rate between the UAV and the target. Its simplicity and computational efficiency have led to its widespread use in UAV guidance systems [2]. Despite these advantages, PN exhibits inherent limitations in scenarios involving nonlinear dynamics, sensor interference, and maneuvering targets. Specifically, as the UAV approaches a target, the LOS angular rate grows sharply, which amplifies the commanded acceleration and any measurement noise, ultimately reducing tracking accuracy [3]. These limitations are even more pronounced when targets perform evasive maneuvers or when onboard sensors experience noise and command delays. In such situations, conventional PN methods struggle to ensure robust performance [4].
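For reference, the classical PN law discussed above can be sketched in a few lines; the planar geometry, the helper name `pn_command`, and the navigation constant `N = 3` are illustrative assumptions rather than details from this paper.

```python
import math

def pn_command(r, v_rel, N=3.0):
    """Planar proportional navigation: a_cmd = N * Vc * lambda_dot.

    r     -- (x, y) position of the target relative to the UAV
    v_rel -- (vx, vy) velocity of the target relative to the UAV
    N     -- navigation constant (illustrative value)
    """
    rx, ry = r
    vx, vy = v_rel
    r2 = rx * rx + ry * ry
    los_rate = (rx * vy - ry * vx) / r2        # lambda_dot from the 2D cross product
    vc = -(rx * vx + ry * vy) / math.sqrt(r2)  # closing speed Vc = -(r . v) / |r|
    return N * vc * los_rate

# Head-on geometry with a small lateral velocity component.
cmd = pn_command((1000.0, 0.0), (-100.0, 10.0))
```

On a perfect collision course the LOS rate, and hence the command, is zero, which illustrates why PN produces little corrective action when the geometry is already aligned.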
To overcome these limitations, guidance laws based on classical control strategies, such as sliding mode control and optimal control formulations, have been studied [5,6]. These methods are designed to satisfy additional constraints, such as terminal collision angle, field-of-view (FOV) limits, and time and energy efficiency [7]. In [8], a non-singular terminal sliding mode guidance law was developed to accommodate angular constraints during the terminal phase. In [9], collision time and collision angle control strategies were introduced to improve tracking effectiveness in specific scenarios. These methods outperform PN in structured or deterministic environments, but their practical implementation remains complex. Many of these control laws rely heavily on accurate system modeling and precise state estimation, so maintaining their performance in real-world environments is difficult under sensor noise, disturbances, and uncertain target behavior. Moreover, optimization-based derivations require real-time numerical solvers or iterative computations [10], which can impose a significant computational burden on systems with limited onboard processing capacity [11].
In recent decades, computational resources and machine learning technologies have advanced rapidly, leading to growing interest in data-driven strategies such as deep learning. In these approaches, neural networks are used to approximate optimal control policies or to reproduce classical guidance laws. In [12], a deep neural network was trained via supervised learning to mimic the behavior of PN. In [13], a regression-based prediction model was trained to minimize the final miss distance. These approaches effectively reduce computational load, but their performance remains limited by poor generalization to conditions not seen during training.
To address these limitations, deep reinforcement learning (DRL) has recently attracted attention as an alternative. Unlike supervised learning, DRL promotes autonomous exploration within the action space, allowing the agent to maximize its cumulative reward [14] and to improve its policy over many learning iterations. These characteristics make it particularly suitable for control problems with high-dimensional state and action spaces, such as missile guidance [15]. Furthermore, DRL is well suited to robust and adaptive guidance because it can adapt through learning in uncertain, noisy, and variable environments [16].
Various DRL algorithms such as the deep deterministic policy gradient (DDPG) [
17], proximal policy optimization (PPO) [
18], and soft actor–critic (SAC) [
19] have been applied to missile guidance problems [
20,
21,
22]. These methods have been shown to work effectively even under stochastic disturbances or uncertain target behaviors. In [23], DRL-based guidance laws were shown to achieve lower miss distances than conventional PN in environments with sensor noise. In [24], a kernel-based online RL framework capable of real-time adaptation to maneuvering targets was proposed and showed improved performance in some scenarios.
However, existing studies have largely been limited to two-dimensional models [24]; relatively little effort has addressed the full complexity of three-dimensional tracking. Moreover, the range of initial conditions is often narrow, which can cause the policy to overfit to a small region of the state space and hinders adaptability to changes in the actual environment. In addition, most training and validation environments assume ideal observation conditions [25], making it difficult to guarantee policy robustness against uncertainties such as sensor noise and observation errors that arise during actual operation.
In particular, many studies restrict the range of initial condition settings [14,26], which can overfit the policy to a narrow state space and reduce its adaptability to environmental changes. Reward function design has also often failed to reflect physical criteria for improving guidance performance: in many studies, the reward is constructed from the final distance or simple weightings, and designs that incorporate the zero-effort miss (ZEM), which quantitatively represents actual terminal performance, are rare [25,26].
This study addresses these issues by proposing a DRL-based guidance law in both two-dimensional and three-dimensional environments. To this end, the SAC algorithm is utilized; SAC is an off-policy, entropy-regularized, and sample-efficient method known for its performance in continuous control tasks. The UAV–target tracking problem is formulated as a Markov decision process (MDP) [27] for both 2D and 3D tracking scenarios, with explicit inclusion of sensor noise and unseen initial configurations. Unlike prior works limited to 2D cases or narrow state-space generalization, this study constructs a comprehensive training and evaluation framework using OpenAI Gym [28], with models exposed to wide-ranging and randomized conditions. Numerical experiments verify that the learned policy responds effectively to various noise levels and target movements.
The contributions of this study are summarized as follows:
Noise-contaminated DRL environments for 2D/3D tracking: The UAV–target tracking scenarios in both 2D and 3D are formulated as MDPs while explicitly injecting Gaussian observation noise into LOS and range measurements, enabling robustness to sensor uncertainty to be learned during training.
Development of robust guidance laws using the SAC: The SAC algorithm is employed to construct continuous action guidance policies that remain stable across varying noise levels, demonstrating robustness even under disturbances 50 times stronger than those used during training.
Reward function based on zero-effort miss (ZEM): The reward function is constructed directly from the zero-effort miss (ZEM), providing a physically grounded terminal-accuracy metric that improves precision in final tracking phases where the LOS rate diminishes and conventional PN becomes less effective.
Full 3D MDP and wide initial condition: The guidance problem is extended to full 3D with yaw and pitch LOS dynamics and is trained across a broad range of random initial conditions, including yaw/pitch angles up to ±90°, ensuring generalization across diverse tracking geometries.
Comprehensive comparison with PN-based guidance: Through extensive simulations, the proposed method is compared with traditional PN under various termination distances and noise levels, showing superior performance, especially in close-range tracking and noisy conditions.
The remainder of this paper is structured as follows: The 2D and 3D kinematic models of the UAV and target are defined in
Section 2. In
Section 3, the DRL-based guidance law is proposed by defining the UAV follow situation as the MDP. The numerical simulations are conducted with the proposed method and the results are analyzed in
Section 4. Finally,
Section 5 concludes this paper.
4. Validations
In this section, extensive simulations are performed to validate the performance of the proposed guidance law. Then, the results are analyzed.
4.1. Simulation Setup
This subsection describes the environment for training and validation. The actor and critic network structures of SAC in this study are shown in
Table 1. Each network consists of an input layer, two hidden layers with 256 neurons each, and an output layer. The actor network receives only the observations as input and generates the actions, while the critic network generates Q-values from the observations and actions. The hyperparameters for network training are shown in
Table 2. The total time step required for learning in each environment is set to
and
, respectively.
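A minimal sketch of the actor and critic forward passes with the layer sizes described above (two hidden layers of 256 units) may clarify the structure; the ReLU hidden activation, tanh output squashing, and random initialization are illustrative assumptions, since Table 1 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """One fully connected layer with small random weights (illustrative init)."""
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

def mlp_forward(x, layers, out_act=None):
    """Forward pass: ReLU hidden layers, linear or tanh output."""
    *hidden, last = layers
    for W, b in hidden:
        x = np.maximum(x @ W + b, 0.0)            # ReLU (assumed activation)
    W, b = last
    y = x @ W + b
    return np.tanh(y) if out_act == "tanh" else y

obs_dim, act_dim = 8, 2                           # 3D case: 8 observations, 2 actions
actor  = [dense(obs_dim, 256), dense(256, 256), dense(256, act_dim)]
critic = [dense(obs_dim + act_dim, 256), dense(256, 256), dense(256, 1)]

obs = rng.uniform(-1.0, 1.0, obs_dim)             # normalized observation
action = mlp_forward(obs, actor, out_act="tanh")  # action squashed into [-1, 1]
q_value = mlp_forward(np.concatenate([obs, action]), critic)
```

The tanh output keeps the action in the normalized range before it is scaled to a physical acceleration command.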
The 2D and 3D kinematic model-based environments are constructed using the OpenAI Gym Python module. The Gym module [28] is a popular platform for training and testing in the RL area. The 4th-order Runge–Kutta method is used for integration with a time interval of 0.01 s in the 2D and 3D environments used for learning and verification. This integration time step (0.01 s) is used in most guidance studies [32,33,34]; it maintains computational efficiency and a realistic guidance update rate while providing sufficient numerical accuracy [32]. The observations measured in the Gym environment are normalized to a fixed range and utilized in the training and validation of the proposed method. The action output from the policy is normalized to the range −1 to +1 to improve the scalability of the proposed method. In addition, to evaluate the robustness of the proposed method against sensor uncertainty, synthetic observation noise is intentionally injected into the measurement vector before it is passed to the actor and critic networks. The default Gym-based simulation does not include sensor disturbance models, and real sensor noise characteristics (e.g., seeker LOS jitter, IMU drift, radar angle/velocity errors) are not publicly available due to military confidentiality. Therefore, we adopt a widely used and physically reasonable assumption in the guidance literature and model measurement disturbances as zero-mean Gaussian noise.
õ = o + η,  η ∼ N(0, σ²)
where o is the noise-free observation generated from the kinematic environment, η is the injected zero-mean Gaussian noise with standard deviation σ, and õ is the corrupted observation. This noise is added immediately before feeding the state vector into both the actor and critic networks, ensuring that the learned policy remains robust to corrupted measurements.
The noise levels used in this study follow two approaches. Training: a mild noise level of 0.001 is applied to help the policy learn stable feature extraction without destabilizing early training. Evaluation: all three noise levels (0.001, 0.01, 0.05) are tested to assess robustness under progressively harder sensing conditions, with 0.05 representing a disturbance 50 times stronger than the training condition.
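The integration, normalization, and noise-injection steps described above can be sketched as a single simulation update; the constant-velocity placeholder dynamics and the variable names are hypothetical, while the 0.01 s step and the 0.001 training noise level follow the text.

```python
import numpy as np

DT = 0.01      # integration step (s), as stated in the paper
SIGMA = 0.001  # training-phase noise standard deviation

def rk4_step(f, x, dt=DT):
    """Classical 4th-order Runge-Kutta step for x' = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def observe(x, lo, hi, rng):
    """Normalize the state to [-1, 1] and inject zero-mean Gaussian noise."""
    o = 2.0 * (x - lo) / (hi - lo) - 1.0
    return o + rng.normal(0.0, SIGMA, o.shape)

# Placeholder constant-velocity relative dynamics (illustrative only).
f = lambda x: np.array([x[2], x[3], 0.0, 0.0])

rng = np.random.default_rng(0)
x = np.array([1000.0, 500.0, -100.0, -50.0])      # [rx, ry, vx, vy]
x = rk4_step(f, x)
obs = observe(x, lo=np.array([0.0, 0.0, -200.0, -200.0]),
              hi=np.array([2000.0, 2000.0, 200.0, 200.0]), rng=rng)
```

The noisy, normalized `obs` vector is what the actor and critic networks would consume at each control cycle.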
First, the DRL-based guidance method is validated in a 2D environment. This confirms that the overall framework is properly structured. Next, the proposed method is extended to a 3D environment, and its performance is verified. Each procedure is performed with the settings shown in
Table 3 and
Table 4.
In these tables, the tracking scenario is defined using the kinematic model and the generation of guidance commands. The control effort required for command tracking is not considered. Since system dynamics are not modeled, aerodynamic effects such as Mach number variation and altitude-dependent behavior are neglected.
In this study, the UAV is modeled using the kinematic tracking framework. The objective is to evaluate the guidance algorithm independently of a specific airframe or propulsion system. The maneuvering capability is represented by bounded lateral acceleration limits (±20 g in 2D and ±40 g in 3D). These limits define the achievable turn capability. This abstraction enables fair evaluation under controlled constraints.
4.2. Validation in 2D Environment
In this subsection, validation is performed in the 2D environment and the results are analyzed. The proposed method in the 2D tracking environment is validated through an evaluation phase using a DRL model that has undergone a learning phase. The initial conditions for the learning and evaluation phases are shown in
Table 3. Since the main focus of this study is on the 3D tracking environment, the 2D validation serves to verify the normal operation of the entire framework.
The objective of the proposed method is to reduce the final miss distance to within 3 m. If the final distance between the UAV and the target is below this threshold, the trial is considered successful. Conversely, if the distance exceeds 3 m or the tracking takes more than 40 s, it is counted as a failure. The observations used in the evaluation phase contain Gaussian noise with standard deviations up to 0.05, whereas the training phase uses a smaller standard deviation of 0.001.
All methods were evaluated using 1000 Monte Carlo simulations under randomized initial conditions. The figures show only representative trajectories for clarity. The reported success rates and statistical results are computed from the full set of simulations.
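The success criterion and the Monte Carlo statistics can be expressed as a short helper; the five sample trials below are synthetic placeholders, not results from the paper.

```python
import statistics

def is_success(miss_distance, flight_time, miss_max=3.0, t_max=40.0):
    """A trial succeeds if the final miss is within 3 m and within 40 s."""
    return miss_distance < miss_max and flight_time <= t_max

# Synthetic (miss distance, flight time) pairs standing in for 1000 Monte Carlo runs.
trials = [(1.2, 22.0), (2.8, 31.5), (3.4, 25.0), (0.9, 41.0), (1.7, 18.2)]
successes = [is_success(m, t) for m, t in trials]
rate = sum(successes) / len(trials)
mean_miss = statistics.mean(m for m, _ in trials)
```

In the paper's evaluation, the same aggregation would be applied over the full set of 1000 randomized runs per noise level.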
As shown in Figure 4, Figure 5 and Figure 6, the evaluation results demonstrate the robustness and generalization capability of the proposed DRL-based guidance law. In the trajectory and state plots, each colored solid line corresponds to an individual simulation episode, and curves with the same color represent the trajectory and associated state evolution of that episode. This representation allows consistent interpretation of the guidance behavior under varying initial conditions and noise realizations. Although the policy was trained under a limited range of initial yaw angles (±40°, with 10° intervals), the evaluation phase tested a much wider range (±90°, with 1° intervals). Despite this extrapolation to previously unseen states, the method consistently achieved success rates above 90%. This indicates that the learned policy effectively captures general guidance principles rather than overfitting to specific training conditions.
In addition, robustness to observation noise was confirmed. As the standard deviation of the Gaussian noise increased from 0.001 to 0.05, the performance degradation was minimal, with only a marginal drop in success rate. This demonstrates that the SAC-based policy remains stable under noisy sensor conditions and suggests its potential for real-world deployment.
The acceleration command output from the actor network was also analyzed. As shown in
Figure 3, the network's output is normalized between −1 and +1 and then scaled by a factor of 20 g to generate the final control command. As shown in
Figure 4, the acceleration command is relatively smooth and continuous. This suggests that the DRL-based model exhibits physically realistic behavior and numerically stable control performance. A closer analysis reveals some chattering in the acceleration command; however, this command is ultimately a guidance command in a system that separates guidance and control, so the chattering can be sufficiently attenuated by the control loop.
Meanwhile, Figure 4, Figure 5 and Figure 6 show the verification results with a noise level of 0.001, i.e., the noise condition applied during training. Similar behavior was observed when noise levels of 0.01 and 0.05 were applied, with action chattering increasing by approximately 2 to 2.5 times. The decrease in the success rate as noise increased was around 2%.
Furthermore, as shown in
Figure 6, the reason
increases rapidly when reaching the end point is related to the LOS rate (
). That is, according to Equation (
2), as the distance between the UAV and the target decreases, the
increases. This causes
for tracking at the end point to change significantly.
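The growth of the LOS rate as the range shrinks can be checked numerically with the standard planar relation λ̇ = v_perp / r (which may differ in exact form from Equation (2)); the values below are illustrative.

```python
def los_rate(v_perp, r):
    """Planar LOS rate: lambda_dot = v_perp / r
    (perpendicular relative velocity divided by range)."""
    return v_perp / r

# For a fixed 10 m/s perpendicular velocity, shrinking the range
# by a factor of ten grows the LOS rate by the same factor.
rates = [los_rate(10.0, r) for r in (1000.0, 100.0, 10.0, 1.0)]
```

This inverse dependence on range is what makes the final meters of tracking the most demanding phase for any LOS-rate-based command.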
In summary, the proposed SAC-based DRL guidance framework demonstrates high generalization ability and robustness. Despite training under limited conditions, it maintains consistent performance over a wide range of initial conditions and noise levels. Furthermore, the output acceleration commands are confirmed to be physically feasible. The overall framework of the proposed method has thus been validated in the 2D environment and is next extended to the 3D environment.
4.3. Validation in 3D Environment
In this subsection, validation is performed in the 3D environment and the results are analyzed. Similar to the validation in the 2D environment, the proposed guidance law is trained and validated in the 3D environment. Because the 3D environment is more challenging than the 2D environment, the evaluation for the 3D environment is performed in three phases, as shown in
Table 4.
The success criterion is consistent with the 2D case: if the final distance between the UAV and the target is less than 3 m, the trial is considered a success. If the UAV diverges or the tracking takes more than 40 s, it is counted as a failure. Gaussian noise is added to the observations just before they are passed into the policy network.
The evaluation results in the 3D environment confirm the generalization capability and robustness of the proposed SAC-based guidance law. The evaluation progresses from phase 1 to phase 3. As a result, the initial conditions become increasingly diverse. This is especially true in phase 3, where the UAV’s direction is randomized. Despite this, as shown in
Table 5, the proposed method maintains consistently high success rates above 90% in phases 1 and 2, and over 80% even in phase 3. This demonstrates that the learned policy is not overfitted to a narrow set of initial conditions, but can effectively adapt to unseen configurations in a high-dimensional state space.
In terms of noise robustness, only a minimal degradation in success rate (about 1–2%) was observed when the standard deviation of the Gaussian noise increased from 0.01 to 0.05. This indicates that the policy network learned to extract stable and effective control behavior even in the presence of substantial observation noise, reinforcing the practical applicability of the method under sensor uncertainty.
The control actions generated by the policy network were also analyzed. The outputs of the actor network are normalized in the range −1 to +1 and scaled by a factor of 40 g to produce actual acceleration commands. As shown in
Figure 7,
Figure 8 and
Figure 9c, both the yaw- and pitch-channel acceleration commands remain smooth and continuous over time, without abrupt oscillations or instability. This suggests that the learned policy respects the underlying UAV kinematics and produces physically meaningful control commands suitable for real-time use.
To benchmark the proposed approach, simulations were conducted using the traditional PN law under identical scenarios and noise conditions. All guidance methods used the same UAV kinematic model, velocity, and acceleration limits. This ensures a fair comparison. Therefore, performance differences are attributed to the guidance algorithms rather than airframe capability.
The terminal distance threshold is an important evaluation criterion. As shown in
Table 5, both methods achieve similar success rates under relaxed terminal conditions (e.g., 5 m), where precise terminal correction is not required and conventional guidance laws perform adequately. However, differences appear under stricter terminal conditions (e.g., 3 m and 2 m). PN generates acceleration commands proportional to the LOS angular rate; as the UAV approaches the target, the LOS rate increases, which amplifies the control input. As a result, terminal accuracy degrades and success rates decrease.
In contrast, the proposed method is trained to minimize the ZEM, which represents the predicted terminal miss distance. This allows the policy to maintain effective corrective control in the terminal phase. Therefore, the proposed method achieves higher success rates under strict terminal conditions and observation noise. In summary, a comprehensive performance evaluation was conducted using multiple terminal distance thresholds, demonstrating that the proposed method remains effective under strict terminal precision requirements.
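Under a constant-velocity assumption, the ZEM can be computed as the closest-approach distance |r × v| / |v|; this is a generic illustrative sketch, not the paper's exact reward formulation.

```python
import numpy as np

def zero_effort_miss(r, v):
    """ZEM under a constant-velocity assumption: the closest-approach
    distance if neither vehicle maneuvers, |r x v| / |v|."""
    r, v = np.asarray(r, float), np.asarray(v, float)
    return np.linalg.norm(np.cross(r, v)) / np.linalg.norm(v)

# Head-on 3D geometry with a 30 m lateral offset: the ZEM equals the offset.
zem = zero_effort_miss([1000.0, 30.0, 0.0], [-100.0, 0.0, 0.0])
```

Because the ZEM stays well defined even as the LOS rate diverges near intercept, a reward built on it gives the policy a meaningful terminal-accuracy signal where PN degrades.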
In addition to the success rates summarized in
Table 5, the performance analysis also includes statistical measures computed over 1000 Monte Carlo runs for each noise level. For the SAC-guided runs, the terminal miss distance averaged 1.52 ± 0.41 m, while the tracking time converged to 24.7 ± 3.9 s across all successful trials. These results confirm that the proposed policy not only maintains success rates exceeding 90% under all tested noise conditions, but also achieves low variability in both terminal accuracy and tracking duration, even when the observation noise is increased by a factor of 50.
To analyze the performance of the proposed method more closely, trajectories and actions are plotted for a noise level of 0.05. As shown in
Figure 10, Figure 11 and Figure 12, the target is effectively intercepted even though the noise level is 50 times higher than that applied during training (0.001). Despite the significantly increased noise, the chattering of the actions produced by the DRL model remains relatively small. The proposed technique effectively counters noise, and further training with reward shaping to attenuate action chattering is expected to yield even better performance.
In summary, it was confirmed that the proposed DRL-based guidance method achieves performance comparable to the PN-based guidance law in environments where the observations are corrupted by noise, and that it can be used effectively for missions requiring closer approach to the target.
4.4. Real-Time Implementability
The feasibility of deploying the learned policy on embedded avionics is evaluated by examining the network size, per-cycle computational load, and compatibility with real-time processors typically used in aerospace and defense-grade embedded systems. The actor network contains an input layer (4 units in 2D, 8 units in 3D), two fully connected hidden layers with 256 neurons each, and an output layer of 1–2 units.
For the 3D case, the total parameter count is: input to Hidden 1: 8 × 256 + 256 = 2304; Hidden 1 to Hidden 2: 256 × 256 + 256 = 65,792; Hidden 2 to Output: 256 × 2 + 2 = 514; total ≈ 68,600 parameters. The 2D actor contains approximately 66,000 parameters. This corresponds to a memory footprint of about 270 kB in 32-bit floating-point format, which is well within the limits of compact embedded avionics architectures.
The computational cost per forward pass is also modest. For the 3D actor network, the multiply-accumulate (MAC) count is approximately 68,000 operations, equivalent to roughly 140 kFLOPs when counting multiplies and additions separately. At a conservative 100 Hz guidance update rate, the total computational load becomes approximately 14 MFLOPS, which is significantly below the processing capability of widely available real-time embedded processors.
TI C6000 floating-point DSPs provide approximately 1 GFLOPS of single-precision performance (e.g., TMS320C6701 at 167 MHz) [
35].
ARM Cortex-R5/R7–based TI Hercules MCUs deliver hundreds of MFLOPS, with peak figures around 300–330 MFLOPS depending on configuration [
36].
These devices exceed the required throughput by more than an order of magnitude, demonstrating that the actor network can be executed comfortably within real-time constraints on commercially available embedded hardware. Importantly, only the inference stage of the SAC policy runs on the embedded processor. All computationally intensive operations (critic updates, backpropagation, and replay-buffer sampling) are performed offline on a workstation. This offline–online separation ensures that the high training cost does not burden the onboard avionics.
Based on these considerations, the proposed DRL-based guidance policy is expected to meet real-time implementation requirements and be readily exploitable on modern embedded processors used in aerospace and defense systems.
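The parameter and throughput figures quoted in this subsection follow directly from the stated layer sizes, as a quick arithmetic check confirms.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# 3D actor: 8 inputs -> 256 -> 256 -> 2 outputs, as stated in the text.
layers_3d = [(8, 256), (256, 256), (256, 2)]
params = sum(dense_params(i, o) for i, o in layers_3d)  # 2304 + 65792 + 514
macs = sum(i * o for i, o in layers_3d)                 # multiply-accumulates per forward pass
flops = 2 * macs                                        # multiplies and adds counted separately
load = flops * 100                                      # FLOPs per second at a 100 Hz update rate
memory_kb = params * 4 / 1000                           # 32-bit floats
```

The results (~68,600 parameters, ~140 kFLOPs per pass, ~14 MFLOPS at 100 Hz, ~270 kB of weights) match the figures in the text.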
5. Conclusions
In this study, a reinforcement learning-based guidance law was proposed to overcome the limitations of conventional methods in uncertain and noisy environments. The proposed method formulates both two-dimensional and three-dimensional tracking scenarios as Markov decision processes (MDPs), and the policy was trained using the soft actor–critic (SAC) algorithm to directly generate acceleration commands from noisy observations.
Unlike traditional guidance laws such as proportional navigation (PN), which rely solely on the line-of-sight (LOS) angular rate, the proposed method utilizes a richer set of state variables, including relative distance, LOS angle, and angular rate. This allows it to effectively minimize the zero-effort miss (ZEM), especially in close-range tracking. The training environment incorporates Gaussian noise to simulate realistic sensor uncertainty, enabling the learned policy to generalize beyond its training distribution.
Extensive validation using the OpenAI Gym-based tracking simulator showed that the proposed method consistently outperformed the PN baseline. This advantage was especially evident under strict terminal constraints and high levels of observation noise. In both 2D and 3D simulations, the learned policy achieved high success rates, confirming its robustness, adaptability, and potential for practical deployment in UAV guidance systems.
Limitation and Future Work
The proposed SAC-based guidance framework shows strong performance in kinematic-level simulations. However, several limitations remain.
Observation disturbances are modeled as Gaussian noise. This does not fully represent real-world disturbances such as wind gusts, aerodynamic uncertainties, and actuator dynamics. In addition, aerodynamic nonlinearities and actuator saturation are not considered. Dynamic coupling between guidance and control is also neglected. These factors may affect performance under realistic flight conditions.
Future work will extend the framework to higher-fidelity environments. This includes aerodynamic effects, actuator dynamics, and environmental disturbances. Hardware-in-the-loop simulation tests will also be conducted to evaluate practical applicability.