Completing Explorer Games with a Deep Reinforcement Learning Framework Based on Behavior Angle Navigation

Abstract: In cognitive electronic warfare, when a typical combat vehicle, such as an unmanned combat air vehicle (UCAV), uses radar sensors to explore an unknown space, the target-searching fails due to an inefficient servoing/tracking system. Thus, to solve this problem, we developed an autonomous reasoning search method that can generate efficient decision-making actions and guide the UCAV as early as possible to the target area. For high-dimensional continuous action space, the UCAV’s maneuvering strategies are subject to certain physical constraints. We first record the path histories of the UCAV as a sample set of supervised experiments and then construct a grid cell network using long short-term memory (LSTM) to generate a new displacement prediction to replace the target location estimation. Finally, we enable a variety of continuous-control-based deep reinforcement learning algorithms to output optimal/sub-optimal decision-making actions. All these tasks are performed in a three-dimensional target-searching simulator, i.e., the Explorer game. Please note that we use the behavior angle (BHA) for the first time as the main factor of the reward-shaping of the deep reinforcement learning framework and successfully make the trained UCAV achieve a 99.96% target destruction rate, i.e., the game win rate, in a 0.1 s operating cycle.


Introduction
The cognitive degree of electronic warfare depends on the adaptability of the autonomous decision-making system of combat vehicles to various tasks. Taking an unmanned combat air vehicle (UCAV), which possesses the strongest maneuverability, as an example, target-searching and target-tracking are two key tasks for achieving precise target strikes. Moreover, target-searching is the preliminary task of target-tracking, and its importance is self-evident. This high-level task is facilitated through the integration of radar sensors with limited detection range, efficient autonomous decision-making systems, multiple intelligent navigation policies, and a representation of situational awareness. To reduce the consumption of labor and material resources, a major improvement can be achieved by employing intelligent UCAV systems for autonomous search tasks.
Although the concept of cognitive electronic warfare (CEW) has become popular in recent years, there is still a lack of approaches or frameworks to enable UCAVs to search for unknown targets without prior knowledge. This lack is mainly caused by the following two factors: (1) due to the sensitivity of field test data and the complexity of traditional simulators, it is difficult to obtain adequate training/testing data, and (2) the essence of target-searching is probabilistic path planning. If the mission area is too spacious or the operating cycle of the UCAV is too short, then the high maneuverability of the UCAV and the sparsity of the battle environment will make it very difficult for the mobile platform to output a feasible end-to-end motion strategy. In fact, even for autonomous robots in 2D space, exploring an unknown workspace and recognizing the target based on its physical properties (location, velocity, volume, and so forth) is a formidable challenge.
For factor 1, we use Explorer as the flight simulator of a UCAV. Explorer is a game based on the versatile CEW (vCEW) framework, so the sensors set by each member of the game have typical radar characteristics; the interaction between the radar sensors and the game environments is detailed in Reference [1]. To simplify the model, no soft/hard-killing weapons (such as jammers and missiles) are introduced in this version of the game. In the original version of this game, a player aims to find the target defender, i.e., an observation station, in a given 3D mission map by controlling a UCAV within a limited perspective. With Explorer, we can randomly generate many trajectory samples. Please note that although Explorer is not a game full of confrontation, we still need its combat vehicles (as control agents) to possess excellent maneuverability and a certain radar signal processing capability. Therefore, in this paper, the term unmanned aerial vehicle (UAV) is qualitatively described as UCAV for distinction. Actually, the method proposed in this paper for UCAV control is also applicable to conventional UAVs.
For factor 2, because all states and behaviors of the UCAV in Explorer can be characterized and the simulation environment can provide rich and direct multiagent interactions, we consider using deep reinforcement learning (DRL) algorithms to obtain the optimal control strategy that is approximately end-to-end for the UCAV. At this time, the input of the system is the UCAV's observation state, and the output is the planned acceleration vector of the UCAV.
In widely used navigation strategies, vector-based navigation (VBN) is primarily used in target-searching/tracking, which is a traditional goal-driven navigation method. For long-term trajectory planning, VBN is designed to simply become closer to the target as soon as possible; thus, other auxiliary algorithms, such as dynamic planning, are required to formulate the UCAV's actions at any time. Unfortunately, when experimenting with the DRL framework that we designed, we found that the ability to complete the Explorer game is quite limited when using the feedback of pure goal-driven navigation as a reward signal.
The organization of this paper is as follows: In Section 2, related works about the development of DRL and CEW are discussed. Section 3 explains the control principle of a UCAV in Explorer. A novel target-searching approach with the DRL framework is proposed in Section 4. Then, experiments designed at different levels are given in Section 5. Finally, Section 6 presents the conclusion and future work.

Related Works
We first introduce some advanced technologies involved in CEW and then highlight some new developments in DRL, especially for engineering applications.

Target-Searching in CEW
For years, the research and popularization of advanced electronic warfare has been in great demand. Novel combat vehicles, intelligent autonomous control systems, and powerful detection sensors simultaneously promote the development of various technologies. Among such technologies, researchers are striving to optimize the cognitive awareness of UCAVs, as key combat agents, of their circumstances. At the communication end, the radar transmitter is used as the most basic detection sensor. By combining software and hardware signal processing methods, the interference (jamming) signal in the noisy channel can be distinguished to expand the detection range of combat vehicles [2,3].
A UCAV evaluates the environmental situation depending on its detection results. At this time, the physical states of the flying objects in space, including the spatial position, fuselage posture, and flight speed, are the indicators that are of great importance. The relevant details of the established threat model can be found in [4][5][6]. Moreover, for a combat platform, due to the intricate mission space, gridding the map can effectively reduce the difficulty of processing the situational information [7].
Furthermore, in combination with the UCAV historical path, the target position is inferred by Bayesian estimation or similar methods [8][9][10]. Then, real-time or non-real-time path planning algorithms are developed according to various combat requirements [11][12][13]. The ultimate goal is to approach the target and confirm the identity of the target as soon as possible.
In fact, completing the target-searching task in partially observable environments has been a hot topic in the field of autonomous robotics [13][14][15]. However, the lack of a simple and clear three-dimensional (3D) space simulator and the corresponding continuous control algorithms for autonomous navigation is a key factor restricting the development of CEW [16].

Deep Reinforcement Learning
Based on the deep Q-network (DQN), DRL has shown excellent performance in the classic Atari 2600 games and demonstrated the potential to transcend human beings [17]. At the same time, the DeepMind team employed a DRL-based agent, AlphaGo, to play Go and even defeated the current world champion [18]. Since then, research on DRL algorithms and engineering applications has begun to increase in various areas. Double DQN (DDQN) can solve the overestimation problem caused by the greedy strategy used in the DQN [19]. To enhance the learning efficiency, the authors of [20] designed a sampling policy named prioritized experience replay instead of using uniform sampling, which assigns larger sampling weights to the states with more to learn from. The use of dueling DQN is quite suitable when the specificity of the agent's behavior to the state is weak [21]. Additionally, a series of advanced algorithms, such as normalized advantage functions (NAF) [22], asynchronous advantage actor-critic (A3C) [23,24], deep deterministic policy gradient (DDPG) [25], proximal policy optimization (PPO) [26], and soft actor-critic (SAC) [27], which have been developed to solve continuous control problems, have attracted considerable attention.
The DRL-based approach also exhibits unprecedented superiority in most engineering applications. However, we are more concerned with how reinforcement learning performs in the field of intelligent navigation and the metrics that can be achieved. Banino et al. [28] found that DRL could enable an intelligent agent to learn VBN strategies using grid-like representations from the trained networks. Simultaneous localization and mapping (SLAM) schemes based on visual sensors have achieved amazing results due to the introduction of DRL approaches [29]. In an attack-defense pursuit-warfare of multiple UCAVs, the authors in [30] used DQN to generate air-combat strategies. However, to simplify the dynamic maneuvering model, they sliced the UCAV's actions into a fixed two-group mode and limited the battle space to a two-dimensional plane.
Based on the above, building a framework that adapts to multiple DRL algorithms to guide the UCAV to learn the maneuvering strategies of searching targets in 3D space and analyzing the causes of these actions (or how the UCAV understands these actions) is the core contribution of this work.

Problem Formulation Based on Explorer
In this section, considering the model dependence of the DRL framework, we first introduce the game environment used to simulate search targets, i.e., Explorer. Then, the operation of the entire Explorer system, including the states, actions, and game goals of the UCAV, is described. Finally, the planning process of the aircraft's actions is specifically discussed. This work mainly refers to our previous research results [1] and provides many significant improvements.

Game Environment
The display interface of the Explorer game is presented in Figure 1a. In this game, all we need is to control a UCAV and find an enemy observation station in a mission space of 7.5 km in height (from 0.5 km to 8 km above the ground) and 15 km in length and width (both from −7.5 km to 7.5 km). Explorer assumes that the reconnaissance sensor equipped by each combat vehicle is a radar emitter with a limited detection range. The UCAV's reconnaissance perspective at each operating cycle is illustrated in Figure 1b. Please note that Explorer uses the geocentric coordinate system, OXYZ, as the base reference frame, and in the space, the coordinates of any point are recorded as $p = [x, y, z]^T$ or $p = [x, y, h]^T$. The formula for transforming between the height from the ground and the Z-axis coordinate of any point in space is
$$z = \sqrt{(R_e + h)^2 - x^2 - y^2} \quad (1)$$
where $R_e$ is the Earth's radius (6371 km) and $h$ is the radial height of the object from the Earth. The maps in Figure 1a,b are obtained by projecting the game space in the opposite direction of the Z-axis.
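As a rough illustration of the height/Z-axis conversion described above, the following Python sketch assumes that a point at radial height h lies at distance R_e + h from the Earth's center and that x and y are its horizontal coordinates in the geocentric frame. The function names are hypothetical and are not part of Explorer.

```python
import math

R_E_KM = 6371.0  # Earth's radius in km, as stated in the text


def height_to_z(x_km, y_km, h_km, r_e=R_E_KM):
    """Z-axis coordinate of a point with horizontal coordinates (x, y) and radial height h.

    Assumption: the point lies on a sphere of radius r_e + h around the origin.
    """
    return math.sqrt((r_e + h_km) ** 2 - x_km ** 2 - y_km ** 2)


def z_to_height(x_km, y_km, z_km, r_e=R_E_KM):
    """Inverse conversion: radial height above the Earth's surface."""
    return math.sqrt(x_km ** 2 + y_km ** 2 + z_km ** 2) - r_e
```

At the map center (x = y = 0), the conversion reduces to z = R_e + h, so the mission space between 0.5 km and 8 km of height is a thin shell far from the coordinate origin.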

Motion State
At sampling time $t$, the motion state of the UCAV is defined as $U_t = [\mathrm{CON}(p_{u,t}^T, v_{u,t}^T, a_{u,t}^T)]^T$, where $v_{u,t} = [v_{ux,t}, v_{uy,t}, v_{uz,t}]^T$ and $a_{u,t} = [a_{ux,t}, a_{uy,t}, a_{uz,t}]^T$ represent the velocity and acceleration of the UCAV, respectively, and CON represents a function used to concatenate vectors. Because the observation station in Explorer remains static, it can sufficiently be described using the position, $p_{a,t}$. For convenience, in the remainder of this paper, the variables with subscripts $a$ and $u$ are defined as belonging to the target observation station and UCAV, respectively.
Please note that because the position of the target can only be captured or predicted by the UCAV's radar, we use $p^{o}_{a,t}$ or $\hat{p}^{o}_{a,t}$ to indicate the estimated observation of the UCAV. The specific estimation methods will be detailed in Section 3.2.

Action
For each operating cycle, the manipulation of the UCAV's maneuvering in Explorer can be simply divided into the following three steps:
(1) Input the expected direction of motion, $A_t$, which is defined by the UCAV's pitching and rotation angles, i.e., $A_t = [\phi_t, \vartheta_t]$.
(2) By using the approach in [1], $A_t$ and certain physical constraints, including the maximum normal overload, maximum radial overload, and maximum velocity of the UCAV, are used to calculate the maximum acceleration that the UCAV can produce in the given direction; we denote it as $\hat{a}_{u,t}$.
(3) By using the uniform acceleration matrix, $\Phi_{CA}$, as the motion control matrix, the next motion state of the UCAV, $U_{t+1}$, can be obtained by multiplying $U_t$ and $\Phi_{CA}$:
$$U_{t+1} = \Phi_{CA} U_t, \qquad \Phi_{CA} = \begin{bmatrix} I & \tau I & \frac{\tau^2}{2} I \\ 0 & I & \tau I \\ 0 & 0 & I \end{bmatrix} \quad (2)$$
where $\tau$ is the length of the sampling cycle, $I$ is the third-order unit matrix, and $0$ is the third-order zero matrix.
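The three steps above can be sketched as follows. This is a minimal NumPy illustration of the constant-acceleration update in step (3) only. It assumes that the state is ordered as position, velocity, acceleration, and that the planned acceleration simply overwrites the acceleration block before the update. The constraint handling of step (2) from [1] is not reproduced, and the function names are hypothetical.

```python
import numpy as np


def ca_matrix(tau):
    """9x9 constant-acceleration transition matrix for a sampling cycle tau."""
    I = np.eye(3)
    Z = np.zeros((3, 3))
    return np.block([
        [I, tau * I, 0.5 * tau ** 2 * I],
        [Z, I, tau * I],
        [Z, Z, I],
    ])


def step(U, a_hat, tau):
    """Advance the motion state U = [p, v, a] by one operating cycle.

    Assumption: the planned acceleration a_hat replaces the stored
    acceleration before the transition matrix is applied.
    """
    U = np.asarray(U, dtype=float).copy()
    U[6:9] = a_hat
    return ca_matrix(tau) @ U
```

For example, with tau = 0.1 s, a UCAV flying at 10 units/s along X that plans an acceleration of 2 units/s^2 along Y moves 1.0 along X and 0.01 along Y in one cycle.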

Reward Shaping
For a DRL framework, the careful shaping of the game rewards plays an important role in the target model. In Explorer, we consider reward-shaping from a behavior-driven perspective, which involves the following three key factors:
(1) Because the target observation station must be within a given task space, the UCAV should be penalized when it tends to escape this space. Thus, the following penalty function can be defined:
where clip(a, [0, 1]) is a clipping function used to control the parameter a to satisfy the boundary constraint, i.e., not less than 0 and not greater than 1. Define $V_{max}$ as the maximum speed of the UCAV; then $l = V_{max}\tau$ is the maximum travel distance of the UCAV in one cycle.
(2) When the UCAV is not equipped with a hard-killing weapon, it is worth encouraging the use of a suicide attack that collides with the observation station. Therefore, we directly use a reward signal with a value of 100 to motivate this particular offensive behavior.
(3) A continuous reward signal needs to be employed to guide the UCAV to learn efficient navigation strategies when training the DRL framework. Restricted by the detection range of the radar, the normalized relative displacement is adopted to describe the tracking and approaching of the target by the UCAV, which can be computed as follows:
where $\bar{\cdot}$ indicates the normalization operator and $(\bar{p}_{a,t} - \bar{p}_{u,t}) = [(x_{a,t} - x_{u,t})/15, (y_{a,t} - y_{u,t})/15, (h_{a,t} - h_{u,t})/7.5]$ denotes the normalized relative displacement between the target and the UCAV. $\rho_{max}$ is the maximum distance that the UCAV's radar can reach.
The united reward of the UCAV at any time is equal to the sum of the three rewards defined above, i.e., $R_t = {}^{escape}R_t + {}^{tracking}R_t + (0 \text{ or } 100)$.

Supplement
Explorer is based on the vCEW framework, which includes many unique designs about radar equations, such as using the line-of-sight (LOS) angle and the equivalent detection distance (related to the radar cross section) to determine whether the station is within the field of view of the UCAV, assessing the collision caused by objects, and analyzing the stage progression of the radar based on a parametric data processing system (PDPS). The implementation of these model details can be found in [1].

Observation Estimation and Prediction
In a typical partially observable Markov decision process (POMDP), the UCAV receives feedback signals from partially observable environments during flight, and after a successful target search, it stores the information of the target in its memory bank. In Explorer, we allowed the PDPS to filter and predict the target's motion states, but when the radar sensor cannot perceive the target, other approaches, such as variational Bayesian methods and neural networks (NNs), are considered to predict the point where the target is most likely to appear.
The configuration parameters of the original version of Explorer are shown in Table 1. The game duration of each round is fixed at 400 s. Based on different lengths of the operating cycle, the difficulty of the search task can be divided into four levels: (1) the operating cycle corresponding to level 0 is 1 s, and the maximum number of operational steps is 400; (2) the operating cycle corresponding to level 1 is 0.5 s, and the maximum number of operational steps is 800; (3) the operating cycle corresponding to level 2 is 0.2 s, and the maximum number of operational steps is 2000; and (4) the operating cycle corresponding to level 3 is 0.1 s, and the maximum number of operational steps is 4000. Clearly, the shorter the operating cycle is, the more complicated the motion planning of the autonomous navigation becomes. The introduction of noise causes a deviation between the observed state and the actual state of the UCAV, which will affect the estimation/prediction accuracy and further affect the action decision-making of the UCAV.
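The relationship between the operating cycle and the step budget of each difficulty level can be checked with a few lines of Python. The constant and function names below are illustrative only and do not come from the Explorer code base.

```python
GAME_DURATION_S = 400.0  # fixed duration of one round, as stated in the text

# Operating cycle (seconds) for each difficulty level, as listed in the text.
OPERATING_CYCLE_S = {0: 1.0, 1: 0.5, 2: 0.2, 3: 0.1}


def max_steps(level):
    """Maximum number of operational steps in one round at the given level."""
    return round(GAME_DURATION_S / OPERATING_CYCLE_S[level])
```

The step budget grows tenfold from level 0 to level 3, which is one way to see why sparse or delayed rewards become harder to learn from at the highest level.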
As mentioned in Section 3.1.1, the prediction state can be represented by the station's location $\hat{p}^{o}_{a,t}$. This differs from our previous work, which employed a generative network to learn the probability distribution of the feedback state and estimated the prediction based on a variational Bayesian network. However, as verified by a large number of simulations [28], as long as the navigation goal is accurate (reaching the target field), we can consider directly using an NN to generate grid cell prediction agents, which supports route planning across unexplored spaces. Please note that before the UCAV begins to learn the state representation based on the grid network, arranging supervised learning experiments to implement path integration is essential.

Path Integration
The grid cell network of the agent, which is based on the long short-term memory (LSTM) (denoted as "grid LSTM"), was implemented in a supervised learning setup by using the cumulative trajectories. The simulated UCAV started at a uniformly sampled position and motion direction within their ranges. For each task's UCAV, the motion model based on Equation (2) is used to obtain trajectories that cover the entire environment, where boundary clipping is necessary to avoid persistent collisions with environmental boundaries.

Cell Activations
Ground-truth place cell distribution and head-direction cell distribution are designed to create grid cells using a linear combination. For a given position $p_u \in \mathbb{R}^3$, place cell activations, $c \in [0, 1]^N$, were simulated by the posterior probability of each component of a mixture of three-dimensional isotropic Gaussians,
$$c_i = \frac{\exp\left(-\frac{\|p_u - \mu_i^{(c)}\|_2^2}{2(\sigma^{(c)})^2}\right)}{\sum_{j=1}^{N} \exp\left(-\frac{\|p_u - \mu_j^{(c)}\|_2^2}{2(\sigma^{(c)})^2}\right)} \quad (5)$$
where the place cell centers $\mu_i^{(c)}$ are $N$ three-dimensional vectors selected uniformly at random before training, and the place cell scale $\sigma^{(c)}$ is a positive scalar specified by experiments.
Similarly, for a given motion angle $A$, head-direction cell activations, $h \in [0, 1]^M$, were represented by the posterior probability of each component of a mixture of von Mises distributions,
$$h_i = \frac{\exp\left(\kappa^{(h)} \cos(A - \mu_i^{(h)})\right)}{\sum_{j=1}^{M} \exp\left(\kappa^{(h)} \cos(A - \mu_j^{(h)})\right)} \quad (6)$$
where the $M$ head-direction centers $\mu_i^{(h)}$ are chosen uniformly from $[-\pi, \pi]$ at random before training, and the concentration parameter $\kappa^{(h)}$ is a positive scalar specified by experiments.
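The two activation models above can be sketched in NumPy as softmax-normalized responses. This is only an illustration under stated assumptions: the mixture components are taken to be equally weighted, the head-direction input is treated as a single scalar angle (the text's motion angle has two components, and how they are combined is not specified here), and the function names are hypothetical.

```python
import numpy as np


def place_cell_activations(p, centers, sigma):
    """Posterior of each isotropic Gaussian component for position p.

    p: shape (3,), centers: shape (N, 3), sigma: positive scalar.
    """
    sq_dist = np.sum((centers - p) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def head_direction_activations(angle, centers, kappa):
    """Posterior of each von Mises component for a scalar angle (assumption)."""
    logits = kappa * np.cos(angle - centers)
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()
```

Both functions return a probability vector, which is what allows them to serve as targets for the cross-entropy objective used to train the grid cell network.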

Grid Cell Network Architecture
As shown in Figure 2, the grid cell network architecture consists of three layers: a recurrent layer (an LSTM with 128 hidden units), a linear layer, and an output layer. The input of the LSTM is a composite vector constructed from the normalized position and velocity of the UCAV, which can be expressed as $\mathrm{CON}(\bar{p}_{u,t}, \bar{v}_{u,t})$, where $\bar{p}_{u,t} = [(x_{u,t} + 7.5)/15, (y_{u,t} + 7.5)/15, (h_{u,t} - 0.5)/7.5]$ is the UCAV's normalized position in Explorer space, and $\bar{v}_{u,t} = [v_{ux,t}, v_{uy,t}, v_{uz,t}]/V_{max}$ denotes the normalized velocity. The linear layer is regularized by dropout [31] and is used to map place and head-direction units, and the output layer is used to generate predictions for place cells and head-direction cells.
The output of the LSTM (hidden state, $m_t$) is connected to the input of a linear decoder, which produces the linear layer activations, $g_t \in \mathbb{R}^{512}$. The output layer maps $g_t$ to the predicted place cells, $c_{p,t}$, and the predicted head directions, $c_{h,t}$, via SoftMax functions, and dropout [31] with a drop probability of 0.5 was applied to each $g_t$ unit. Thus, there are three sets of weights and biases that need to be optimized.
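To make the decoder path concrete, the following NumPy sketch maps an LSTM hidden state to the linear layer activations and the two SoftMax outputs. It is a sketch under assumptions, not the authors' implementation: the LSTM itself is omitted, the numbers of place cells and head-direction cells (256 and 12 here) are hypothetical placeholders, and the reading of the three sets of weights as the linear layer plus the two output heads is one plausible interpretation.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def init_decoder(rng, n_hidden=128, n_linear=512, n_place=256, n_head=12):
    """Randomly initialized weights; n_place and n_head are hypothetical."""
    def layer(n_out, n_in):
        return rng.normal(0.0, 0.05, (n_out, n_in)), np.zeros(n_out)
    W_g, b_g = layer(n_linear, n_hidden)
    W_c, b_c = layer(n_place, n_linear)
    W_h, b_h = layer(n_head, n_linear)
    return {"W_g": W_g, "b_g": b_g, "W_c": W_c, "b_c": b_c, "W_h": W_h, "b_h": b_h}


def decode(m_t, params, rng=None, drop_prob=0.5):
    """Map the hidden state m_t to (g_t, place prediction, head prediction).

    Passing an rng enables (inverted) dropout on g_t, as during training.
    """
    g_t = params["W_g"] @ m_t + params["b_g"]
    if rng is not None:
        keep = rng.random(g_t.shape) >= drop_prob
        g_t = g_t * keep / (1.0 - drop_prob)
    c_p = softmax(params["W_c"] @ g_t + params["b_c"])
    c_h = softmax(params["W_h"] @ g_t + params["b_h"])
    return g_t, c_p, c_h
```

At evaluation time, `decode` would be called without an `rng` so that dropout is disabled and `g_t` can be passed on as part of the observation state.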

Objective Function
At each time step $t$, the place and head-direction cell ensemble activations can be predicted via training the grid network. For each task, the network is trained in a fixed environment where the place cell centers remain unchanged. The objective function, $L$, for optimizing the parameters of the grid cell network is to minimize the cross-entropy between the network place cell predictions, $c_{p,t}$, and the synthetic place cell targets, $c_t$, and the cross-entropy between the head-direction predictions, $c_{h,t}$, and their targets, $h_t$:
$$L = -\sum_{i=1}^{N} c_{t,i} \log c_{p,t,i} - \sum_{j=1}^{M} h_{t,j} \log c_{h,t,j} \quad (7)$$
Backpropagation through time is used to calculate the gradients of the objective function relative to the network parameters, and these parameters are updated using stochastic gradient descent. The network is unrolled into blocks of 200 time steps. Finally, in the UCAV's observation state, $S_t$, the position prediction of the estimated target can be replaced by $g_t$, i.e., $S_t^{(pre)} = \mathrm{CON}(\bar{p}_{u,t}, \bar{v}_{u,t}, (\bar{p}_{u,t} - \hat{p}^{o}_{a,t})) \Rightarrow S_t = \mathrm{CON}(\bar{p}_{u,t}, \bar{v}_{u,t}, g_t)$.
Please note that this paper mainly draws on the method in Reference [28] when training the grid cell network. Equations (5)-(7) have not considerably changed except for rewriting the symbols.
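The training objective described in this section, a sum of two cross-entropy terms, can be written compactly as below. This is a minimal NumPy sketch for a single time step; the small constant `eps` is an assumption added only to avoid taking the logarithm of zero, and the function name is hypothetical.

```python
import numpy as np


def grid_loss(place_pred, place_target, head_pred, head_target, eps=1e-12):
    """Cross-entropy of place cells plus cross-entropy of head-direction cells."""
    place_pred = np.asarray(place_pred, dtype=float)
    head_pred = np.asarray(head_pred, dtype=float)
    ce_place = -np.sum(np.asarray(place_target) * np.log(place_pred + eps))
    ce_head = -np.sum(np.asarray(head_target) * np.log(head_pred + eps))
    return float(ce_place + ce_head)
```

In training, this quantity would be accumulated over the unrolled block of time steps before the gradients are backpropagated.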

Behavior-Angle-Based Reward
For autonomous navigation, some special indicators are used to assess the learning behavior of the combat vehicle, such as the behavior angle (BHA). As shown in Figure 3, in Explorer, the BHA is the angle between the UCAV's maneuvering direction and the LOS direction $n_t$, which can be represented by $\vartheta_t = \langle \hat{a}_{u,t}, n_t \rangle$. The BHA can be calculated to observe the movement trend of the UCAV. For convenience of analysis, $\vartheta$ is further normalized to $\bar{\vartheta} = \vartheta/\pi \in [0, 1]$; when it is greater than 0.5, the UCAV is considered to be moving away from the target, and vice versa. Section 3.1.3 defines rewards for different attributes according to the different navigation purposes in the reward-shaping and then weights them together. Based on the conclusion drawn from Reference [1], the behavioral reward of approaching (or colliding with) the target allows the UCAV to learn that the navigation policy is to maintain a BHA close to the target field. Although this reward design enables the UCAV to perform the task well in an Explorer game with a difficulty level of 2 (the definition of the game difficulty can be found in Section 3.2), it failed in the task with a difficulty level of 3, i.e., the highest level. After a thorough analysis and experiments, the following two aspects are considered to improve the learning system:
(1) Reference [1] uses a sparse positive reward to induce the UCAV to collide with the target and uses a continuous negative reward to prevent the UCAV from being too far from the target. However, we find that distance-based reward-shaping is information-redundant for guiding UCAV navigation. This is because the UCAV's velocity is the first-order difference of the distance; in VBN, the planned acceleration can achieve the most effective control of speed and head direction. Therefore, for the UCAV to extract the behavior of adjusting acceleration from the behavior of adjusting distance, the DRL algorithm is required to possess high sample efficiency.
(2) Because the motion model of the object in Explorer is based on vCEW, only the direction of the maneuver needs to be planned to obtain the desired acceleration. To fully use this advantage, we hope to reduce the information redundancy of training samples by modifying the reward function and then generally improve the performance of learning algorithms operating in Explorer.
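The normalized BHA described above can be computed from the planned acceleration and the UCAV-to-target displacement. The sketch below is an illustration only: it assumes that the LOS direction points from the UCAV to the target, that both vectors are nonzero, and that the angle between two vectors is obtained from their normalized dot product. The function names are hypothetical.

```python
import numpy as np


def normalized_angle(u, v):
    """Angle between vectors u and v, divided by pi so it lies in [0, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)


def behavior_angle(a_hat, p_u, p_a):
    """Normalized BHA between the maneuvering direction and the LOS direction."""
    los = np.asarray(p_a, dtype=float) - np.asarray(p_u, dtype=float)
    return normalized_angle(a_hat, los)
```

A value below 0.5 indicates that the maneuver has a component toward the target, and a value above 0.5 indicates that the UCAV is steering away from it. The same helper applies to the facing angle if the velocity is passed in place of the acceleration.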
For (1), we introduced the SAC algorithm in the DRL framework. For (2), as long as the BHA is less than 90° and the speed is greater than 0, there is always a way for the agent to arrive in the mission area. Using the facing angle $\langle v_{u,t}, n_t \rangle \in [0, \pi]$ to indicate the motion state of the UCAV relative to the target, we modify the tracking reward function represented by Equation (4) to the following:
and the composite reward function keeps the form $R_t = {}^{escape}R_t + {}^{tracking}R_t + (0 \text{ or } 100)$ with the modified tracking term. Simultaneously, we supplement a pure reward function of the same composite form to create a set of comparison simulations, where the tracking term is
Finally, the incentive process of navigation behavior using the different reward functions (facing-angle-based, relative-displacement-based, and pure) can be analyzed via Figure 4a-c:
(1) When using the facing angle shown in Figure 4a to evaluate the accuracy of the search/tracking, the UCAV will directly understand the effect of acceleration from the changes in angular velocity. It is very easy to learn a VBN policy when the facing angle is less than 90°.
(2) As shown in Figure 4b, when using the relative displacement between the target and the UCAV as a motivating factor, the UCAV will understand the effect of velocity on navigation from the short-term changes in distance, while the resulting policies of motion should be only a by-product of acceleration planning (although the direction of the planned acceleration is what we most expect). Therefore, the training information sampled in this way is insufficient.
(3) Figure 4c shows that the UCAV is awarded tracking and collision rewards only after entering the target field, and there is no continuous feedback signal to guide the UCAV's action. Although the UCAV can explore more behavior strategies in this way, the generated training samples will carry substantial redundant information. Therefore, this reward-shaping method has high-performance requirements for the DRL algorithm, particularly in the case of a large amount of sampled data.

Deep Reinforcement Learning Framework for Continuous Control
DRL can carefully choose the feature representations through autonomous learning based on deep learning; thus, the agent can learn more effectively to take actions to maximize the cumulative returns through interaction with the environment. In the domain of continuous control, where actions are continuous and often high-dimensional, practical engineering, such as intelligent navigation, requires a high capability of DRL algorithms, and we argue that there is no unique DRL algorithm that is efficient for all scenarios. With recent progress, based on the actor-critic (AC) framework shown in Figure 5, the DRL algorithms with a policy gradient (PG) as their core have advantages in convergence and efficiency. Therefore, this work integrates DDPG, PPO, A3C and SAC into a DRL framework for the Explorer game, in which the implementation of the PPO algorithm is divided into two types: penalty-based and probability-clipping-based.

Preliminaries
In this section, we define the notation used in the subsequent sections. Commonly, the navigation task model in Explorer conforms to a finite-horizon discounted Markov decision process (MDP), which can be defined by the tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma, )$, where $\mathcal{S}$ denotes the agent's state space of size $|\mathcal{S}|$, $\mathcal{A}$ represents the action space of size $|\mathcal{A}|$, $P$ represents the transition probability distribution of the state, $R$ is the reward function, and $\gamma \in (0, 1]$ is the discount factor that weights future rewards. Most of the algorithms implemented in our DRL framework optimize a stochastic policy $\pi_\theta \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}_{\geq 0}$ via a network $\theta$ (or networks). For a continuous task, at time step $t$, the DRL-based agent first obtains the observed state, $S_t \in \mathcal{S}$; the agent then selects the optimal or sub-optimal action $A_t \sim \pi(A_t|S_t)$ from the policy function, and finally receives the reward $R_t$ from the environmental feedback and prepares to enter the next state $S_{t+1} \sim P(S_{t+1}|S_t, A_t)$ after interacting with the environment. Multiple state transition pairs are concatenated to obtain the whole trajectory of the agent, which can be denoted as $\varphi = \{\ldots(S_{t-1}, A_{t-1}, R_{t-1}, S_t), (S_t, A_t, R_t, S_{t+1})\ldots\}_{t=0}^{T}$. Therefore, the discounted sum of future rewards is used to define the expected state-value function for $\pi$ from $S_t$, as follows:
$$V^{\pi}(S_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\Big|\, S_t\right]$$
Similarly, the state-action value function, $Q$, is defined as follows:
$$Q^{\pi}(S_t, A_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\Big|\, S_t, A_t\right]$$
To output continuous actions, the following PG theorem can be used to update the network:
$$\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(A_t|S_t)\, V_t$$
where $V_t$ represents the unbiased sample of $Q^{\pi}(S_t, A_t)$, and $\alpha$ is the learning rate. Denote the parameters of the actor and critic networks in the AC as $\theta^{\mu}$ and $\theta^{Q}$, respectively. Please note that for the DRL algorithms of AC structure, the action $A_t$ is chosen by a policy network $\mu_\theta$, i.e., $A_t = \mu_\theta(S_t) + N_t$, where $N_t$ is the artificially introduced exploration noise.
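As a small worked example of the discounted sum of future rewards used in the value functions above, the following Python sketch computes the sampled return for every step of one finished trajectory. It is a generic illustration rather than code from the framework, and the function name is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = R_t + gamma * R_{t+1} + ... for every step of a trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For rewards [1, 1, 1] and a discount factor of 0.5, the returns are [1.75, 1.5, 1.0]. Such sampled returns can play the role of the unbiased sample of the state-action value in the policy gradient update.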

Algorithms
In this section, we briefly summarize the algorithms implemented in our Explorer environment and note any techniques for applying them to general parameterized modules, including the following.
Memory replay: For each iteration, the UCAV generates $N$ trajectories $\{\varphi_n\}_{n=0}^{N}$, which contain state transition pairs sufficient for DRL training. However, due to the continuity of the trajectory, the sampled data generated are highly correlated. Thus, if on-policy training is adopted, the DRL neural network will output an unstable action decision [17] and a divergent accumulative reward. A memory pool plays an important role in decorrelating these samples by storing and sampling the transition pairs. Unique sampling methods such as [20] are generally used.
Dual networks: In the training process, using NNs to fit $V^{\pi}$ or $Q^{\pi}$ for DRL is very unstable. Reference [32] introduces a target network $\theta^{\mu'}$ (or $\theta^{Q'}$) to make the evaluated network of the DRL training independent of the target network. It also recommends that the target network be updated by a delayed update or a soft update. The formula for the soft update is
$$\theta' \leftarrow \eta\,\theta + (1 - \eta)\,\theta'$$
where $\eta \in (0, 1]$ is the update rate.
Action clipping: The physical state involved in the engineering model is generally constrained, and the DRL's actor network tends to output a value that exceeds the boundary of the agent's motion state. There are two corresponding solutions: clipping the action according to the boundary constraints or designing an ideal transformation function at the output of the network.
Cross-entropy method (CEM) [33]: Unlike the previous method of exploring through random actions, CEM explores directly in the policy parameter space. First, the outputs $\mu_k$ and $\sigma_k$ of the actor network are designed to form the Gaussian distribution space $\theta^{\mu}_n \sim \mathcal{N}(\mu_k, \sigma_k^2)$. Then, in each iteration, the output is sampled according to $\theta^{\mu}_n$, and the current policy distribution of the action is evaluated. Finally, the network is optimized by maximizing the entropy.
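Three of the techniques summarized above (memory replay, the soft update of a target network, and action clipping) are simple enough to sketch in a few lines. The following Python code is a generic illustration under assumptions: sampling is uniform rather than prioritized, the soft-update rate is passed in as a hypothetical `rate` parameter, and the class and function names are not taken from the framework.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-capacity memory pool with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation within a trajectory.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def soft_update(target_params, online_params, rate):
    """Move each target parameter a small step toward the online parameter."""
    return [(1.0 - rate) * t + rate * o for t, o in zip(target_params, online_params)]


def clip_action(action, low, high):
    """Keep the actor output inside the physical bounds of the action space."""
    return np.clip(action, low, high)
```

A small `rate` (for example 0.01) makes the target network change slowly, which is the purpose of the soft update. Clipping is the simpler of the two action-bounding options mentioned in the text; a squashing function at the network output would be the alternative.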

Advantage function: The basic idea of the PG method in updating the policy is to increase the probability of actions with a large reward and reduce the probability of actions with a small reward. Suppose a scenario in which the reward is positive at all times and the reward is set to 0 for the actions that are not sampled. In this scenario, if a good action has not been sampled and the sampled action is bad and obtains a small positive reward, then the probability of the good action that is not sampled will become increasingly smaller, which is clearly inappropriate. Thus, creating a reward baseline and balancing the positive and negative rewards is necessary. This baseline can work through the following advantage function:
$$A^{\pi}(S_t, A_t) = Q^{\pi}(S_t, A_t) - V^{\pi}(S_t)$$
where $V^{\pi}(S_t)$ can be evaluated by a critic network.
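The effect of the baseline can be seen in a tiny numerical example. The sketch below assumes that sampled returns are used as estimates of the state-action value and that the critic supplies the state values; the function name is hypothetical.

```python
import numpy as np


def advantages(q_estimates, state_values):
    """Advantage = estimated state-action value minus the critic's state value."""
    return np.asarray(q_estimates, dtype=float) - np.asarray(state_values, dtype=float)
```

With returns [1, 5, 3] and a baseline of 3 for each state, the advantages are [-2, 2, 0]. Although all three returns are positive, only the above-average action is reinforced, and the below-average one is discouraged.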

Deep Deterministic Policy Gradient (DDPG)
DDPG [25] adopts a deterministic policy when updating the actor network, which is a very suitable approach for a system with a high-dimensional action space such as a UCAV. Moreover, DDPG applies gradient descent to the policy with minibatch data sampled from a replay pool: both the gradient for the critic network and the loss function that is backpropagated through the deep actor network are computed as averages over a minibatch, where B is the batch size.
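For reference, the DDPG updates as they are commonly written in the literature are sketched below. The target value $y_i$, the discount factor $\gamma$, and the primed target networks $\mu'$ and $Q'$ are assumed notation following the usual presentation of the algorithm; they may differ from the exact form used in this work.

```latex
\begin{aligned}
y_i &= R_i + \gamma\, Q'\big(S_{i+1},\, \mu'(S_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big), \\
L(\theta^{Q}) &= \frac{1}{B} \sum_{i=1}^{B} \big(y_i - Q(S_i, A_i \mid \theta^{Q})\big)^{2}, \\
\nabla_{\theta^{\mu}} J &\approx \frac{1}{B} \sum_{i=1}^{B}
  \nabla_{a} Q(S_i, a \mid \theta^{Q})\big|_{a = \mu(S_i \mid \theta^{\mu})}\,
  \nabla_{\theta^{\mu}} \mu(S_i \mid \theta^{\mu}).
\end{aligned}
```

The critic is regressed onto the bootstrapped target $y_i$, and the actor follows the gradient of the critic's value with respect to the action, both averaged over the B sampled transitions.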

Proximal Policy Optimization (PPO)
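PPO [26] is an improved algorithm based on trust region policy optimization (TRPO) [34]. Because the maximization of the objective function in TRPO is constrained by the size of the policy update, PPO replaces the constraint with a penalty term, and Schulman et al. propose the unconstrained policy optimization scheme

$$\max_{\theta}\ \hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(A_t \mid S_t)}{\pi_{\theta_{old}}(A_t \mid S_t)}\,\hat{A}^{\pi}(S_t, A_t) - \beta\,\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot \mid S_t),\ \pi_{\theta}(\cdot \mid S_t)\big]\right], \tag{18}$$

where $\theta_{old}$ is the vector of policy parameters before the update, $\hat{A}^{\pi}$ is the estimated advantage, and $\beta$ is the penalty coefficient. Denote the probability ratio $r_t(\theta) = \pi_{\theta}(A_t \mid S_t) / \pi_{\theta_{old}}(A_t \mid S_t)$. Considering that the selection of the coefficient $\beta$ is difficult, Equation (18) can be modified as

$$\max_{\theta}\ \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\,\hat{A}^{\pi}(S_t, A_t),\ \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\hat{A}^{\pi}(S_t, A_t)\big)\right], \tag{19}$$

where $\epsilon$ is a hyperparameter less than 1. For distinction, we denote the PPO algorithms based on Equations (18) and (19) as PPO (KL) and PPO (CLIP), respectively.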
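The clipped objective of PPO (CLIP), in the form given by Schulman et al., can be evaluated for a batch as in the following sketch. The function name and the sample numbers are illustrative assumptions; in practice the ratios and advantages would come from the actor network and the advantage estimator.

```python
def clipped_surrogate(ratios, advantages, epsilon):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        # Restrict the probability ratio to the interval [1 - eps, 1 + eps].
        r_clipped = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum gives a pessimistic bound on the unclipped
        # objective, so overly large policy steps are not rewarded.
        total += min(r * adv, r_clipped * adv)
    return total / len(ratios)


objective = clipped_surrogate(
    ratios=[1.5, 0.5, 1.0],
    advantages=[2.0, -2.0, 1.0],
    epsilon=0.25,
)
```

In the first sample the ratio 1.5 is clipped to 1.25, so the gain from a good action is capped; in the second, the clipped term is the smaller one, so the penalty for a bad action is not softened.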

Asynchronous Advantage Actor-Critic (A3C)
A3C [23,24] is a multi-threaded algorithm based on the AC framework. By asynchronously executing multiple agents and taking the different states experienced by the parallel agents as training samples, the correlation between state transition samples generated in a single training process is removed. This algorithm can be implemented with only one standard multi-core CPU (very suitable for micro-labs) and is superior to traditional methods in terms of efficiency, time, and resource consumption. In the update of the actor network in A3C, each thread accumulates gradients computed with its own parameters and applies them to the shared parameters: $\theta^{\mu}$ (and likewise $\theta^{Q}$) represents the global shared parameters, and $\theta^{\mu}_i$ and $\theta^{Q}_i$ are the i-th thread-specific parameters.
Additionally, the critic network is updated in the same asynchronous manner.
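As a point of reference, the per-thread gradient accumulation of A3C is commonly written as below. The return $G_t$ and the accumulators $d\theta^{\mu}$ and $d\theta^{Q}$ are assumed notation taken from the usual presentation of the algorithm, mapped onto the shared parameters $\theta^{\mu}$, $\theta^{Q}$ and thread-specific parameters $\theta^{\mu}_i$, $\theta^{Q}_i$ used here; the exact form in this work may differ.

```latex
\begin{aligned}
d\theta^{\mu} &\leftarrow d\theta^{\mu}
  + \nabla_{\theta^{\mu}_i} \log \pi(A_t \mid S_t;\, \theta^{\mu}_i)\,
    \big(G_t - V(S_t;\, \theta^{Q}_i)\big), \\
d\theta^{Q} &\leftarrow d\theta^{Q}
  + \frac{\partial \big(G_t - V(S_t;\, \theta^{Q}_i)\big)^{2}}{\partial \theta^{Q}_i}.
\end{aligned}
```

Each thread accumulates these gradients over its own rollout, applies them asynchronously to the shared parameters, and then synchronizes its local copy with the shared one.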

Soft Actor-Critic (SAC)
SAC [27] is an off-policy, model-free DRL algorithm whose sample efficiency is sufficient to solve real-world robot learning problems in hours. In addition, SAC's hyperparameters are quite robust, requiring only a single hyperparameter set to perform well in different simulation environments. SAC is an AC framework with the maximum entropy objective

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(S_t, A_t) \sim \rho_{\pi}}\left[R(S_t, A_t) + \delta\,\mathcal{H}\big(\pi(\cdot \mid S_t)\big)\right],$$

where $\rho_{\pi}$ denotes the state and state-action marginals of the trajectory distribution induced by a policy $\pi(A_t \mid S_t)$, and $\mathcal{H}$ is the entropy of the policy. $\delta$ is the temperature parameter that determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy. Furthermore, Haarnoja et al. [27] proposed a practical multimodal representation based on a mixture of K Gaussians to provide a distribution in medium-dimensional action spaces. This distribution is more expressive and easier to handle, and it can endow SAC with more effective exploration and robustness in the framework of entropy maximization.
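The role of the temperature in the maximum entropy objective can be illustrated with a small sketch. It assumes a diagonal Gaussian policy, for which the entropy has a closed form; this is a simplification chosen for illustration (the function names are hypothetical, and practical SAC implementations estimate the entropy from sampled log-probabilities instead).

```python
import math


def gaussian_entropy(sigmas):
    """Differential entropy of a diagonal Gaussian with standard deviations `sigmas`."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in sigmas)


def entropy_augmented_reward(reward, sigmas, temperature):
    """Per-step term of the objective: reward plus temperature times policy entropy."""
    return reward + temperature * gaussian_entropy(sigmas)


# A wider (more exploratory) policy earns a larger entropy bonus.
narrow = entropy_augmented_reward(1.0, sigmas=[0.5, 0.5], temperature=0.2)
wide = entropy_augmented_reward(1.0, sigmas=[2.0, 2.0], temperature=0.2)
```

With a temperature of zero the bonus vanishes and the objective reduces to the ordinary expected reward, while a larger temperature favors more stochastic policies.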

System Implementation and Simulation Details
This section first describes the system framework based on multiple DRL algorithms for testing the Explorer game and configures the core parameters; then, it introduces the simulation environment of hardware and software in the experiment. Finally, based on different goal-driven rewards, the proposed system is run in games with different difficulty levels, and the corresponding metrics are obtained and analyzed.

System Configurations
The complete system framework is shown in Figure 6. The left blue box is used to interact with the Explorer environment and generate the estimated state of the agent. The right grey box is a DRL-based decision framework to guide the behavior of the agent (UCAV). There are five algorithms in the DRL framework, and the detailed hyperparameters of each algorithm are presented in Table 2. Please note that in Table 2, 2 × 2FC-(520, 1024, 1) means that there are two networks, each consisting of two fully connected (FC) layers with 520 input units, 1024 connection units, and 1 output unit.

Simulation Platform of Software and Hardware
The software and hardware versions of the simulation platform clearly limit the learning efficiency of the DRL system. We tested the five DRL algorithms side by side on a single platform. The hardware platform is a PC running Windows 8 with a Xeon E5-1620 v4 3.50 GHz CPU and 16 GB of DDR4 RDIMM memory. The software configuration of the platform is as follows: (1) Python v3.5.4 is used as the programming language; (2) TensorFlow v1.7.0 is used as the machine learning framework; and (3) Pyglet v1.3.2 is used to visualize the maneuvering physical models.

Simulation Details and Metrics
Based on the combination of four game levels and three reward functions, we design 12 groups of comparison simulations for the DRL framework. At each level, 10,000 episodes of the game are run with a specific random seed used for randomly generating the training data; to enhance the generalization ability of the DRL framework, the initial states of the UCAV and the target observation station are randomly reinitialized at the beginning of each episode. To observe the behavior changes of the UCAV before and after training, we introduce a set of random actions as the benchmarking strategy for game agents in each episode.
In the comparison of the simulation results, we divide the total episodes of the game into two parts: the first 7500 episodes are used to train the DRL framework, and the last 2500 episodes are used to test the performance of the DRL framework after training. Additionally, although each episode of the game has a fixed number of termination steps (the maximum operational steps corresponding to the current task), if the UCAV successfully tracks the target for 50 cycles or collides with the target, then the episode is terminated early.
Four metrics, the win rate (WR), tracking rate (TR), mean accumulated reward (MAR), and computational time (CT), were employed in the experiments. WR refers to the percentage of episodes in which the UCAV ends the game by destroying the target observation station, and TR refers to the percentage of episodes in which the UCAV successfully locates the target and keeps tracking it until the end of the game in the current mission. For the desired combat strategies, WR and TR represent the UCAV's ability to find the optimal strategy and the sub-optimal strategy, respectively. Because physical constraints force the UCAV to constantly adjust its attitude when approaching the target, the behavior of the UCAV at the end of the path planning is more worthy of attention than that at the beginning of the game. The MAR is therefore computed over the final steps of each trajectory according to Equation (23), where $|\phi|$ denotes the total number of steps of the current task trajectory, ∨ and ∧ are the maximum and minimum operators, respectively, $t = -1$ denotes the last step of the trajectory, and $\zeta$ is the number of steps accumulated in the MAR after completing the task. The MAR in Equation (23) can be used to observe how the UCAV understands the behavior of "target-searching" [1]. Please note that for each level in Explorer, WR and TR are statistical values calculated after the complete test set has been run, whereas MAR is returned immediately at the end of each episode.

Although colliding with a target brings a great reward, the policies that produce this behavior are sparsely distributed. Therefore, the agent tends to fall into a local optimum when learning in a continuous state space. As a result, the DRL-based agents are always trained with a large variance, which causes an unstable oscillation of the MAR value even when the algorithm converges. Therefore, multi-episode averaging is adopted to smooth the MAR data:

$$\overline{\mathrm{MAR}}_i = \frac{1}{\zeta_e} \sum_{j=(i-1)\zeta_e + 1}^{i\,\zeta_e} \mathrm{MAR}_j,$$

where $i = 1, 2, 3, \ldots, 10{,}000/\zeta_e$, $\mathrm{MAR}_j$ is the MAR of the j-th episode, and $\zeta_e$ is the smoothing scale. To adaptively observe the performance of the DRL framework at different game levels (from level 0 to level 3), we set $\zeta$ to 1/40 of the maximum operational steps of the current task and set $\zeta_e$ to 40.
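The episode-level metrics and the multi-episode smoothing can be sketched as follows. The outcome labels `"win"`, `"track"`, and `"fail"` and the function names are hypothetical, and the smoothing is assumed to average consecutive, non-overlapping blocks of episodes, which matches an index running from 1 to 10,000 divided by the smoothing scale.

```python
def win_and_tracking_rates(outcomes):
    """WR and TR in percent from one outcome label per test episode.

    Labels are illustrative: "win" (target destroyed), "track" (target
    tracked until the end of the episode), "fail" (anything else).
    """
    n = len(outcomes)
    wr = 100.0 * sum(o == "win" for o in outcomes) / n
    tr = 100.0 * sum(o == "track" for o in outcomes) / n
    return wr, tr


def smooth_mar(mar_values, scale):
    """Average consecutive non-overlapping blocks of `scale` episodes."""
    n_blocks = len(mar_values) // scale
    return [
        sum(mar_values[i * scale:(i + 1) * scale]) / scale
        for i in range(n_blocks)
    ]


wr, tr = win_and_tracking_rates(["win", "win", "track", "fail"])
curve = smooth_mar([0.0] * 10000, scale=40)  # 250 points for 10,000 episodes
```

With a smoothing scale of 40, the 10,000 per-episode MAR values of one run collapse to 250 points, which suppresses episode-to-episode oscillation while keeping the overall convergence trend visible.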

Simulation Results
In this section, simulation results are analyzed from the three aspects of reward function, behavior strategy, and algorithmic robustness.
Analysis of reward-shaping effects: The convergence performance of the five algorithms in the DRL framework for tasks of different levels is shown in Figure 7. The long-term stability of the mean value of MAR indicates that the trained UCAV has successfully mastered the policy of target-searching, and the higher the peaks of the MAR curve, the more wins the UCAV has, i.e., the stronger the ability of the UCAV to find the global optimal solution. As demonstrated by Figure 7, the DRL framework with a pure goal-driven incentive maintains good performance in simple tasks (such as level 0 and level 1), but for high-level tasks (such as level 2 and level 3), the quality of the behavior generated by the framework decreases dramatically, which leads to the divergence of MAR values. When distance-based rewards are used, the SAC algorithm shows a certain improvement in WR at all task levels, and the performance of the PPO (KL) and PPO (CLIP) algorithms changes drastically, while the performance of the A3C and DDPG algorithms hardly changes. By contrast, the BHA-driven incentive brings the performance of the DRL framework to the state of the art: in the most difficult game, algorithms including SAC show an obvious convergence trend, and the PPO algorithm even reaches 88.68% WR in 8 hours.

Analysis of behavior strategies: Figure 8 shows which maneuvering strategy works well for the UCAV to search for and destroy the target. Through careful observation, we conclude that an algorithm that performs well in searching for targets outputs a low and stable BHA value in motion planning. Clearly, the BHA value obtained by the DRL framework using R t as the incentive is smaller and more stable than the BHA values obtained with the other two reward functions. Therefore, it is feasible to optimize the convergence efficiency of the DRL algorithm by reducing the learning of redundant control information.

Analysis of the algorithmic robustness: Based on the detailed performance values reported in Table 3, the robustness of the DRL framework can be comprehensively evaluated. The DRL framework always maintains outstanding performance in low-difficulty tasks: here, the A3C algorithm not only has the best searching ability but also a satisfactory computational time. The combination of DRL algorithms and the reward function R t is effective for solving high-level tasks, among which the performance of PPO (CLIP) is the most commendable. Additionally, the high sample efficiency of the SAC algorithm allows its MAR value to converge quickly and achieve 63.48%/23.44% WR even when using R t /R t as the reward function in the most difficult task (up to 30 million total samples). However, one unsatisfactory point is that the SAC algorithm is more likely to fall into a local optimum (a high TR rather than a high WR), which often makes it take more time to find the global optimal strategy in simple tasks. To summarize, applying BHA-based reward-shaping in the DRL framework enables the UCAV to automatically and quickly complete fine and long-term navigation tasks, including intelligent target-searching and target-tracking.
Videos of all Explorer experiments are available at the following website: https://github.com/youshixun/vCEW.

Conclusions
In the framework, five DRL algorithms are adapted to the Explorer game, aiming to solve long-term intelligent target search tasks and other well-characterized electronic warfare tasks, including target-tracking and target striking. First, to enable the UCAV to accurately estimate the target position in 3D space, we use path integration to generate a large number of data samples and perform supervised training on the designed grid cell network to obtain the predicted features of VBN rather than the traditional relative displacement. Second, by analyzing the motion state of the continuous control model and the information redundancy existing between different states, a reward-shaping method based on the BHA is designed. Finally, the network configuration details of the five DRL algorithms are refined and combined with the three goal-driven incentives to form a complete DRL framework. Extensive experiments show that the combination of the BHA-driven incentive and the A3C algorithm is suitable for the autonomous motion planning of low-precision short-term navigation tasks; for any difficulty of the Explorer game, the PPO algorithm has excellent performance when using BHA-based rewards.
Our work not only pioneered verifying the ability of popular DRL algorithms to perform target-searching tasks in a 3D continuous action space, but also achieved breakthrough results in optimizing the end-to-end action policies. However, many aspects remain worthy of in-depth research in the future, among which we are most concerned with solving navigation problems in environments of multiagent cooperation and competition, such as real-time obstacle avoidance and moving target-tracking. In addition, when introducing weapon models such as missiles and jammers into the vCEW framework, visual physics engine software is needed to better evaluate the interaction between these models and various game environments. At the same time, with the increasing complexity of engineering models, the development of high-performance DRL algorithms is urgently needed.

Figure 1. (a) displays the interface of the Explorer game, with a total pixel size of 600 × 1050; the 600 × 600 panel on the left side displays the map from the global perspective, and the 600 × 450 panel on the right side displays the status information of the UCAV and the observation station. (b) shows the maximum detection range of the UCAV (reconnaissance perspective). The ratio with respect to the real environment is 1:25, i.e., 1 pixel corresponds to 25 m.

Figure 2. Grid network architecture in the supervised learning experiment. The recurrent layer is an LSTM with 128 hidden units. c and h represent the place cell activations and head-direction cell activations, respectively.

Figure 3. Schematic diagram of the behavior angle of the UCAV in motion.

Figure 4. The behavior incentives provided by different reward functions: (a-c) correspond to R t , R t , and R t , respectively. The thickness of the red arrow indicates the proportion of information dissemination.

Figure 6. DRL-based cognitive system for Explorer. Please note that S t is the cognitive state of the UCAV, and R t is the real reward induced by the real state of the Explorer game.

Figure 7. Evaluating the convergence performance of the DRL framework in Explorer. The twelve subfigures from left to right correspond to the results for game levels 0 to 3, and from top to bottom correspond to the reward functions R t , R t , and R t , respectively.

Figure 8. Investigating the potential behavior policies learned by the UCAV in Explorer. The twelve subfigures from left to right correspond to the results for game levels 0 to 3, and from top to bottom correspond to the reward functions R t , R t , and R t , respectively.

Table 3. Performance of the proposed DRL framework at different game levels and reward functions; the metrics include the WR (%), TR (%), and CT (h).