Article

DQN-Based Shaped Reward Function Mold for UAV Emergency Communication

School of Information and Communication, National University of Defense Technology, Wuhan 430000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10496; https://doi.org/10.3390/app142210496
Submission received: 2 August 2024 / Revised: 8 November 2024 / Accepted: 12 November 2024 / Published: 14 November 2024
(This article belongs to the Section Aerospace Science and Engineering)

Abstract:
Unmanned aerial vehicles (UAVs) have emerged as pivotal tools in emergency communication scenarios. In the aftermath of a disaster, UAVs can serve as communication nodes that provide communication services for users in the affected area. In this paper, we establish a carefully crafted virtual simulation environment and leverage deep reinforcement learning algorithms to train UAV agents. However, the development of reinforcement learning algorithms is beset by challenges such as sparse rewards and protracted training durations. To mitigate these issues, we devise an enhanced reward function aimed at improving training efficiency. We first delineate a specific mountainous emergency communication scenario and combine it with the particular application of UAVs to construct a realistic virtual simulation environment. We then introduce a supplementary shaped reward function tailored to alleviate the sparse reward problem. By refining the DQN algorithm and devising a reward structure grounded in potential functions, we observe marked improvements in the final evaluation metrics, substantiating the efficacy of our approach. The experimental results show that our method effectively shortens training time while increasing the convergence rate. In summary, our work underscores the potential of leveraging sophisticated virtual environments and refined reinforcement learning techniques to optimize UAV deployment in emergency communication contexts.

1. Introduction

Emergency communication pertains to scenarios where the primary communication infrastructure sustains significant damage, prompting emergency response departments to coordinate efforts across various agencies and address crises efficiently. Given the expansive territories of many countries and the prevalence of natural disasters in certain regions, establishing a scientific and effective emergency communication system is of paramount importance.
The mobility and positioning of base stations are further constrained by factors such as road accessibility, terrain undulations, and vegetation cover. Additionally, the placement of these base stations must adhere to specific topographic and geomorphic conditions [1]. Consequently, in numerous instances, large-scale UAV networking for emergency support is impracticable due to terrain, resource, and other limitations. In such circumstances, it becomes crucial to implement dynamic planning strategies for individual or groups of UAVs, with the objective of maximizing coverage and protection for users in need.
After the initiation of post-disaster relief operations, commanders require a range of fundamental information, including personnel locations and critical areas, to support their decision-making processes. In such situations, UAVs can serve as nodes, tasked with traversing all sensors within the disaster zone and promptly relaying the gathered information back to the command post [2]. Under emergency conditions, command post nodes play a pivotal role in swiftly integrating information reported from various levels.
However, it remains challenging for the command post’s guarantee node to receive timely situation updates from the disaster area and make accurate decisions. Therefore, it is worthwhile to consider how to optimize the efficiency of a single UAV, ensuring it can cover and support as many users as possible. To address this, this paper establishes a virtual simulation environment and employs deep reinforcement learning to enable agents to autonomously explore solutions, providing valuable insights for intelligent UAV planning.
The UAV node is characterized by its strong mobility and its insensitivity to the terrain environment. In military contexts, UAVs play a crucial role in operations such as battlefield reconnaissance, surveillance, border patrol, and precision strikes [3]. Yin et al. utilized deep neural networks for the topology planning of emergency communication networks [4], while Chen et al. employed the DQN algorithm for the hierarchical design of such networks [5]. Lyu et al. proposed a method for determining the optimal deployment location of UAVs [6].
However, the aforementioned research has not yet fully maximized the support efficiency of a single UAV. In this paper, we enhance the UAV system by integrating the DQN algorithm with reward shaping. The simulation environment serves as the input, and through training, the agent UAV is able to automatically adjust its position. Ultimately, the improved efficiency of the trained UAV is effectively demonstrated.
Focusing on the design of the reward function, Ng et al. crafted a heuristic reward function that incorporates both distance-based and subgoal-based elements [7]. Experimental results demonstrated its significant impact on reducing training time. Meanwhile, Dong [8] introduced a theoretical optimization framework rooted in reward shaping drive, aiming to enhance the training efficiency and stability of reinforcement learning, addressing the current challenge of low efficiency in reinforcement learning training.
In recent years, despite the notable advancements in the application of UAVs in emergency communication, the challenge of swiftly locating all nodes within a given environment remains pressing, given the necessity to ensure timely communication. While existing communication routing methods offer some relief, they fail to fundamentally address the underlying bottleneck. Furthermore, with the progression of artificial intelligence, there is an escalating demand for heightened speed and accuracy in data processing, necessitating improvements in the timeliness of current technological approaches.
Currently, the majority of research on UAV emergency communication networks centers on routing planning for UAVs with limited exploration of deep learning and other methodologies. A systematic investigation into these areas is lacking. Additionally, the existing theoretical frameworks and models may exhibit constraints when applied in practical scenarios. Consequently, developing an algorithm that not only facilitates rational action selection but also accelerates the learning pace of the agent has emerged as a crucial research gap.
In this paper, the application of reward shaping to the DQN algorithm is explored, resulting in a certain enhancement of the algorithm’s training efficiency. The primary contribution of this work lies in the continuous adjustment of parameters, optimization of network architecture, and improvement of agent performance. Furthermore, a reward function grounded in potential functions is incorporated to further elevate the algorithm’s overall performance. The work of this paper includes three aspects:
  • Abstract modeling is carried out for typical application scenarios of communication equipment such as UAVs, and the environment is designed on the Atari platform.
  • A more effective action selection strategy replaces the ε-greedy strategy of the traditional DQN algorithm, and its performance is better than before.
  • The reward function is redesigned to improve the efficiency and stability of the algorithm, and experimental verification is carried out.
In Section 1, we summarize the current research status. In Section 2, we establish the system model and introduce the DQN algorithms. Section 3 contains the experiments and analysis of the experimental results. Then, Section 4 provides a summary and directions for further improvement.

2. Materials and Methods

2.1. Reinforcement Learning

Reinforcement learning (RL) primarily involves abstracting real-world scenarios and transforming actual problems into situations where agents employ actions and strategies to pursue the optimal reward [9]. This approach does not necessitate extensive human prior knowledge but instead relies heavily on substantial computational power to conduct numerous experiments, enhancing the performance of the agents.
The action-value function represents the cumulative return obtained by the agent from the current state to the end of the episode when a given action is taken. It can be expressed as follows:
Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right].
More than one policy may attain the optimal action-value function; it satisfies the Bellman optimality equation
Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right].
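As a concrete illustration of the Bellman optimality relation above, the minimal Python sketch below computes the one-step target r + γ max_a′ Q(s′, a′) for a small tabular Q function; the Q values, reward, and states are invented purely for illustration.

```python
import numpy as np

# Hypothetical tabular Q function: rows are states, columns are actions.
Q = np.array([
    [0.0, 1.2, 0.5],   # state 0
    [0.8, 0.3, 2.1],   # state 1
])

gamma = 0.98   # discount factor (illustrative)
r = 1.0        # immediate reward observed after taking action a in state s
s_next = 1     # next state reached

# Bellman optimality target: r + gamma * max_a' Q(s', a')
target = r + gamma * Q[s_next].max()
print(target)  # 1.0 + 0.98 * 2.1 = 3.058
```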

2.2. Deep Q-Learning Network

In the professional realm, research on deep reinforcement learning algorithms is primarily bifurcated into value function-based and policy function-based methodologies. The DQN algorithm falls under the category of value function-based deep reinforcement learning. By incorporating an experience replay mechanism, an experience pool is established to store training samples. Subsequently, a small batch of these samples is selected for iterative updating. This technique effectively mitigates the correlation between samples [10].
At the same time, the target network and the evaluation network are separated to improve the stability of the algorithm. The DQN algorithm sets up two neural networks: a target network and an action (evaluation) network. First, a neural network is used to fit the Q value, and a shaping function F is designed to reset the reward function according to the agent's collision outcomes. The parameters of the action network are updated in real time and are copied to the target network after every N iterations. Separating the two networks improves the stability and convergence of the algorithm [11]. The optimal value function can be expressed as
Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi \right].
The basic DQN algorithm uses the ε-greedy strategy to balance exploration and exploitation and improve the agent's exploration of the environment. The strategy can be expressed as
\pi^{*}(s) =
\begin{cases}
\arg\max_{a} Q^{\pi}(s, a), & p = 1 - \varepsilon \\
\text{random } a \in \mathcal{A}, & p = \varepsilon
\end{cases}
The agent randomly selects an action from the action space with probability ε and selects the currently known optimal action with probability 1 − ε. The traditional DQN algorithm overcomes the curse of dimensionality that the Q-learning algorithm faces in complex environments. However, it still cannot overcome the sparse reward problem, which costs a lot of extra training time.
The framework of DQN is shown in Figure 1.
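To make the two-network structure described above concrete, the following PyTorch sketch sets up a small Q-network, a target network that is periodically synchronized with it, and an experience replay buffer. It is an illustrative reconstruction under assumed layer sizes and placeholder state/action dimensions, not the authors' exact implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network; layer sizes are illustrative."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    """Experience replay: store transitions and sample de-correlated mini-batches."""
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Evaluation (online) network and target network with identical initial weights.
# The state and action dimensions here are placeholders.
q_net = QNetwork(state_dim=4, n_actions=5)
target_net = QNetwork(state_dim=4, n_actions=5)
target_net.load_state_dict(q_net.state_dict())
# Every K gradient steps the online weights are copied into the target network
# via target_net.load_state_dict(q_net.state_dict()).
```

The replay buffer breaks the temporal correlation between consecutive samples, while the periodic copy keeps the bootstrap target fixed between synchronizations, which is what stabilizes training.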

2.3. System Model

This paper focuses on the application of Unmanned Aerial Vehicles (UAVs) in mountain emergency communication scenarios. An illustration depicting the UAV emergency communication networks is provided. In this study, we specifically concentrate on the flight speed and communication range of the UAV. Upon locating the target point, it is presumed that the UAV automatically establishes a communication connection. Given that only the UAV and the target point can engage in bidirectional communication, while the target points cannot communicate with each other, we can reasonably model this network as a star network.
To effectively address the primary research questions, this paper narrows its scope to the UAV maneuver model and the transmission model, disregarding other factors. We assume that the UAV maintains a constant speed and has adequate endurance, and that the transmission loss of the communication signal is negligible. The typical communication scenario is depicted in Figure 2.
This paper posits a scenario in a mountainous region where N users are located at unknown positions, alongside one UAV communication node and multiple relay nodes. Landslides have severely damaged all base stations within the area, rendering them inoperable. Given the narrow terrain and other constraints, only one UAV can be deployed as a communication node, while the remaining relay nodes stay functional. The UAV must therefore establish a route that connects all users and conveys crucial information to each node. We designate both ends of the communication path as relay nodes, and information is returned directly after being transmitted to the backbone node. Upon the information's arrival at a node, a reward is granted, and the corresponding user exits the network. The simplified test platform models time as discrete time steps, with the agent executing one action per step.
The disaster area is defined as a 25 × 43 grid in which all the objects are to be observed by the UAV. The UAV moves at a constant speed of 3 cells per step. To facilitate learning, it is assumed that the UAV agent always starts from the same position. At the initial moment, the UAV agent may select any object as its first observation target and start moving. After completing the observation round for all objects, the UAV agent receives the corresponding reward. If the observation cannot be completed within the specified time, the round ends and the UAV agent receives no reward.
This article assumes the following (a minimal environment sketch based on these parameters is given after the list):
  • Attenuation and other propagation factors in the transmission process are not considered, which simplifies the system model.
  • A relay node forwards the signal immediately after receiving it, with no intermediate delay.
  • The UAV flies at a constant, slow speed and moves only in two-dimensional space.
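To ground these scenario parameters, the sketch below models the disaster area as a 25 × 43 grid world with a Gym-style reset/step interface. Only the grid size, the 3-cells-per-step motion, the fixed start position, and the reward for covering a user are taken from the text; the number of users, the step budget, and the reward values are assumptions.

```python
import numpy as np

class UAVGridEnv:
    """Minimal sketch of the 25 x 43 disaster-area grid described above.

    The UAV starts from a fixed position, moves 3 cells per step, and earns a
    reward when it reaches an uncovered user. The episode ends when all users
    are covered or the step budget runs out. Values not stated in the paper
    (user count, step budget, reward magnitudes) are assumptions.
    """

    GRID_H, GRID_W = 25, 43
    CELLS_PER_STEP = 3
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, n_users=5, max_steps=200, seed=25):
        self.rng = np.random.default_rng(seed)
        self.n_users = n_users
        self.max_steps = max_steps
        self.reset()

    def _random_user(self):
        # Snap users to the UAV's reachable lattice (multiples of the step
        # size) so every user can actually be visited; a simplification.
        r = int(self.rng.integers(0, self.GRID_H // self.CELLS_PER_STEP + 1)) * self.CELLS_PER_STEP
        c = int(self.rng.integers(0, self.GRID_W // self.CELLS_PER_STEP + 1)) * self.CELLS_PER_STEP
        return (min(r, self.GRID_H - 1), min(c, self.GRID_W - 1))

    def reset(self):
        self.pos = (0, 0)                          # fixed start position
        self.users = {self._random_user() for _ in range(self.n_users)}
        self.steps = 0
        return self._obs()

    def _obs(self):
        return np.array([*self.pos, len(self.users)], dtype=np.float32)

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr * self.CELLS_PER_STEP, 0), self.GRID_H - 1)
        c = min(max(self.pos[1] + dc * self.CELLS_PER_STEP, 0), self.GRID_W - 1)
        self.pos = (r, c)
        self.steps += 1

        reward = 0.0
        if self.pos in self.users:                 # covered a user node
            self.users.remove(self.pos)
            reward = 1.0
        done = not self.users or self.steps >= self.max_steps
        return self._obs(), reward, done, {}
```

A short usage example: `env = UAVGridEnv(); obs = env.reset(); obs, r, done, _ = env.step(3)` moves the UAV three cells to the right and returns the new observation.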

2.4. Proposed Approach

We select a typical scene and build an abstract model of it. Based on the real scene, a simulation platform is constructed; the specific parameter settings were discussed in the previous section. To train the UAV agent, we use the DQN algorithm and modify both the action selection strategy and the reward function design. The main workflow is shown in Figure 3.

3. Improvement and Experiment

3.1. Improvement of Algorithm

Plain reinforcement learning algorithms have a high computational cost: they often require millions of training samples or more, and training remains slow and unstable. The main cause is the sparse reward problem: the agent only obtains a reward in the final state of an episode, so intermediate states provide no learning signal [12].
For discrete spaces, current research focuses mainly on the algorithm level, while the reward function is designed from general theory. The reward function in DQN also depends mainly on the environment, which leaves considerable room for improvement. The Q function is approximated by a convolutional network [13].
The training efficiency, convergence speed, and stability of RL can be improved by a well-designed reward function. RL consumes substantial training resources, so repeatedly experimenting with the reward function adds further cost. Reward shaping accelerates reinforcement learning by adding a potential-based term, in an appropriate mathematical form, to the original reward function. However, in many practical cases the agent receives little or no reward, which leads to learning failure and ineffective exploration [14].
Shaping functions apply both to sparse-reward environments and to environments that provide action-level evaluation [15]. A reward-shaping module works alongside a control module that outputs a sequence of actions to maximize the reward signal [16]. With the external reward unchanged, designing internal rewards is equivalent to injecting prior human knowledge; at the same time, policy invariance is essential for reward shaping.
The framework of reward shaping is shown in Figure 4.
The following formulas show that potential-based reward shaping does not change the original optimal policy:
R'(s, a, s') = R(s, a, s') + F(s, s'),
F(s, s') = \gamma \varphi(s') - \varphi(s).
With a potential-based shaping function, the optimal policy of reinforcement learning remains unchanged after the dynamic potential term is introduced. Building on the DQN algorithm, this paper introduces such a reward-shaping mechanism. We classify the agent's actions, assigning a positive reward for progress toward the goal and a negative or zero reward otherwise. When the agent selects its next action according to the action-value function, it preferentially chooses actions that make progress toward the goal.
The reward function based on dynamic potential energy can be expressed as
F(s, t, s', t') = \gamma \varphi(s', t') - \varphi(s, t).
The following derivation shows that the shaping term in the Q value does not depend on the action selection, so the optimal policy remains unchanged:
Q_{i,\varphi}^{*}(s, a) = \sum_{s'} \Pr(s' \mid s, a)\, U_{i,\varphi}(s')
= \sum_{s'} \Pr(s' \mid s, a) \left( U_i(s') - \varphi(s', t') \right)
= \sum_{s'} \Pr(s' \mid s, a)\, U_i(s') - \sum_{s'} \Pr(s' \mid s, a)\, \varphi(s', t')
= Q_i^{*}(s, a) - \mathbb{E}_{s'}\left[ \varphi(s', t') \right].
where F denotes the additional shaping reward and φ(s′, t′) denotes the potential of the state reached at the next time step; the formula indicates that both F and φ change dynamically with time.
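As a minimal sketch of the potential-based shaping defined above, the helpers below add F(s, s′) = γφ(s′) − φ(s) to the environment reward. The particular potential used here (negative Manhattan distance to the nearest uncovered user) is an assumed example, not necessarily the paper's exact choice.

```python
GAMMA = 0.98

def potential(pos, remaining_users):
    """Illustrative potential: being closer to the nearest uncovered user
    yields a higher (less negative) potential."""
    if not remaining_users:
        return 0.0
    return -min(abs(pos[0] - u[0]) + abs(pos[1] - u[1]) for u in remaining_users)

def shaped_reward(env_reward, pos, next_pos, users, next_users, gamma=GAMMA):
    """R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s)."""
    F = gamma * potential(next_pos, next_users) - potential(pos, users)
    return env_reward + F
```

Because the shaping term is a difference of potentials, its contribution telescopes along any trajectory, which is exactly why the optimal policy is preserved.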
This paper adds reward shaping to DQN; the pseudocode is shown in Algorithm 1.
Algorithm 1 DQN with reward shaping.
1. Initialize replay memory D to capacity N
2. Initialize action-value function Q with random weights θ
3. Initialize target action-value function Q̂ with weights θ⁻ = θ
4. For episode = 1, X do
5.  Update the position of the UAV
6.  For step = 1, M do
7.   Select an action with the softmax policy
8.   Set the shaped reward based on the step
9.   Execute the action and move on to the next state
10.  Store the transition in D
11.  Perform gradient descent on the loss function to update θ
12.  Every K steps, reset θ⁻ = θ
13. End for
14. End for
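The following PyTorch sketch mirrors Algorithm 1, reusing the QNetwork, ReplayBuffer, UAVGridEnv, and shaped_reward sketches introduced above. The softmax (Boltzmann) action selection, replay, loss, and periodic target reset follow the pseudocode; the temperature, network sizes, and episode count are assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as nnF

env = UAVGridEnv()
buffer = ReplayBuffer()
q_net = QNetwork(state_dim=3, n_actions=4)       # obs = (row, col, users left); 4 moves
target_net = QNetwork(state_dim=3, n_actions=4)
target_net.load_state_dict(q_net.state_dict())   # line 3: copy initial weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
GAMMA, BATCH, SYNC_K, TEMPERATURE = 0.98, 32, 1000, 1.0

def softmax_action(state):
    """Boltzmann exploration: sample an action with probability proportional to exp(Q / temperature)."""
    with torch.no_grad():
        q = q_net(torch.as_tensor(state).unsqueeze(0)).squeeze(0)
    return int(torch.multinomial(torch.softmax(q / TEMPERATURE, dim=0), 1))

global_step = 0
for episode in range(1, 1001):                   # "For episode = 1, X do"
    state = env.reset()                          # update the position of the UAV
    done = False
    while not done:                              # "For step = 1, M do"
        pos, users = env.pos, set(env.users)     # snapshot for the shaping term
        action = softmax_action(state)           # select an action with the softmax policy
        next_state, r_env, done, _ = env.step(action)
        r = shaped_reward(r_env, pos, env.pos, users, env.users)  # set the shaped reward
        buffer.push(state, action, r, next_state, done)           # store transition
        state = next_state
        global_step += 1

        if len(buffer) >= BATCH:
            s, a, rew, s2, d = zip(*buffer.sample(BATCH))
            s = torch.as_tensor(np.array(s)); s2 = torch.as_tensor(np.array(s2))
            a = torch.as_tensor(a); rew = torch.as_tensor(rew, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                # bootstrap target from the frozen network
                target = rew + GAMMA * target_net(s2).max(1).values * (1 - d)
            loss = nnF.mse_loss(q_sa, target)    # gradient descent on the loss function
            optimizer.zero_grad(); loss.backward(); optimizer.step()

        if global_step % SYNC_K == 0:            # "Every K steps, reset" the target network
            target_net.load_state_dict(q_net.state_dict())
```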

3.2. Experiment

Our experiments were performed on a computer with an Intel Core i7-8700K CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3060 GPU (NVIDIA, Santa Clara, CA, USA). The experiments were run on Windows, and the machine learning components used PyTorch 1.8.1. The algorithm was designed and simulated in an environment based on OpenAI's Atari platform, which we used to build a new experimental environment. The single UAV in the experiment could move left and right at different speeds and explore different areas.
In the experiment, Stable-Baselines3 (built on PyTorch) was used as the experimental baseline, and the standard DQN algorithm was compared with DQN augmented with reward shaping. The hyperparameter settings are shown in Table 1.
In this experiment, 2,000,000 training rounds were set for the UAV agent. The comparison results between the improved algorithm with reward shaping and the baseline algorithm are shown in Figure 5 and Figure 6.
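For reference, a baseline run with Stable-Baselines3's DQN might look like the sketch below. The environment is a placeholder (the paper's custom Atari-based UAV environment is not public), and only the hyperparameters from Table 1 and the 2,000,000-round budget are taken from the text; since Table 1 lists both Gamma (0.98) and a discount factor (0.99), one value has to be chosen here.

```python
import gym
from stable_baselines3 import DQN

# Placeholder environment; substitute the custom Atari-based UAV environment.
env = gym.make("CartPole-v1")

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-4,       # Table 1: Learning_rate
    buffer_size=1_000_000,    # Table 1: Replay_buffer_size
    batch_size=32,            # Table 1: Batch_size
    gamma=0.99,               # Table 1: Discount factor (a Gamma of 0.98 is also listed)
    max_grad_norm=5,          # Table 1: Grad_clipping_value
    seed=25,                  # Table 1: Seed
    verbose=1,
)
model.learn(total_timesteps=2_000_000)  # matches the 2,000,000 training rounds reported
```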
As shown in Figure 5 and Figure 6, after 2 million training rounds, the DQN with the shaped reward function obtains a higher reward than before. The improved algorithm reached a reward of 7 by the 500,000th training round, while the original DQN algorithm still obtained a reward of only about 5 after 2 million rounds.
This means that the improved algorithm trains the UAV agent faster, allowing the agent to adapt to the new environment and receive rewards more quickly. In summary, compared with the baseline algorithm and the traditional DQN algorithm, the DQN with the shaped reward function has better performance and convergence, and it enables the UAV agent to obtain better scores in fewer training rounds.

4. Conclusions and Future Work

Aiming at the problem of using UAVs as communication nodes in emergency situations, this paper constructs a virtual environment through abstract modeling. Based on DQN combined with a dynamic reward-shaping method, an improved reward function is designed. The hyperparameters are then tuned repeatedly according to the experimental results to improve the performance of the algorithm. Experiments show that this algorithm improves on the original algorithm, shortens the training time, improves training efficiency, and provides a reference for the autonomous positioning of UAVs under emergency conditions.
The intelligent decision-making approach presented in this paper still has some limitations. It does not evaluate the relative merits of different reward functions; it only verifies the positive effect of the shaped reward function on training efficiency, without verifying whether it affects the final performance of the trained agent. The next step is therefore to optimize the shaping method of the dynamic reward function and to design it appropriately for specific problems.

Author Contributions

Methodology, C.Y.; Data curation, C.Y.; Writing—original draft, C.Y.; Writing—review & editing, S.G.; Supervision, J.B.; Funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Zhang, L.; Fan, Q.; Ansari, N. 3-D Drone-Base-Station Placement with In-Band Full-Duplex Communications. IEEE Commun. Lett. 2018, 22, 1902–1905.
  2. Wang, T. Research on UAV Deployment and Path Planning for Emergency Communication; Beijing University of Posts and Telecommunications: Beijing, China, 2022.
  3. Yan, L.; Guo, W.; Xu, D.; Yang, H. A Method for Station Site Planning of Maneuverable Communication Systems Based on NSGA Algorithm. Appl. Res. Comput. 2022, 39, 226–230, 235.
  4. Yin, C.; Yang, R.; Zhu, W.; Zou, X. Emergency communication network planning method based on deep reinforcement learning. Syst. Eng. Electron. 2020, 42, 2091–2097.
  5. Chen, H.; Zhu, W.; Yu, S. Emergency communication network planning method based on deep reinforcement learning. Command. Control Simul. 2023, 45, 150–156.
  6. Lyu, J.; Zeng, Y.; Zhang, R.; Lim, T.J. Placement Optimization of UAV-Mounted Mobile Base Stations. IEEE Commun. Lett. 2017, 21, 604–607.
  7. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning, Bled, Slovenia, 27–30 June 1999.
  8. Dong, Y. Research and Application of Reinforcement Learning Based on Reward Shaping; Huazhong University of Science and Technology: Wuhan, China, 2022.
  9. Yu, F.; Hao, J.; Zhang, Z. Action exploration strategy in reinforcement learning based on action probability. J. Comput. Appl. Softw. 2023, 40, 184–189, 226.
  10. Shi, H. Research on DQN Algorithm in Complex Environment; Nanjing University of Information Science and Technology: Nanjing, China, 2023.
  11. Wu, J. Research on Overestimation of Value Function for DQN; Soochow University: Taipei, China, 2020.
  12. Yang, W.; Bai, C.; Cai, C.; Zhao, Y.; Liu, P. Sparse reward problem in deep reinforcement learning. Comput. Sci. 2020, 47, 182–191.
  13. Liu, H. Research on UAV Communication Trajectory Optimization Based on Deep Reinforcement Learning; Nanchang University: Nanchang, China, 2023.
  14. Li, Q.; Geng, X. Robot path planning based on improved DQN algorithm. Comput. Eng. 2023, 12, 111–120.
  15. Yang, D. Research on Reward Strategy Techniques of Deep Reinforcement Learning for Complex Confrontation Scenarios; National University of Defense Technology: Changsha, China, 2020.
  16. Niu, S. Research on Student Motivation Based on Reinforcement Learning; University of Electronic Science and Technology of China: Chengdu, China, 2022.
Figure 1. DQN.
Figure 2. Typical communication scenario.
Figure 3. Main work chart.
Figure 4. Reward shaping.
Figure 5. Training reward.
Figure 6. Training reward over episodes.
Table 1. Hyperparameter settings.

Hyperparameter        Value
Seed                  25
Learning_rate         0.0001
Grad_clipping_value   5
Replay_buffer_size    1,000,000
Batch_size            32
Gamma                 0.98
Discount factor       0.99
