A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment

Rahman, Shakila; Akter, Shathee; Yoon, Seokhoon

doi:10.3390/electronics14010001

Open AccessArticle

A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment

by

Shakila Rahman

¹

,

Shathee Akter

²

and

Seokhoon Yoon

^2,*

¹

Department of Computer Science, American International University-Bangladesh (AIUB), Dhaka 1229, Bangladesh

²

Department of Electrical, Electronic, and Computer Engineering, University of Ulsan, Ulsan 44610, Republic of Korea

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(1), 1; https://doi.org/10.3390/electronics14010001

Submission received: 15 November 2024 / Revised: 16 December 2024 / Accepted: 19 December 2024 / Published: 24 December 2024

(This article belongs to the Special Issue Advanced Unmanned Aerial Vehicle (UAV) Wireless Communication and Networking Technologies)

Download

Browse Figures

Versions Notes

Abstract

:

Unmanned aerial vehicles (UAVs) play a crucial role in various applications, including environmental monitoring, disaster management, and surveillance, where timely data collection is vital. However, their effectiveness is often hindered by the limitations of wireless sensor networks (WSNs), which can restrict communications due to bandwidth constraints and limited energy resources. Thus, the operational context of the UAV is intertwined with the constraints on WSNs, influencing how they are deployed and the strategies used to optimize their performance in these environments. Considering the issues, this paper addresses the challenge of efficient UAV navigation in constrained environments while reliably collecting data from WSN nodes, recharging the sensor nodes’ power supplies, and ensuring the UAV detours around obstacles in the flight path. First, an integer linear programming (ILP) optimization problem named deadline and obstacle-constrained energy minimization (DOCEM) is defined and formulated to minimize the total energy consumption of the UAV. Then, a deep reinforcement learning-based algorithm, named the DQN-based UAV detouring algorithm, is proposed to enable the UAV to make intelligent detour decisions in the constrained environment. The UAV must finish its tour (data collection and recharging sensors) without exceeding its battery capacity, ensuring each sensor has the minimum residual energy and consuming energy for transmitting and generating data, after being recharged by the UAV at the end of the tour. Finally, simulation results demonstrate the effectiveness of the proposed DQN-based UAV detouring algorithm in data collection and recharging the sensors while minimizing the total energy consumption of the UAV. Compared to other baseline algorithm variants, the proposed algorithm outperforms all of them.

Keywords:

deep Q-learning (DQN); wireless sensor networks (WSNs); deadline and obstacle-constrained energy minimization (DOCEM); unmanned aerial vehicles (UAVs); DQN-based UAV detouring algorithm

1. Introduction

Unmanned aerial vehicles (UAVs) have emerged as versatile tools for applications ranging from surveillance and reconnaissance to environmental monitoring and disaster management, particularly due to their ability to rapidly deploy and cover large areas. This capability makes them invaluable in scenarios where conventional ground-based approaches are impractical or insufficient. Simultaneously, wireless sensor networks (WSNs) have gained prominence for their role in distributed sensing and monitoring [1], enabling data collection from inaccessible or remote locations [2,3,4] while facilitating the wireless recharging of sensors power supplies [5,6,7]. However, deploying UAVs alongside WSNs introduces unique challenges, especially in constrained environments where energy and communication resources are limited. In such scenarios, optimizing UAV trajectories to ensure effective data collection from WSN nodes and recharging them while mitigating resource constraints becomes critical. Traditional approaches often rely on predefined routes or heuristics-based methods, which may fail to adapt to dynamic environmental conditions or utilize available resources efficiently. In the WSN area, the sensors or any other functioning devices have limited resources, such as battery capacity. A lack of proper infrastructure makes the energy hole a problem in any scenario. For example, in most of the traditional path planning studies [5,6,7] no obstacles have been considered, but obstacles existence is a must in any path planning problem.

To address these challenges, this paper proposes a novel approach based on Deep Q-learning (a reinforcement learning technique) for optimizing UAV trajectories in constrained WSN environments while collecting data from sensors and recharging them. By leveraging the capabilities of Deep Q-learning, the proposed algorithm enables the UAV to make intelligent detour decisions in real-time. It takes into account the sensor deadlines, minimum residual energy constraints, and the UAV’s battery limitations, all while ensuring effective obstacle avoidance. The goal is to minimize the total energy consumption of the UAV while prolonging the operational lifetime of the WSN.

Since UAVs must visit the sensors in order to gather data and recharge power supplies, deploying them presents difficulties. The variety of trajectories (the order the UAV visits the sensors) results in various overall traveling distances and data transmission times. In a lot of research, trajectory planning has been investigated for UAV-assisted data gathering and sensor charging in WSNs [2,3,4,5,6,7,8,9,10,11,12,13,14]. These studies, however, did not take into account the possibility of obstacles along the flight paths, which can include buildings, mountains, large trees, and regions designated as no-fly zones, that must be avoided as the UAV moves from one sensor to another. UAVs must return to the sink node to recharge their batteries after the predetermined period of data collection and sensor recharging. While some studies took obstacles into account [15,16,17,18,19,20], they did not ensure the UAV follows a Hamiltonian path, which is impractical. Also, most of these studies consider UAV data collection, and a few have considered UAV WSN charging. However, our proposed DOCEM problem focused on data collection and sensor charging.

In the proposed system, the UAV begins and ends its journey at the sink node, collecting data wirelessly from sensor nodes along the way and recharging them after data collection. It creates a Hamiltonian tour for the UAV, making the problem NP-hard, which we formulated using integer linear programming (ILP) a challenging task. This work extends our previous research [4], which focused on a system model addressing UAV obstacle avoidance along with UAV flight time constraints. In that study, we applied a Genetic Algorithm (GA)-based approach combined with a Vector Rotation Angle-based Obstacle Avoidance (VRAOA) algorithm to determine a near-optimal flight path for a group of UAVs collecting data from a group of sensors in a wireless sensor network. In this study, we develop the system model further to incorporate both data collection from sensors and recharging them while considering UAV energy constraints, the sensors’ residual energy and deadline constraints, and obstacle avoidance by the UAV. Given the increased constraints on the proposed system, we initially focused on a single UAV, and developed a Deep Q-Network (DQN)-based UAV detour algorithm to find a near-optimal solution, comparing the performance with algorithm variants across different simulation scenarios and obtaining superior results.

Figure 1 presents an exemplary problem scenario for the proposed system model’s WSN, which includes one sink node, four sensor nodes, a detour in the flight path around obstacles (buildings, trees, vehicles, etc.) that are converted to a convex polygon using the VRAOA algorithm [4].

The main contributions of this paper are as follows.

This paper introduces deadline and obstacle-constrained energy minimization (DOCEM) to address problems in UAV-aided WSN data collection and sensor power supply recharging, which considers the UAV’s travel deadline and energy constraints along with the sensors’ individual deadlines and residual energy constraints.
First, the DOCEM problem is defined and formulated as an ILP problem, where the objective is to minimize the total energy consumption of the UAV.
Second, A deep Q-learning UAV detouring algorithm is designed to determine the UAV’s near-optimal flight path considering the proposed scenario. A Markov decision process (MDP) is designed for the proposed algorithm where the detour path between any two sensors is obtained by combining the VRAOA algorithm and Dijkstra’s algorithm [4] when there is an obstacle.
Finally, simulations are performed under different scenarios where the DQN-based UAV detouring algorithm minimizes the total energy consumption of the UAV and minimizes its average arrival time at each sensor, compared to other baseline algorithm variants.

The article is structured with Section 2 providing background information on prior studies through a review of the literature. The proposed work’s complete system model and problem formulation are shown in Section 3, and the algorithms are presented in Section 4. Experimental results are thoroughly analyzed in Section 5 to provide insight into the efficiency of the proposed method. Section 6 brings the paper to a conclusion.

2. Related Works

Over the past two decades, advancements in robotics, computer vision, and AI have revolutionized UAV technology, transitioning it from basic military surveillance tools to fully autonomous systems. Early UAVs relied on simple algorithms and manual controls, but the integration of GPS, IMUs, and cameras enabled semi-autonomous capabilities like waypoint navigation and obstacle avoidance. Modern UAVs leverage deep learning, computer vision, and SLAM algorithms for tasks like mapping, target tracking, and obstacle avoidance. Despite these advancements, challenges like battery life, processing power, and secure communication continue to limit their full potential. These issues highlight the need for further research to enhance scalability and reliability. In this section, some existing literature has been added and compared with the proposed DOCEM problem.

DRL-based solution considering the traveling salesman problem: Recently, lots of work related to the TSP has been solved using DRL-based algorithms. For example, the paper in [21] presents enhanced heuristic solutions for routing problems through machine learning, focusing on the integration of neural networks with traditional improvement heuristics. This allows the system to iteratively refine solutions based on learned patterns. This allows the system to iteratively refine solutions based on learned patterns. The study in [22] addressed the traveling salesman problem by incorporating drone technology with Deep Reinforcement Learning (DRL), accounting for unique constraints such as a limited flight range and payload capacity. In ref. [23], NP-hard routing challenges like the TSP and the Vehicle Routing Problem were similarly tackled by learning collaborative policies through reinforcement learning (RL). This approach leverages multiple agents to explore and optimize routes efficiently. This approach leverages multiple agents to explore and optimize routes efficiently. In a different vein, the authors of [24] employed policy gradient methods to enhance TSP solutions, developing a neural network-based model trained to generate near-optimal tours using RL. The authors in [25] proposed a decomposition method for the TSP, breaking it into smaller, manageable sub-problems that are solved individually and then integrated into a complete tour. This hierarchical approach combines classical optimization techniques with modern machine learning, effectively addressing large datasets that traditional solvers struggle with. Another innovative study in [26] introduced a DRL-inspired architecture for the TSP, combining neural networks and RL to develop an agent that can generate high-quality solutions. The agent is trained on various TSP instances, allowing it to adapt to different problem configurations, thereby overcoming the limitations of traditional heuristics and exact algorithms. In ref. [27], a focus on learning the three-opt heuristic—a well-known local search technique for the TSP—was presented, showing how DRL can enhance iterative improvements to solution quality. The work in [28] similarly explored two-opt heuristics through DRL, illustrating the potential for neural network agents to iteratively refine TSP solutions. The pickup and drop-off problem in logistics was addressed in [29] by combining pointer networks with DRL, allowing effective learning of optimal routing strategies. Meanwhile, the study in [30] tackled the TSP with time windows and rejection using DRL, finding efficient routes that respect time constraints and handle rejected visits. The optimization of UAV trajectory planning in WSNs to minimize energy consumption was explored in [31], demonstrating how DRL can adapt UAV paths for effective data collection. In ref. [32], a double-level DRL framework was introduced for managing task scheduling among multiple UAVs, enhancing scalability and efficiency through a hierarchical policy structure. The autonomous navigation of UAVs in obstacle-rich environments was addressed in [33], leveraging a DRL-based framework for dynamic terrain navigation. Furthermore, the study in [34] optimizes routes for a truck-and-drone delivery system using DRL, aiming to improve delivery efficiency while adapting to various constraints. The above studies focused on solving the TSP problem using DRL-based algorithms, but none of them considered any other system constraints such as UAV battery power or deadline constraints and sensor node residual energy or deadline constraints along with obstacle avoidance or detouring. In contrast, our approach considers obstacles in the environment, ensuring that UAV data collection and sensor recharging tours are completed within designated flight times and energy limits, all while minimizing total energy consumption. We also account for sensor data collection and charging deadlines, along with the sensors’ residual energy constraints. Thus, the proposed scheme cannot use these other approaches directly, and their considered scenarios are different from ours.

A DRL-based solution without considering the TSP: Recently a lot of problems have been solved using DRL-based algorithms, but they did not consider the traveling salesman problem. For example, The work in [35] employed DRL to develop optimal transmission policies for UAV-aided WSNs, optimizing UAV flight paths and transmission schedules to enhance network performance and reliability. In ref. [36], a similar focus on multi-UAV-assisted sensor networks was presented where DRL is used to optimize communication scheduling and velocity control, addressing the coordination of UAV movements and management of communication tasks to improve overall network performance. Another study offers a comprehensive approach using DRL to optimize joint trajectory planning, transmission scheduling, and access control in UAV-assisted WSNs, dynamically adjusting UAV trajectories and transmission schedules to tackle challenges related to energy efficiency, communication reliability, and latency [37]. The application of DRL to enhance data collection in UAV-assisted Internet of Things (IoT) networks was explored in [38], where the trajectory of UAVs is optimized to ensure timely and efficient data gathering from IoT devices. Additionally, the authors in [39] presented a novel method for real-time trajectory planning for UAVs, optimizing paths based on factors such as energy efficiency, collision avoidance, and communications connectivity. In ref. [40], a unique approach leverages DRL to optimize the data collection process of UAVs that harvest energy from the environment. This method dynamically adjusts flight paths and data collection strategies to maximize efficiency and data throughput while ensuring sustainable energy usage. Similarly, the work in [41] explored multi-agent DRL techniques in wireless-powered UAV networks, optimizing UAV trajectories and energy consumption while ensuring efficient communication. The study in [42] incorporates long short-term memory networks within DRL frameworks to tackle continuous flight control and resource allocation challenges in UAV-assisted sensor networks. By capturing sequential dependencies in flight control actions and resource allocation decisions, this integration offers enhanced adaptability and efficiency in dynamic environments. Another approach in [43] utilizes DRL for timely data collection in UAV-based IoT networks, training UAVs to autonomously optimize their trajectories for efficient data gathering while considering energy consumption and communications link quality. The paper in [44] explored the use of DQNs to enhance aerial data collection efficiency in multi-UAV-assisted WSNs, addressing challenges such as energy consumption, communication reliability, and data latency. Finally, the authors of [45] investigated the application of DRL in optimizing UAV path planning for energy-efficient multi-tier cooperative computing within WSNs, dynamically adjusting UAV flight paths to minimize energy consumption and enhance overall network performance. Although the above studies considered a DRL-based solution, they did not ensure the UAV’s path is Hamiltonian.

In contrast, in our proposal, the UAV’s trajectory is a Hamiltonian tour where data collection and sensor charging are finished within the given flight time and energy allotments in an environment with obstacles. Furthermore, sensor deadline and residual energy constraints are considered to finish data collection and recharge each sensor. In this case, the shortest path between each pair of sensors—avoiding obstacles—is the sub-tour. Furthermore, the order the UAV visits the sensors (i.e., its trajectory or flight path) is Hamiltonian; to the best of our knowledge, no prior research has been uncovered where the UAV collects data and recharges the sensor nodes while avoiding multiple obstacles by solving the problem with a DQN-based algorithm, which is very hard to do. On the other hand, the Hamiltonian tour is a cycle that visits every vertex in a graph exactly once and ends at the same vertex it started at, which has a similar interest to our proposed DOCEM problem. Table 1 represents the comparison between the proposed DOCEM problem with other existing literature, which visualizes our novelty in the scientific field.

3. System Model and Problem Formulation

This study considered a WSN with a set of sensors

V (V = {0, 1, \dots, N})

, where 0 represents the sink node. Sensor nodes are deployed in a square area at random; the positions are known a priori. Each sensor generates data and stores the data in its buffer until the UAV collects it. To keep the sensors from completely draining their energy and to keep the WSN alive, the UAV periodically recharges them to a certain energy level.

In addition, we assume that

O (O = {1, 2, \dots, s})

is the set of obstacles in the environment and that the positions of the obstacles are known. If an obstacle is in the path of the UAV, i.e., between any two sensors the UAV visits serially for data collection and recharging, it detours the obstacle to reach the next sensor. Thus, the problem can be defined as follows:

Deadline and obstacle-constrained energy minimization (DOCEM): A set of sensors

V

, a UAV, and a set of obstacles

O

, find the UAV trajectories that minimize the total energy consumption of the UAV, where the UAV needs to finish the tour of data collection and recharging the sensors before a given deadline D and the UAV total energy consumption must not exceed its battery capacity

c_{u a v}

while avoiding all the obstacles in its path. Also, the UAV needs to arrive at each sensor j within a given deadline

d_{j}

and each sensor should have at least the minimum residual energy

e_{m i n}

, all the time.

Define a binary decision variable,

x_{i j}

where the value is 1 when the UAV visits sensor j right after visiting sensor i; otherwise, it is zero

x_{i j} = \{\begin{matrix} 1, & if the u a v visits sensor j right after visiting sensor i; i, j \in V, \\ 0, & otherwise \end{matrix}

(1)

Let us assume

α

,

β

, and

γ

, respectively, represent the power consumed by the UAV while traveling, the energy transfer rate from the UAV to sensor i while collecting data, and the energy transfer rate from the UAV to sensor i while recharging it. The UAV’s communication time is calculated as follows:

C_{i} = c_{i} + d_{i}

(2)

where

d_{i}

is the time spent by the UAV while charging sensor i, and

c_{i}

is the time spent while collecting data from sensor i. Considering the proposed system model, the UAV’s total energy consumption

E_{u a v}

is calculated as follows,

E_{u a v} = E_{t r a v e l} + E_{c o l l e c t i n g} + E_{c h a r g i n g}

(3)

E_{t r a v e l}

,

E_{c o l l e c t i n g}

, and

E_{c h a r g i n g}

, respectively, are the energy consumed by the UAV while traveling, while collecting data from the sensors, and while charging the sensors. Equation (3) can be rewritten from the above equations and notations as

E_{u a v} = \sum_{i \in V} \sum_{j \in V, i \neq j} (α (\frac{D_{i j}}{v_{u a v}}) + β c_{i} + γ d_{i}) x_{i j}

(4)

Here,

D_{i j}

is the UAV travel distance from sensor i to j and

v_{u a v}

represent the velocity of the UAV. The

D_{i j}

is calculated from the following equations,

D_{i j} = (a_{i j} R_{(i, j)} + b_{i j} (1 - R_{(i, j)}) x_{i j}

(5)

where

a_{i j}

and

b_{i j}

are the distances traveled from sensor i to sensor j when there is an obstacle and when there is no obstacle between them, respectively, with

a_{i j}

calculated using the Dijkstra algorithm [4], and

b_{i j}

is the Euclidean distance.

To incorporate the obstacles into the problem formulation, the sensor–obstacle relationship matrix

R

is defined to describe the relationship between sensors and obstacles. That is,

R_{(i, j)} = \{\begin{matrix} 1, & if there is an obstacle between the sensors i and j . \\ 0, & otherwise \end{matrix}

(6)

R

is used in Equation (5), and

α

[31] in Equation (4) is defined as,

α = (\frac{P_{m a x} - P_{i d l e}}{v_{m a x}}) v_{u a v} + P_{i d l e}

(7)

Here,

v_{m a x}

is the maximum velocity of the UAV, while

P_{m a x}

and

P_{i d l e}

denote hardware power levels when the UAV is traveling at full speed and when the UAV is idle, respectively.

When the UAV arrives at sensor i, it will spend time

c_{i}

collecting data from it. After that, the UAV spends time

d_{i}

charging the sensor’s battery via wireless power transfer [46]. We assume that while charging, the sensor’s battery is fully charged and does not exceed its battery capacity. After spending

C_{i}

time, the UAV leaves sensor i and travels to the next sensor on its flight path. Each sensor should have at least the minimum residual energy,

e_{m i n}

, all the time. Each sensor consumes energy for generating data and transmitting buffered data to the UAV. Let

e_{t}

and

e_{g}

, respectively, represent the energy consumed for transmitting and generating one bit of data. Furthermore,

δ

,

ρ

, and C, respectively, represent the data generation rate of the sensors, energy transfer efficiency between the UAV and sensor while recharging, and the energy transfer rate from the UAV to sensor i. After generating the data, transmitting it to the UAV, and replenishing the energy, the residual energy at sensor i by the end of a tour is calculated as follows [47]:

C ρ T_{i} - δ τ (e_{t} + e_{g}), i = 1, \dots, n

(8)

Here,

τ

is the time needed for the UAV to make one round trip, which is the sum of the traveling times and communication times (data collection and sensor recharging) represented by,

τ = \sum_{i \in V} \sum_{j \in V, i \neq j} (\frac{D_{i j}}{v_{u a v}} + C_{i}) x_{i j}

(9)

Table 2 lists the definitions for the rest of the symbols used in the problem formulation.

Therefore, the problem can be formulated as,

minimize \sum_{i \in V} \sum_{j \in V, i \neq j} (α (\frac{D_{i j}}{v_{u a v}}) + β c_{i} + γ d_{i}) x_{i j}

(10)

Subject to

\sum_{j \in V, j \neq i} x_{i j} = 1, i \in V \ {0}

(11)

\sum_{i \in V, i \neq j} x_{i j} = 1, j \in V \ {0}

(12)

\sum_{j \in V, j \neq 0} x_{o j} = 1

(13)

\sum_{i \in V, i \neq 0} x_{i 0} = 1

(14)

\sum_{i \in V} \sum_{j \in V, i \neq j} (α (\frac{D_{i j}}{v_{u a v}}) + β c_{i} + γ d_{i}) x_{i j} \leq c_{u a v}

(15)

C ρ T_{i} - (δ (\sum_{i \in V} \sum_{j \in V, i \neq j} ((\frac{D_{i j}}{v_{u a v}}) + C_{i}) x_{i j})) (e_{t} + e_{g}) \geq e_{m i n}, i \in V \ {0}

(16)

\sum_{i \in V} \sum_{j \in V, i \neq j} (\frac{D_{i j}}{v_{u a v}} + C_{i}) x_{i j} \leq D

(17)

u_{0} = 1, i \in V \ 0

(18)

2 \leq u_{i} \leq n, \forall i \in V \ 0

(19)

u_{j} \geq u_{i} + 1 - n (1 - x_{i j}), i \geq 0, j \geq 1, i \neq j, x_{i j} \in {0, 1}

(20)

\sum_{i^{'} = 1}^{{(u_{i^{'}} = j)}^{- 1}} [\frac{D_{i^{*} j^{*}}}{v_{u a v}} + C_{i^{*}} |_{i^{*} | u_{i^{*} = i^{'}, j^{*}} | u_{j^{*} = i^{'} + 1}}] \leq d_{j}, \forall j \in V \ 0, i^{*}, j^{*} \in V \ 0

(21)

Formulation (10) represents the objective function of the DOCEM problem, which minimizes the total energy consumption of the UAV. Constraints (11) and (12) ensure that each sensor in

V

(except the sink node) is visited once by the UAV. Equation (13) indicates that the UAV starts its journey from the sink to any other sensor j, and Equation (14) indicates the UAV returns to the sink node from sensor i. Inequality (15) is the UAV’s total energy consumption constraint that ensures it finishes data collection and recharging within its battery capacity,

c_{u a v}

, while (16) ensures that at the end of the current period, every sensor has minimum residual energy

e_{m i n}

after consuming energy for generating and transmitting data and then being recharged by the UAV. Constraint (17) is the delay for the UAV. It ensures that the time for one round trip, which includes travel and communication times, cannot exceed the given deadline, D. Expressions (18), (19) and (20) eliminate the sub-tour in the path of the UAV while visiting sensor nodes and avoiding obstacles. The deadline constraints in (21) ensure the UAV will visit sensor j before deadline

d_{j}

.

Since the traveling salesman problem, which is NP-hard, is time-reducible in a polynomial fashion to DOCEM, DOCEM is an NP-hard problem.

4. Algorithms: The DRL-Based UAV Detouring Framework

This section presents the deep reinforcement learning (DRL)-based UAV detouring algorithm designed to gather sensor data and recharge the sensors while minimizing the UAV’s overall energy consumption, satisfying the given constraints.

The DRL-based UAV detour algorithm’s DQN framework is illustrated in Figure 2. Initially, the sensor locations are fed into the Deep Q-learning algorithm (the DQN), which trains the agent for a predetermined number of episodes while optimizing the cumulative reward. Finally, it produces the optimal solution, i.e., the optimal UAV flight path. In other words, the DQN guides the algorithm towards a better solution. At each time step t, the agent interacts with a sample environment, observes the state,

S_{t}

, and takes an action,

a_{t}

, based on its policy, usually selecting the action that maximizes the Q-value:

a_{t} = \arg \max Q (S_{t}, a; θ)

. The transition tuple

(S_{t}, a_{t}, S_{t + 1}, r_{t})

consisting of the current state

S_{t}

, action

a_{t}

, next state

S_{t + 1}

, and reward

r_{t}

is stored in a memory pool (also called the replay buffer). A random sample of transitions (a minibatch) is drawn from the memory pool, which helps break the correlation between consecutive transitions and stabilizes training. The Q-Network, parameterized by

θ

, takes the current state

S_{t}

as input and outputs Q-values for all possible actions, with the action having the highest Q-value being selected. The Target Network, parameterized by

θ^{-}

, is a copy of the Q Network and is used to compute the target Q-value:

\max Q (S_{t + 1}, a_{t + 1}; θ^{-})

. Parameters

θ^{-}

are periodically updated with parameters

θ

from the Q-Network every t time steps. The Q Network is updated by minimizing the loss function, which measures the difference between the predicted Q-values and the target Q-values. The loss function can be expressed as follows:

L (θ) = E [{(r_{t} + γ \max Q (S_{t + 1}, a_{t + 1}; θ^{-}) - Q (S_{t}, a_{t}; θ))}^{2}]

(22)

where

γ

is the discount factor. The gradient of the loss function with respect to Q-Network parameters

θ

is computed and used to update

θ

through gradient descent. The agent can take actions that maximize cumulative rewards over time by repeating these procedures, which teach it to approximate the ideal Q-values. Here, the use of the target network and experience replay (the memory pool) helps stabilize and improve the training of the Q-Network.

In the remainder of this section, the environment design for the state, action, and reward is presented. Later, a detailed description of the DQN-based UAV detour strategy with Q-network training is presented.

4.1. The MDP Model

Reinforcement learning problems can, generally, be phrased as a Markov decision process in which the action is taken and the current state determines the upcoming state [48]. Consequently, we model the UAV detouring problem as an MDP characterized by the tuple

〈 S, A, R 〉

reflecting the interaction model between the UAV and the environment. At each time step t, the agent observes state

s \in S

, performs action

a \in A

, transitions to the next state

s^{'}

and receives a reward,

r \in R

. Figure 3 shows an MDP diagram of the DQN-based UAV detouring algorithm.

Reinforcement learning involves an agent executing sequences of actions to maximize its cumulative reward over time. In our scenario, the agent (the UAV) finds a tour that consumes less energy while avoiding obstacles and satisfying certain system constraints. The state of the agent includes the entire graph connecting all sensors, the sensors’ coordinates, as well as the history of sensors visited. The environment is fully observed and straightforward, allowing the agent to make deterministic moves along graph edges and to observe the distance traveled.

State and Action Spaces

In the proposed problem, the environment consists of a sink (which serves as both the starting and the ending sensor), the sensors (which the agent, i.e., the UAV, will visit only once), and the obstacles. The current state,

S_{t} = C_{t}

, is a partial solution constructed at time step t. Here,

C_{t}

includes all sensors visited, representing a sequence of nodes (or a sequence of previously selected actions) for time step t.

C_{0}

denotes the sink node, which is fixed at the start of the tour. The state space is represented as a sequence of states:

S = S_{(0,)}, S_{(1,)}, S_{(2,)}, S_{(3,)}, \dots, S_{t}

(23)

where each

S_{(t,)}

represents the state of the system at different time steps t.

Here, the environment is defined as the components involved in the problem, including the start and end points (the sink), the sensors the UAV will visit, and the obstacles. The agent is the entity performing actions within the environment (in this case, a UAV). The current state of the system at a given time step t is represented as a partial solution (

C_{t}

) that includes all the visited sensors up to that point. The initial state,

C_{0}

, is fixed at the sink node.

State Space (S)

It lists all possible states in the system, from the initial state

S_{(0)}

, to the state at time t, forming a sequence of states the agent encounters over time. If we consider the exemplary problem in Figure 1,

d_{1}, d_{2}, d_{3}, d_{4}

represent the deadlines for sensors

S_{1}, S_{2}, S_{3}, S_{4}

, respectively. An agent must visit each sensor by its respective deadline while the current state is built. According to the proposed DQN-based UAV detouring algorithm, an episode corresponds to one instance of the UAV sensor visiting tour (similar to a TSP tour), which includes all sensors without repetition. But obstacle avoidance and system constraints are considered, which makes the DOCEM problem different from the TSP problem. The state sequences, actions, and rewards at each step in an episode are as follows for the DOCEM problem, considering the scenario in Figure 1:

Step 1: $S_{0}$ ; action $a_{0}$ , is taken and estimated reward $r_{1}$ is obtained.
Step 2: $S_{0}, S_{1}$ ; action $a_{1}$ , is taken and estimated reward $r_{2}$ is obtained.
Step 3: $S_{0}, S_{1}, S_{2}$ ; action $a_{2}$ , is taken and estimated reward $r_{3}$ is obtained.
Step 4: $S_{0}, S_{1}, S_{2}, S_{3}$ ; action $a_{3}$ , is taken and estimated reward $r_{4}$ is obtained.
Step 5: $S_{0}, S_{1}, S_{2}, S_{3}, S_{4}$ ; cumulative reward R, is obtained after completing one iteration.

Action (A)

Action

a_{t} = v_{j}

indicates that

v_{j}

is the next visiting node at time step t. Here,

v_{j}

is an unvisited node that the agent is allowed to visit;

v_{j} \in V

, where

V

is the set of sensors. The action space is

A = {a_{1}, a_{2}, \dots, a_{t}}

(24)

State Transition

The next state

S_{(t + 1)} = C_{(t + 1)} = (C_{t}; v_{j})

, is obtained by transitioning from state

S_{t}

after choosing a sensor node for time step t. The symbol ‘;’ denotes concatenation, meaning the new solution will include previously visited sensors along with the newly added sensor,

v_{j}

, at step t. In Figure 4, the state transitions of the DQN-based UAV detouring algorithm are presented.

4.2. Immediate Reward

Immediate reward

r_{t}

is defined as follows:

r_{t} = \{\begin{matrix} \frac{1}{\sum_{t = 0}^{t + 1} (T (S_{t}, S_{(t + 1)}) + C (S_{(t + 1)})) + d (S_{(t + 1)})} & if all constraints are feasible \\ \frac{1}{\sum_{t = 0}^{t + 1} (T (S_{t}, S_{(t + 1)}) + C (S_{(t + 1)})) + d (S_{(t + 1)}) + P} & if any constraint is violated \end{matrix}

(25)

where

$T (S_{t}, S_{(t + 1)})$ denotes the UAV’s travel time from state $S_{t}$ to $S_{(t + 1)}$ .
$C (S_{(t + 1)})$ represents the UAV’s communication time (data collection and sensor recharging) at state $S_{(t + 1)}$ .
$d (S_{(t + 1)})$ is the deadline associated with state $S_{(t + 1)}$ .
P is a penalty for any constraint violations.

This reward design encourages the agent to minimize travel and communication times and to meet deadlines while avoiding penalties for constraint violations. This means DRL is set to obtain the maximum reward (the minimal reciprocal of the sum of UAV travel time, UAV communication time, and the given deadline for the sensor). The immediate reward is calculated differently based on whether all constraints are met or if any are violated. When any constraint is violated, a large penalty, P, is added to the denominator, significantly reducing the reward.

Figure 5 represents the reward design for the proposed DQN-based UAV detouring algorithm.

Immediate Reward Calculations

From Figure 5, calculation of the immediate reward,

r_{1}

, for step 1 in a state will be as follows:

r_{1} = \frac{1}{T (S_{0}, S_{1}) + C (S_{1}) + d (S_{1})}

(26)

when all system constraints are satisfied.

Similarly, the reward

r_{2}

for step 2 is expressed as

r_{2} = \frac{1}{\sum (T (S_{0}, S_{2}) + C (S_{2})) + d (S_{2})}

(27)

where the states represent the sequence of sensors, which is built with multiple steps based on immediate rewards.

4.3. DQN-Based UAV Detouring Algorithm

Algorithm 1 identifies a detour path that minimizes total energy consumption for a UAV. It begins by initializing the step counter t, the memory size, samples in the memory pool, and parameters for the Q-value and target networks. For each episode, the algorithm sets the initial state and iterates each step, where the UAV’s current state is updated, and an action is chosen based on a policy such as the

ϵ

-greedy strategy and derived from Q-value samples. Then, the chosen action is executed, and the next state is observed. After that, an immediate reward is computed using the specified equation, and new samples are stored in the memory pool. Subsequently, a random sample is drawn from the memory pool, and the Q-value for the next state is calculated using the target network. Gradient descent is then executed to update network parameters. Periodically, the target network parameters are reset to align with the Q-network parameters. After all episodes are completed, the model is tested, and the best result is saved. Ultimately, the algorithm returns the optimal detour TSP path and computes the total energy consumption of the tour by using the designated equation.

Algorithm 1 DQN-based UAV detouring Algorithm

Input: Set of sensors

V

Output: A detour traveling path of UAV with minimum total energy consumption

E_{u a v}

1:: $t \leftarrow 0$ ▹t: number of steps the agent has taken
2:: Initialize memory size
3:: Initialize samples $(S_{t}, a_{t}, S_{(t + 1)}, r_{t})$ in a memory pool
4:: Initialize the Q-value parameter $θ_{t}$ and target network parameter $θ_{t}^{-}$
5:: for $e \leftarrow 1$ to E do ▹E: Total number of episodes
6:: Current state $C_{0} \leftarrow S_{0}$ ▹ Initial state $S_{0}$
7:: for $t \leftarrow 1$ to $| V |$ do ▹ $| V |$ : Length of one episode
8:: Current state $C_{t} \leftarrow S_{t}$
9:: Choose an action $a_{t}$ in $S_{t}$ using policy derived equation from Q-value samples (e.g., $ϵ$ -greedy)
10:: Perform action $a_{t}$ in state $S_{t}$ and observe next state $S_{(t + 1)}$
11:: Next state $N_{t} \leftarrow S_{(t + 1)}$
12:: Calculate immediate reward $r_{t}$ using Equation (23)
13:: Store new samples $(S_{t}, a_{t}, S_{(t + 1)}, r_{t})$ in the memory pool
14:: Randomly select samples $(S_{k}, a_{k}, S_{(k + 1)}, r_{k})$ from the memory pool
15:: Calculate the Q-value of the next state using the target network
16:: Apply gradient descent with respect to the network parameter values $θ$ to minimize the squared difference between target Q-value and current Q-value using Equation (24)
17:: if periodically then
18:: Reset the target Q-network’s parameter values from the Q-network: $θ^{-} \leftarrow θ$
19:: end if
20:: Set $S_{t} \leftarrow S_{(t + 1)}$
21:: end for
22:: end for
23:: Test the model periodically and save the best result
24:: Return a complete detour TSP tour from the best-found result and calculate the total energy consumption $E_{u a v}$ of the tour using Equation (4)

5. Performance Evaluation

This section evaluates the proposed methods by employing simulation findings. First, we discuss the simulation setup and the other algorithms. The performance of the proposed algorithm is then compared in detail against the baseline algorithms in terms of the total energy consumption of the UAV (i.e., the value of the objective function) and the average arrival time of the UAV at each sensor. We vary the arrival deadlines at the sensors and the number of sensors, obstacles, and iterations for the proposed DQN-based algorithm.

5.1. Experimental Settings

The simulations were run over a 7 km × 7 km area, with the sink node at (0, 0) and sensors and obstacles arranged at random. From an established pool, obstacles are chosen at random and placed to avoid overriding one another. We take into account barriers with sizes ranging from 1 to 2 km² [4]. A UAV gathers and recharges each sensor in a duration of two seconds [4]. The total energy used by the UAV in the DOCEM problem is the sum of the energy used for data downloads, traveling to the sensors, and sensor node recharging. For the sake of simplicity, we assume the UAV, traveling at 55 km

h^{- 1}

(fits well for industrial or commercial Drones based on their operations), completes the data collection and charging tour in less than 25 min. Table 3 displays the parameters of the experiment.

Simulations were performed in a Python environment with an Intel Core i9-10900 CPU clocked at 2.8 GHz and with 64 GB of RAM, manufactured by Micro-Star International Co., Ltd., Taipei city, Taiwan. To get results that are fair, five different seed values were used in the simulations and the results averaged.

The DQN-based UAV detour algorithm simulation parameters are shown in Table 4.

5.2. Description of the Compared Algorithms

The proposed DQN-based UAV detouring algorithm has been compared to some existing baseline algorithms, detailed below.

5.2.1. DQN-Based UAV Detouring Algorithm with a Negative Reward

In this algorithm, a negative reward is developed to find the possible shortest path for a UAV with the minimum total energy consumption between two sensors when there are obstacles between them. By using the DQN-based UAV detouring algorithm, the UAV’s trajectory is determined.

The agent’s states and actions are similar to the proposed DQN-based UAV detouring algorithm, only a new negative reward is proposed because most of the existing DQN-based (TSP-related) work follows the negative reward discussed in the literature review.

Proposed negative reward:

The immediate reward is

r_{t} = - ω E (S_{t}, S_{t + 1}) - (1 - ω) P

. If

E (S_{t}, S_{t + 1}) \geq P

and all constraints are satisfied, then

ω = 1

; otherwise,

ω = 0 .

Besides,

E (S_{t}, S_{t + 1})

is the total UAV energy consumption between states

S_{t}

and

S_{t + 1},

and P is a constraint missing penalty (a big number).

5.2.2. The Greedy Algorithm

The Greedy algorithm is designed according to the mentioned fitness function:

F (i, j) = \{\begin{matrix} \frac{1}{(T_{i j} + C_{i} + d_{j})}, & if all constraints are feasible \\ \frac{1}{(T_{i j} + C_{i} + d_{j})} + P, & if any constraint is violated \end{matrix}

Here,

T_{i j}

is the UAV traveling time from sensor i to sensor j, and

d_{j}

indicates the deadline for sensor j.

5.2.3. The Genetic Algorithm (GA)

The GA [29] uses natural selection as the inspiration for optimization, allowing a population of potential solutions to develop over many generations. In order to enhance the population, the fitness function assesses each candidate’s quality and directs crossover, mutation, and selection. In the GA, while calculating the fitness value for each chromosome, the fitness value

F_{n}

is

(\sum_{i \in V} \sum_{j \in V, i \neq j} \frac{1}{(T_{i j} + C_{i} + d_{j})}) x_{i j}

if all constraints are feasible.

When any constraint is violated, new chromosomes are generated using the crossover and mutation operators with a mutation and crossover probability. Here,

T_{i j}

is the UAV travel time from sensor i to sensor j.

C_{i}

is the communication time (data collection and sensor recharging) at sensor i, and

d_{j}

indicates the deadline for sensor j.

5.2.4. Ant Colony Optimization (ACO) Algorithm

In the nature-inspired method known as ant colony optimization (ACO) [50], artificial ants construct solutions by following paths determined by pheromone trails, maximizing a fitness function that assesses the quality of the solutions. The pheromone update is guided by the fitness function, which reinforces better paths until they converge on optimal or nearly optimal solutions. For the ACO algorithm, while updating the pheromone for the visited sensors, the fitness function from the GA is used:

(\sum_{i \in V} \sum_{j \in V, i \neq j} \frac{1}{(T_{i j} + C_{i} + d_{j})}) x_{i j}

This is performed until a number of iterations is reached, ensuring the constraints remain feasible.

5.3. Performance Comparison Between the Proposed DQN-Based UAV Detouring Algorithm with a Negative Reward

Figure 6, Figure 7 and Figure 8 show the learning curves for total energy consumption (the convergence process) of different DQNs with increasing numbers of episodes for 10 sensors and six obstacles, 30 sensors and 6 obstacles, and 10 sensors and 10 obstacles, respectively. Here, an episode represents one instance of a TSP tour that includes all the sensors. As training progresses, epsilon decreases, leading to increasingly better decision-making by the algorithm variants. From Figure 6, Figure 7 and Figure 8, we can see that the total energy consumption of the UAV obtained by the algorithm variants decreases as the number of episodes increases. This is because the agent spends more time exploring its surroundings at first, and as it gains decision-making skills, it eventually learns to make choices that maximize the total aggregated reward, resulting in a state with a lower total cost. Among them, the proposed DQN-based UAV detouring algorithm obtains the lowest total energy consumption for the UAV in all cases, which can be seen in Figure 6a,b, Figure 7a,b and Figure 8a,b. The training curve eventually flattens out, signifying that the proposed DQN-based UAV detouring algorithm has found a solution or a set of nearly optimal solutions. At this stage, further training does not yield significant improvements, and the model consistently identifies paths that are as short as possible. The slope may fluctuate slightly as the program balances between exploitation and exploration. Overall, the trend indicates a decline in the overall route cost, demonstrating the DQN’s efficacy in solving the TSP. Additionally, the trend shows a decrease in the total energy consumption of the UAV, highlighting the effectiveness of the proposed DQN-based UAV detouring algorithm in addressing the DOCEM problem.

On the other hand, the performance of the proposed DQN-based UAV detouring algorithm was compared with the other DQN-based UAV detouring algorithm with negative reward when increasing the number of episodes, as shown in Figure 9a–c. The results were averaged from five simulations with different numbers of sensors and obstacle positions. In all cases, the proposed algorithm performed better.

5.4. Training Efficiency of the DQN-Based Detouring Algorithm

Figure 10 shows the training curve of the proposed DQN-based UAV detouring algorithm when there are 30 sensors and six obstacles. We can see that the rewards gradually increased from the start as the number of training episodes increased. The overall reward grows noticeably after the 4000th episode from the 7000 in total. This happens because the agent uses probability to randomly explore new positions at first, which might not produce the best results. Initially, there are fewer workable methods for experience replay memory. But as the neural network starts to learn, experience replay memory accumulates more answers that are practical. UAVs can, therefore, carry out tasks repeatedly and learn from errors to maximize rewards, which eventually results in an improved system goal. The curve is continuously rising despite variations and lack of convergence.

5.5. Performance Comparison Between the DQN-Based UAV Detouring Algorithm and Other Baseline Algorithms

In this subsection, Figure 11a–c compares the total energy consumption of a UAV for the proposed DQN-based UAV detouring algorithm—under varying conditions(the number of sensors, the number of obstacles, and arrival deadlines) comparing the proposed algorithm to the three baseline algorithms—Greedy, ACO, and GA. In the top-left Figure 11a shows the energy consumption increases as the number of sensors rises, with the Greedy algorithm consuming significantly more energy than the other approaches. The DQN-based algorithm consistently performed the best, demonstrating superior energy efficiency, where the energy gaps between the sensors are 90.003%, 50.26%, 31.86%, 27.87%, and 17.47%. The top-right Figure 11b shows that as the number of obstacles increased, energy consumption grew for all algorithms. Again, the DQN-based UAV detouring algorithm outperformed the others, while the Greedy algorithm again exhibits the highest energy consumption. In the bottom Figure 11c, energy consumption decreased as the arrival deadlines became more relaxed, with the DQN-based UAV detouring algorithm consistently showing the lowest energy usage and the values are 33,465.21 Joules, 29,671.78 Joules, 26,793.01 Joules, 25,971.68 Joules, 23,081.79 Joules for arrival deadlines at [5–10], [10–15], [15–20], [20–25], and [25–30], respectively. These results emphasize the advantages of the DQN-based UAV detouring algorithm in reducing UAV energy consumption, especially in complex environments with multiple obstacles and stringent deadlines.

Figure 12 illustrates the average arrival time of the UAV at each sensor when the DQN-based UAV detouring algorithm performance is compared with the Greedy, ACO, and GA algorithms. Figure 12a shows that the arrival time increased as the number of sensors rose, and the Greedy algorithm with the highest average arrival time, followed by the ACO and GA algorithms. The DQN-based algorithm performed the best, maintaining the lowest arrival times across different sensor counts and the values are 18.64 s, 35.12 s, 47.79 s, 58.78 s, 89.02 s, and 145.90 s for 5, 10, 15, 20, 25, and 30 sensors, respectively. Figure 12b highlights the effect of increasing the number of obstacles, where all algorithms experienced increased arrival times with more obstacles. Again, the DQN-based algorithm consistently outperformed the others in maintaining earlier arrivals. The bottom Figure 12c shows the impact of changes in arrival deadlines, with all algorithms demonstrating decreasing arrival times as the deadlines became more relaxed. The DQN-based algorithm maintained the lowest average arrival times, emphasizing its efficiency across different scenarios compared to the traditional approaches.

6. Concluding Remarks

In this research, we addressed a dual problem (i.e., DOCEM of data collection and recharging of WSN sensor nodes) when using an unmanned aerial vehicle. We considered the UAV’s flight path as it detoured around obstacles in a constrained environment. Constraints on the UAV were on arrival deadlines and energy consumption; constraints on the sensors were on deadlines and residual energy, which was formulated as an integer linear programming problem. To address this, we employed a deep Q-network (DQN)-based UAV detour algorithm and a reinforcement learning approach to find near-optimal solutions in a reasonable time frame with minimum total energy consumption of the UAV. The results from the proposed DQN-based solution were compared with other state-of-the-art algorithms: the GA, ACO, and a Greedy algorithm. The comparison showed that the DQN-based UAV detouring algorithm consistently outperformed the alternatives in key performance metrics, such as total energy consumption of the UAV, average arrival time when varying the number of sensors and obstacles, along with multiple deadlines for arriving at the sensors.

For future work, we plan to explore several enhancements of the DQN algorithm, including solutions for multiple UAVs operating in the environment, integrating the DQN with other machine learning techniques, and conducting real-world testing to validate and refine the proposed solutions. In the case of multiple UAVs, there can be some challenges, such as the optimization problem being changed, and the objective function may be changed. Also, paths between UAVs may collide; to solve this problem, further exploration may be needed. Furthermore, if we want to solve the problem using a DRL-based solution, then the action space, state space, and variables can be increased, which will make the system more complex to solve.

Author Contributions

Conceptualization, S.R., S.A. and S.Y.; methodology, S.R.; software, S.R.; validation, S.R.; formal analysis, S.R., S.A. and S.Y.; investigation, S.R., S.A. and S.Y.; resources, S.R.; data curation, S.R.; writing—original draft preparation, S.R.; writing—review and editing, S.R., S.A. and S.Y.; visualization, S.R., S.A. and S.Y.; supervision, S.Y.; project administration, S.Y.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant 2021R1I1A3051364.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Gao, H.; Feng, J.; Xiao, Y.; Zhang, B.; Wang, W. A uav-assisted multi-task allocation method for mobile crowd sensing. IEEE Trans. Mob. Comput. 2022, 22, 3790–3804. [Google Scholar] [CrossRef]
Zhan, C.; Zeng, Y.; Zhang, R. Energy-efficient data collection in uav enabled wireless sensor network. IEEE Wirel. Commun. Lett. 2017, 7, 328–331. [Google Scholar] [CrossRef]
Messaoudi, K.; Oubbati, O.S.; Rachedi, A.; Lakas, A.; Bendouma, T.; Chaib, N. A survey of uav-based data collection: Challenges, solutions and future perspectives. J. Netw. Comput. Appl. 2023, 216, 103670. [Google Scholar] [CrossRef]
Rahman, S.; Akter, S.; Yoon, S. Oadc: An obstacle-avoidance data collection scheme using multiple unmanned aerial vehicles. Appl. Sci. 2022, 12, 11509. [Google Scholar] [CrossRef]
Wu, P.; Xiao, F.; Sha, C.; Huang, H.; Sun, L. Trajectory optimization for UAVs’ efficient charging in wireless rechargeable sensor networks. IEEE Trans. Veh. Technol. 2020, 69, 4207–4220. [Google Scholar] [CrossRef]
Baek, J.; Han, S.I.; Han, Y. Optimal uav route in wireless charging sensor networks. IEEE Internet Things J. 2019, 7, 1327–1335. [Google Scholar] [CrossRef]
Rahman, S.; Akter, S.; Yoon, S. Energy-efficient charging of sensors for uav-aided wireless sensor network. Int. J. Internet Broadcast. Commun. 2022, 14, 80–87. [Google Scholar]
Nazib, R.A.; Moh, S. Energy-efficient and fast data collection in uav-aided wireless sensor networks for hilly terrains. IEEE Access 2021, 9, 23168–23190. [Google Scholar] [CrossRef]
Say, S.; Inata, H.; Liu, J.; Shimamoto, S. Priority-based data gathering framework in uav-assisted wireless sensor networks. IEEE Sensors J. 2016, 16, 5785–5794. [Google Scholar] [CrossRef]
Samir, M.; Sharafeddine, S.; Assi, C.M.; Nguyen, T.M.; Ghrayeb, A. Uav trajectory planning for data collection from time-constrained iot devices. IEEE Trans. Wirel. Commun. 2019, 19, 34–46. [Google Scholar] [CrossRef]
Ebrahimi, D.; Sharafeddine, S.; Ho, P.-H.; Assi, C. Uav-aided projection-based compressive data gathering in wireless sensor networks. IEEE Internet Things J. 2018, 6, 1893–1905. [Google Scholar] [CrossRef]
Dong, M.; Ota, K.; Lin, M.; Tang, Z.; Du, S.; Zhu, H. Uav-assisted data gathering in wireless sensor networks. J. Supercomput. 2014, 70, 1142–1155. [Google Scholar] [CrossRef]
Cao, H.; Liu, Y.; Yue, X.; Zhu, W. Cloud-assisted uav data collection for multiple emerging events in distributed wsns. Sensors 2017, 17, 1818. [Google Scholar] [CrossRef]
Alfattani, S.; Jaafar, W.; Yanikomeroglu, H.; Yongacoglu, A. Multi-uav data collection framework for wireless sensor networks. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [Google Scholar]
Binol, H.; Bulut, E.; Akkaya, K.; Guvenc, I. Time optimal multi-uav path planning for gathering its data from roadside units. In Proceedings of the 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), Chicago, IL, USA, 27–30 August 2018; pp. 1–5. [Google Scholar]
Raj, P.P.; Khedr, A.M.; Aghbari, Z.A. Edgo: Uav-based effective data gathering scheme for wireless sensor networks with obstacles. Wirel. Netw. 2022, 28, 2499–2518. [Google Scholar]
Wang, X.; Gursoy, M.C.; Erpek, T.; Sagduyu, Y.E. Learning-based uav path planning for data collection with integrated collision avoidance. IEEE Internet Things J. 2022, 9, 16663–16676. [Google Scholar] [CrossRef]
Poudel, S.; Moh, S. Hybrid path planning for efficient data collection in uav-aided wsns for emergency applications. Sensors 2021, 21, 2839. [Google Scholar] [CrossRef] [PubMed]
Ghdiri, O.; Jaafar, W.; Alfattani, S.; Abderrazak, J.B.; Yanikomeroglu, H. Offline and online uav-enabled data collection in time-constrained iot networks. IEEE Trans. Green Commun. Netw. 2021, 5, 1918–1933. [Google Scholar] [CrossRef]
Bouhamed, O.; Ghazzai, H.; Besbes, H.; Massoud, Y. A uav-assisted data collection for wireless sensor networks: Autonomous navigation and scheduling. IEEE Access 2020, 8, 110446–110460. [Google Scholar] [CrossRef]
Wu, Y.; Song, W.; Cao, Z.; Zhang, J.; Lim, A. Learning improvement heuristics for solving routing problems. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 5057–5069. [Google Scholar] [CrossRef] [PubMed]
Bogyrbayeva, A.; Yoon, T.; Ko, H.; Lim, S.; Yun, H.; Kwon, C. A deep reinforcement learning approach for solving the traveling salesman problem with drone. Transp. Res. Part C Emerg. Technol. 2023, 148, 103981. [Google Scholar] [CrossRef]
Kim, M.; Park, J. Learning collaborative policies to solve np-hard routing problems. Adv. Neural Inf. Process. Syst. 2021, 34, 10418–10430. [Google Scholar]
Deudon, M.; Cournut, P.; Lacoste, A.; Adulyasak, Y.; Rousseau, L.-M. Learning heuristics for the tsp by policy gradient. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research: 15th International Conference, CPAIOR 2018, Delft, The Netherlands, 26–29 June 2018; Proceedings 15; Springer: Cham, Switzerland, 2018; pp. 170–181. [Google Scholar]
Pan, X.; Jin, Y.; Ding, Y.; Feng, M.; Zhao, L.; Song, L.; Bian, J. H-tsp: Hierarchically solving the large-scale traveling salesman problem. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 9345–9353. [Google Scholar]
Malazgirt, G.A.; Unsal, O.S.; Kestelman, A.C. Tauriel: Targeting traveling salesman problem with a deep reinforcement learning inspired architecture. arXiv 2019, arXiv:1905.05567. [Google Scholar]
Sui, J.; Ding, S.; Liu, R.; Xu, L.; Bu, D. Learning 3-opt heuristics for traveling salesman problem via deep reinforcement learning. In Proceedings of the Asian Conference on Machine Learning, PMLR, Virtual, 17–19 November 2021; pp. 1301–1316. [Google Scholar]
Costa, P.R.d.O.; Rhuggenaath, J.; Zhang, Y.; Akcay, A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In Proceedings of the Asian Conference on Machine Learning, PMLR, Bangkok, Thailand, 18–20 November 2020; pp. 465–480. [Google Scholar]
Alharbi, M.G.; Stohy, A.; Elhenawy, M.; Masoud, M.; Khalifa, H.A.E.-W. Solving pickup and drop-off problem using hybrid pointer networks with deep reinforcement learning. PLoS ONE 2022, 17, e0267199. [Google Scholar] [CrossRef] [PubMed]
Zhang, R.; Prokhorchuk, A.; Dauwels, J. Deep reinforcement learning for traveling salesman problem with time windows and rejections. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
Zhu, B.; Bedeer, E.; Nguyen, H.H.; Barton, R.; Henry, J. Uav trajectory planning in wireless sensor networks for energy consumption minimization by deep reinforcement learning. IEEE Trans. Veh. Technol. 2021, 70, 9540–9554. [Google Scholar] [CrossRef]
Mao, X.; Wu, G.; Fan, M.; Cao, Z.; Pedrycz, W. Dl-drl: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-uav. IEEE Trans. Autom. Sci. Eng. 2024. [Google Scholar] [CrossRef]
Zhang, S.; Li, Y.; Dong, Q. Autonomous navigation of uav in multi-obstacle environments based on a deep reinforcement learning approach. Appl. Soft Comput. 2022, 115, 108194. [Google Scholar] [CrossRef]
Su, X.; Ren, Y.; Cai, Z.; Liang, Y.; Guo, L. A q-learning based routing approach for energy efficient information transmission in wireless sensor network. IEEE Trans. Netw. Serv. Manag. 2022, 20, 1949–1961. [Google Scholar] [CrossRef]
Liu, Y.; Yan, J.; Zhao, X. Deep-reinforcement-learning-based optimal transmission policies for opportunistic uav-aided wireless sensor network. IEEE Internet Things J. 2022, 9, 13823–13836. [Google Scholar] [CrossRef]
Emami, Y.; Wei, B.; Li, K.; Ni, W.; Tovar, E. Joint communication scheduling and velocity control in multi-uav-assisted sensor networks: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2021, 70, 10986–10998. [Google Scholar] [CrossRef]
Luo, X.; Chen, C.; Zeng, C.; Li, C.; Xu, J.; Gong, S. Deep reinforcement learning for joint trajectory planning, transmission scheduling, and access control in uav-assisted wireless sensor networks. Sensors 2023, 23, 4691. [Google Scholar] [CrossRef]
Yi, M.; Wang, X.; Liu, J.; Zhang, Y.; Bai, B. Deep reinforcement learning for fresh data collection in uav-assisted iot networks. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; pp. 716–721. [Google Scholar]
Li, K.; Ni, W.; Tovar, E.; Guizani, M. Deep reinforcement learning for real-time trajectory planning in uav networks. In Proceedings of the 2020 International Wireless Communications and Mobile Computing (IWCMC), Limassol, Cyprus, 15–19 June 2020; pp. 958–963. [Google Scholar]
Zhang, N.; Liu, J.; Xie, L.; Tong, P. A deep reinforcement learning approach to energy-harvesting uav-aided data collection. In Proceedings of the 2020 International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 21–23 October 2020; pp. 93–98. [Google Scholar]
Oubbati, O.S.; Lakas, A.; Guizani, M. Multiagent deep reinforcement learning for wireless-powered uav networks. IEEE Internet Things J. 2022, 9, 16044–16059. [Google Scholar] [CrossRef]
Li, K.; Ni, W.; Dressler, F. Lstm-characterized deep reinforcement learning for continuous flight control and resource allocation in uav-assisted sensor network. IEEE Internet Things J. 2021, 9, 4179–4189. [Google Scholar] [CrossRef]
Hu, Y.; Liu, Y.; Kaushik, A.; Masouros, C.; Thompson, J.S. Timely data collection for uav-based iot networks: A deep reinforcement learning approach. IEEE Sens. J. 2023, 23, 12295–12308. [Google Scholar] [CrossRef]
Emami, Y.; Wei, B.; Li, K.; Ni, W.; Tovar, E. Deep q-networks for aerial data collection in multi-uav-assisted wireless sensor networks. In Proceedings of the 2021 International Wireless Communications and Mobile Computing (IWCMC), Harbin, China, 28 June–2 July 2021; pp. 669–674. [Google Scholar]
Guo, Z.; Chen, H.; Li, S. Deep reinforcement learning-based uav path planning for energy-efficient multitier cooperative computing in wireless sensor networks. J. Sens. 2023, 2023, 2804943. [Google Scholar] [CrossRef]
Kurs, A.; Karalis, A.; Moffatt, R.; Joannopoulos, J.D.; Fisher, P.; Soljacic, M. Wireless power transfer via strongly coupled magnetic resonances. Science 2007, 317, 83–86. [Google Scholar] [CrossRef] [PubMed]
Liu, H.; Chu, X.; Leung, Y.-W.; Du, R. Minimum-cost sensor placement for required lifetime in wireless sensor-target surveillance networks. IEEE Trans. Parallel Distrib. Syst. 2012, 24, 1783–1796. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Akter, S.; Dao, T.-N.; Yoon, S. Time-constrained task allocation and worker routing in mobile crowd-sensing using a decomposition technique and deep q-learning. IEEE Access 2021, 9, 95808–95822. [Google Scholar] [CrossRef]
Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39. [Google Scholar] [CrossRef]

Figure 1. An exemplary problem scenario [1].

Figure 2. DRL-based UAV detouring Framework.

Figure 3. MDP diagram of the DQN-based UAV detouring algorithm.

Figure 4. State transitions of the DQN-based UAV detouring algorithm.

Figure 5. The reward design for the proposed DQN-based UAV detouring algorithm.

Figure 6. Total energy consumption learning curves of different DQN along with the increasing numbers of episodes for 10 sensors and 6 obstacles by (a) the proposed DQN-based UAV detouring algorithm; (b) the DQN-based UAV detouring algorithm with negative reward.

Figure 7. Total energy consumption learning curves of different DQN along with the increasing numbers of episodes for 30 sensors and 6 obstacles by (a) the proposed DQN-based UAV detouring algorithm; (b) the DQN-based UAV detouring algorithm with negative reward.

Figure 8. Total energy consumption learning curves of different DQN along with the increasing numbers of episodes for 10 sensors and 10 obstacles by (a) the proposed DQN-based UAV detouring algorithm; (b) the DQN-based UAV detouring algorithm with negative reward.

Figure 9. Total energy consumption learning curves for different DQN along with the increasing number of episodes for (a) 10 sensors and 6 obstacles, (b) 30 sensors and 6 obstacles, and (c) 10 sensors, 10 obstacles.

Figure 10. Training curve of the proposed a-DQN-based UAV detouring algorithm.

Figure 11. Total energy consumption of the UAV (J) for (a) the number of sensors, (b) the number of obstacles, and (c) the arrival deadlines of sensors in minutes.

Figure 12. Average arrival times of the UAV at each sensor (s) for (a) the number of sensors; (b) the number of obstacles; (c) the arrival deadlines of sensors in minutes.

Table 1. Comparison of proposed work with existing literature.

Study Reference	Proposed Algorithm	Original TSP Problem	Considered Problem	Objectives	Considered Constraints	Obstacles Existence
proposed work	proposed DQN-based UAV detouring algorithm	increment of TSP tour as obstacles have been considered in the TSP tour	proposed DOCEM problem	minimize total energy consumption of UAV	UAV’s travel deadline and energy constraints, sensors’ individual deadlines and residual energy constraints	Yes
[21]	deep reinforcement learning	Yes	routing problems	improvement heuristics for routing problems	No	No
[33]	deep reinforcement learning	Yes	autonomous navigation of UAV	reduce the reality gap and analyze UAV behaviors	No	Yes
[34]	Q learning	Yes	transmission Routing	reduce and balance the energy consumption of wireless sensors	No	No
[42]	deep reinforcement learning	No	Continuous Flight Control and Resource Allocation	minimize the overall data loss of the ground devices	No	No
[43]	deep reinforcement learning	No	Efficient data collection	maximizing the number of collected nodes	No	No
[44]	deep Q-Network	No	Optimal ground sensor selection at UAVs	minimize data packet losses	No	No

Table 2. List of notation.

Notation	Definition
$n$	number of sensors
$o$	number of obstacles
$V$	set of sensors, $V = {0, 1, \dots, n}$
$O$	set of obstacles, $O = {0, 1, \dots, o}$
$c_{u a v}$	UAV battery capacity
$C_{i}$	UAV communication time
$d_{j}$	Arrival deadline for UAV at each sensor j
$ρ$	Energy transfer efficiency when UAV recharges a sensor
$v_{u a v}$	UAV’s flight speed
$P_{m a x}$	UAV’s power level at full speed
$e_{g}$	Energy consumption for generating one bit of data
$e_{t}$	Energy usage per bit transmitted
$β$	Energy transfer rate from UAV to a sensor while collecting data
$γ$	Energy transfer rate from UAV to sensor while recharging it
$δ$	Data generation rate of the sensor
D	Deadline for data delivery to the sink for one UAV tour
$R$	a sensor–obstacle relationship matrix to define the relationships between sensors $(i, j)$ and the obstacles between them

Table 3. Experimental parameters (underlined values are default values).

Parameters	Values
Area size $({km}^{2})$	$7 \times 7$
Number of sensors, n	5, 10, 15, 20, 25, 30
Number of obstacles, s	$2, 4, \underset{̲}{6}, 8, 10$
Velocity of the UAV, $v_{u a v} (km / h)$	55
Deadline for data delivery to the sink for one tour, $D (m)$	25 [4]
UAV’s power level at full speed, $P_{m a x} (watts)$	5 [31]
UAV’s power level when it hovers, $P_{i d l e} (watts)$	0 [31]
UAV communication time, $C_{i} (s)$	2 [4]
UAV battery capacity, $C_{u a v} (joules)$	50 × 10⁴ [49]
UAV Arrival deadlines, $d_{j} (minutes)$	$[5 - 10], [10 - 15], \underset{̲}{[15 - 20]}, [20 - 25], [25 - 30]$
Energy transfer efficiency, $ρ$	40% [46]
Energy consumption to generate one bit of data, $e_{g} (nJ / bit)$	50 [47]
Energy usage per bit of transmission electronics, $e_{t} (nJ / bit)$	50 [47]
Energy transfer rate from UAV to sensor i while collecting data from sensor i, $β (J / s)$	5
Energy usage per bit of transmission electronics, $e_{t} (nJ / bit)$	5
Energy transfer rate from UAV to sensor i while charging, $γ (J / s)$	10
Data generation rate of sensor, $δ (bits / s)$	0.5 [47]
Size of the data	64 bytes or 512 bits [47]

Table 4. DQN-based UAV detouring algorithm parameters.

Parameters	Values
Total episodes	7000
Number of steps for Q-learning	3
Learning rate	$0.005$
Discount factor	$0.9$
Learning decay rate	0.00001–2
Epsilon min	0.1
Epsilon decay rate	0 0.0006
Memory buffer	10,000
Batch size	16
Optimizer	Adam

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Rahman, S.; Akter, S.; Yoon, S. A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment. Electronics 2025, 14, 1. https://doi.org/10.3390/electronics14010001

AMA Style

Rahman S, Akter S, Yoon S. A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment. Electronics. 2025; 14(1):1. https://doi.org/10.3390/electronics14010001

Chicago/Turabian Style

Rahman, Shakila, Shathee Akter, and Seokhoon Yoon. 2025. "A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment" Electronics 14, no. 1: 1. https://doi.org/10.3390/electronics14010001

APA Style

Rahman, S., Akter, S., & Yoon, S. (2025). A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment. Electronics, 14(1), 1. https://doi.org/10.3390/electronics14010001

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Deep Q-Learning Based UAV Detouring Algorithm in a Constrained Wireless Sensor Network Environment

Abstract

1. Introduction

2. Related Works

3. System Model and Problem Formulation

4. Algorithms: The DRL-Based UAV Detouring Framework

4.1. The MDP Model

4.2. Immediate Reward

4.3. DQN-Based UAV Detouring Algorithm

5. Performance Evaluation

5.1. Experimental Settings

5.2. Description of the Compared Algorithms

5.2.1. DQN-Based UAV Detouring Algorithm with a Negative Reward

5.2.2. The Greedy Algorithm

5.2.3. The Genetic Algorithm (GA)

5.2.4. Ant Colony Optimization (ACO) Algorithm

5.3. Performance Comparison Between the Proposed DQN-Based UAV Detouring Algorithm with a Negative Reward

5.4. Training Efficiency of the DQN-Based Detouring Algorithm

5.5. Performance Comparison Between the DQN-Based UAV Detouring Algorithm and Other Baseline Algorithms

6. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI