1. Introduction
Natural disasters are undesirable, unpredictable, and uncontrollable events that can pose widespread hazards of varying severity [1,2,3]. A damaging disaster typically occurs at random times in specific locations, and a high-intensity event can cause dramatic damage to the natural environment. Because of this randomness, precisely detecting the onset of a disaster is frequently impossible. With current technology, researchers therefore strive to monitor disasters in real time [4,5] in order to mitigate their negative impacts and guide search and rescue (SAR) activities. Real-time monitoring of disaster zones is usually carried out with constrained infrastructure resources, since technological infrastructure is often damaged during the event itself. Designing an efficient monitoring system therefore plays a pivotal role in supplying the observations needed to assess the requirements of a disaster area [6,7,8].
Wildfires have recently become a critical natural threat, causing significant damage in forest zones [9]. Rising global temperatures and human activities have increased the scale of forest fires, making simple monitoring approaches impractical for extracting the required information from strategic parts of an area. Much recent research on wildfire monitoring and early detection has therefore focused on wireless sensor networks (WSNs) [10]. A WSN consists of a number of resource-limited nodes that cooperate on a common task such as the timely detection of wildfires [11]. Given the vast and harsh nature of forest areas, full coverage requires deploying a very large number of nodes, which raises both cost and self-organization issues. Manned aerial vehicles, on the other hand, can provide complete and up-to-date information, but only at considerable cost and risk.
Recent technological innovations have made unmanned aerial vehicles (UAVs) increasingly popular [12,13,14,15,16]. A key capability of UAVs is the collection of real-time images of remote zones, which makes them a feasible and flexible solution for wildfire monitoring [17]. UAVs can fly at low altitudes, strengthening disaster assessment with higher-quality real-time data. Deploying a fleet of UAVs can therefore accomplish a full survey of a disaster area of interest [18], enabling fast and accurate detection of an emerging wildfire. The use of multiple UAVs, however, introduces new challenges, particularly the need for UAVs to collaborate so that the potential of the fleet is fully exploited. One prevalent limitation is the restricted energy budget of each UAV when surveying a broad disaster area. Another is the limited communication range, which often impedes full-coverage tasks and necessitates the cooperative operation of multiple UAVs.
To achieve maximum coverage under the aforementioned constraints, many recent studies on managing a team of UAVs concentrate on trajectory planning for UAV swarms. This article builds on a deep reinforcement learning (DRL)-based trajectory planning approach that treats each UAV as an agent to enhance monitoring efficiency [19,20]. The proposed idea formulates trajectory planning specifically for cooperative and homogeneous UAVs, enabling them to observe the area and collect data more efficiently. The proposed DRL algorithm for wildfire reconnaissance assumes no prior knowledge of the environment. A typical DRL approach defines a numerical reward function for a complex environment and lets each agent learn an optimal policy by interacting with the environment in a trial-and-error fashion [21]. Building on this, the proposed approach allows each UAV to autonomously learn a trajectory that maximizes the cumulative reward while maintaining time-sensitive data collection.
Another important element of the proposed idea, which makes the patrol task of the UAVs more tractable, is to divide the target disaster area into identical subareas, or grids. This grid-based framework converts the search for an optimum trajectory into the problem of having UAVs visit the grids in a periodic cycle. We then generate a forest fire risk map that forecasts the likelihood of wildfire in each grid. Since the probability of fire activity may differ across parts of a forest, this map indicates the fire risk level of each grid separately. The rationale of the proposed path planning approach is to create routes that support fire prevention by letting UAVs visit high-risk grids sufficiently often. The risk levels depend mostly on regional and human-made factors such as topographic conditions and unattended campfires.
The specific contributions of this article are as follows.
We propose a new reinforcement learning-based trajectory planning algorithm for multi-UAV scenarios in realistic environments, aimed at the early detection of wildfires.
The target area is explicitly split into equal-size sub-areas, and each sub-area is assigned a risk level through a risk mapping strategy. The proposed framework thereby encourages the UAVs to visit high-risk sub-areas more frequently.
A double deep reinforcement learning-based multi-agent trajectory approach is designed for multiple UAVs, providing efficient coverage of forest areas with high fire risk.
Extensive performance evaluations have been conducted under varying practical scenarios to verify the capabilities of the proposed approach.
The remainder of the paper is organized as follows. Section 2 reviews recent research on RL-based trajectory design for multiple UAVs. Section 3 describes the essential parts of the proposed model. Section 4 elaborates the underlying points of the proposed multi-UAV DDQN trajectory approach. Section 5 presents the performance results that illustrate the working principles of the proposed approach. Section 6 concludes the paper with the key findings.
2. Related Work
The purpose of this section is to review the existing literature relevant to the scope of this article, with a particular focus on trajectory design for multi-UAV applications. A recent study proposed a deep reinforcement learning-based path-planning approach that additionally optimizes the transmission power level [22]. Jointly optimizing the trajectory and transmission power of UAVs yields an optimal solution that maximizes the long-term network utility. The complete problem was formulated as a stochastic game with respect to the network dynamics and the actions of the UAVs. To alleviate the computational complexity caused by the large action and state spaces, the proposed method employs a deterministic policy gradient technique. Extensive simulations confirm the performance superiority of the proposed method over existing optimization methods, indicating that system capacity and network utility can be enhanced by at least 15%.
Data collection from remote zones via multiple UAVs has also been an attractive research direction. An autonomous exploration of the most suitable routes for UAVs in smart farming areas with reinforcement learning was studied thoroughly in [23]. The work implements a location- and energy-aware Q-learning algorithm to arrange UAV paths so as to reduce power consumption, delay, and flight duration while increasing the amount of collected data. The farm zone of interest is partitioned into grids, and each grid is labeled with respect to its geographical features. This mapping assigns a target point to each grid, so that grids with high target points are visited frequently by the UAVs to collect attractive agricultural information. Conversely, UAVs are expected to avoid flying over grids treated as no-fly zones containing no interesting information. A set of simulation experiments demonstrates the efficiency of the proposed method in terms of data collection robustness and UAV resource consumption in comparison with well-known benchmarks.
To accurately detect the propagation trend of a wildfire, a trajectory planning approach for a fleet of UAVs was constructed to offer efficient situational awareness about the current status of an ongoing wildfire perimeter [24]. The proposed idea relies on situation assessment and observation planning (SAOP), performing perception, decision, and action activities in turn. The SAOP approach produces a fire map with a prediction of fire spread based on UAV observations. As a centralized approach, SAOP uses a ground station to handle all of the aforementioned processes. A metaheuristic method called variable neighborhood search (VNS) is exploited to formulate the wildfire observation problem (WOP), producing trajectories that depend on the application requirements. Both real-world and simulation experiments validated the capability of the proposed solution in mapping wildfire spread.
Another recent study explored cost-efficient trajectories through an RL-based path planning strategy with the aid of sensor network (SN) technology [25]. The target application is animal tracking in harsh environments, and a Q-learning strategy forms the core of the study to acquire information from the area in a timely manner. A sensor network deployed on the ground initially detects the animals and reports them to the UAVs. The Q-learning algorithm then provides a robust operation in which UAVs visit the sensor nodes holding fresh animal-appearance information. A numerical reward function is defined to capture the application dynamics and maximize the total reward value. The proposed strategy reaches an optimal policy that directs the UAVs to fly over zones with high animal mobility. Numerous simulation results proved the applicability of the combination of RL, SNs, and UAVs for tracking wild animals in a timely manner.
Quick search and rescue (SAR) activities during a natural disaster have also been studied with the aid of UAVs, with the goal of maximizing coverage of the affected zone [26]. The cited study overcomes the coverage limitation of a single UAV with a multi-UAV solution built on a solid trajectory design algorithm. To accomplish this objective, a target area is split into sub-areas, and each sub-area is associated with a risk level. A multi-agent Q-learning-based UAV trajectory strategy is then proposed to enable UAVs to visit sub-areas prioritized by high risk. The proposed learning strategy satisfies both the connectivity and energy constraints of the UAVs; this is achieved by keeping the UAVs connected to a central ground station either directly or in a multi-hop manner. Comprehensive simulations were performed to assess the efficiency of the designed trajectory with respect to a prioritized map. The results demonstrate a significant improvement in producing an ideal UAV trajectory over existing approaches such as Monte Carlo, random, and greedy algorithms.
Collecting useful information from Internet of Things (IoT) nodes in complex environments, with UAVs acting as central data collectors, has been studied thoroughly in [27]. The main goal of the work is to alleviate critical issues faced by UAVs, such as collision avoidance and communication interference between UAVs. An accurate trajectory planning model is established through a three-step approach. The first step applies the K-means algorithm to handle task allocation, thereby attaining a collision-free channel among the UAVs. The next two steps develop a centralized multi-agent DRL algorithm for UAV trajectory design. The effectiveness of the proposed work was evaluated in simulations for a network of 4 UAVs and 30 IoT nodes, in comparison with two current multi-agent approaches.
A multi-UAV path planning algorithm with multi-agent RL has been proposed to supply robust coordination among UAVs under dynamic environmental conditions [28]. A recurrent neural network (RNN) is used to exploit historical information when the observations are only partially complete. The reward function in the proposed RL architecture was designed in light of a multi-objective optimization problem with specific attributes such as coverage area and security. A simulation environment with three UAVs was created to simulate multi-UAV reconnaissance tasks, showing significant performance improvements.
The trajectory design of a group of UAVs for wireless energy transfer to a ground station has been investigated in [29]. The work aims to maximize the total power received on the ground by optimizing the trajectories, taking the main restrictions on flying speed and collision avoidance into consideration. The optimization problem is solved by adapting the Lagrange multiplier method for the specific case of two UAVs, which supports the understanding of trajectory development as the number of UAVs increases to as many as seven. A remarkable finding of this study is that the UAVs should be deployed around a circle whose radius constitutes a safe separation distance. Simulation results are presented to validate the proposed trajectory algorithm in terms of coordinated path design.
The deployment of UAVs as aerial base stations on top of a traditional terrestrial communication network has been researched thoroughly, with a primary focus on developing a new framework for the joint trajectory design of UAVs [30]. The considered scenario includes a high volume of mobile users under dense traffic. The main problems to tackle comprise the joint design of trajectories and power control while satisfying quality-of-service requirements. A multi-agent Q-learning-based approach is first utilized to determine the best positions of the UAVs. Real information on the current positions of the users is then used to forecast their future positions. Finally, another multi-agent Q-learning-based approach is devised for trajectory design by jointly assigning the positions and transmit powers of the UAVs. Numerical results in the simulation environment demonstrate good prediction accuracy, converging to an ideal steady state.
Similarly, the fundamental goal of another study is to tackle the practical issues of power allocation and trajectory design for multi-UAV communication systems with a DRL-based scheme [31]. The proposed structure combines reinforcement learning and deep learning to discover the optimum behavior of the UAVs, whereby a signaling exchange process is performed among the UAVs. A network of UAVs is thus installed to dedicate sufficient bandwidth to mission-critical tasks in complex settings. Two important remarks emerge from this study: (1) a centralized structure is used for the learning process, in which the UAVs act as the agents, and (2) a decentralized framework manages bandwidth usage, allocating a reliable communication channel among the UAVs. To show the benefit of the suggested idea, numerous simulation efforts were carried out against two benchmark methods.
Another study addresses the path-planning challenges of rescue UAVs in multi-regional scenarios with priority constraints [32]. It introduces a mixed-integer programming model that combines coverage path planning (CPP) and the hierarchical traveling salesman problem (HTSP) to optimize UAV flight paths. The proposed idea suggests an enhanced method for intra-regional path planning using reciprocating flight paths for complete coverage of convex polygonal regions, optimizing these paths with Bezier curves to reduce path length and counteract drone jitter. For inter-regional path planning, a variable neighborhood descent algorithm based on k-nearest neighbors is used to determine the optimal access order of regions according to their priority. The simulation results demonstrate that the proposed algorithm effectively supports UAVs in performing prioritized path-planning tasks, improving the efficiency of information collection in critical areas, and aiding rescue operations by ensuring quicker and safer exploration of disaster sites.
A highly relevant paper presents a comprehensive study on optimizing the flight paths and trajectories of fixed-wing UAVs at a constant altitude [33]. It is conducted in two primary phases: path planning using the Bezier curve to minimize path length while adhering to curvature and collision avoidance constraints, and flight trajectory planning aimed at minimizing maneuvering time and load factors under various aerodynamic constraints. The study utilizes three meta-heuristic optimization techniques, namely particle swarm optimization (PSO), genetic algorithm (GA), and grey wolf optimization (GWO), to determine the optimal solutions for both phases. Results indicate that PSO outperforms GA and GWO in both path planning and trajectory planning, particularly when employing variable speed strategies, which significantly reduce the load factor during maneuvers. Additionally, the paper demonstrates the successful application of these optimized strategies in simultaneous target arrival missions for UAV swarms, highlighting the effectiveness of variable speed in scenarios involving tighter turning radii.
To sum up, recent studies in UAV technology have significantly enhanced wildfire detection and monitoring capabilities. However, these studies often fall short of addressing the energy constraints and communication challenges faced by multi-UAV systems in vast and harsh forest environments. The proposed work addresses these gaps by introducing a novel DRL-based trajectory planning algorithm for multi-UAV scenarios in wildfire reconnaissance. We partition the target area into identical subareas and assign risk levels using a new risk mapping strategy, allowing UAVs to prioritize high-risk areas. Additionally, our double-deep reinforcement learning approach enhances trajectory planning efficiency, ensuring effective coverage of high-risk forest areas. Extensive simulations with real-world datasets validate our approach, demonstrating significant improvements in monitoring efficiency, energy consumption, and data collection accuracy.
4. DDQN-Based Trajectory Planning Approach
To address the objective function stated in the previous section mathematically, we model the problem as a decentralized, finite-state Markov decision process (MDP). The MDP is composed of the tuple (S, A, P, R, γ), and each element of the tuple is defined for the proposed trajectory solution as follows (an illustrative sketch of these components is given after the list):
S denotes the state space, constructed from environment information such as the landing areas (grids).
A is the list of feasible actions given the current location of a UAV, that is, the possible movement directions when passing through a particular grid.
P is the state transition probability, which is set to 1 owing to the deterministic nature of the model.
R represents the reward function, which is evaluated upon the arrival of a UAV in a grid.
γ is the discount factor in the range of 0 to 1, which balances the weighting of long-term against short-term rewards.
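As an illustration of how these components could be realized in software, the following minimal Python sketch maps the tuple onto a grid-based UAV environment. The grid dimensions, the four-direction action encoding, and the reward handling are illustrative assumptions rather than the exact configuration used in this paper.

```python
import numpy as np

# Minimal sketch of the decentralized MDP components for one UAV agent on a grid map.
GRID_ROWS, GRID_COLS = 10, 10
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # A: north, south, west, east

class GridUAVEnv:
    def __init__(self, risk_map, gamma=0.95):
        self.risk_map = np.array(risk_map, dtype=float)   # per-grid fire risk (FRZI-like) values
        self.gamma = gamma                                 # discount factor in (0, 1)
        self.state = (0, 0)                                # S: current grid cell of the UAV

    def feasible_actions(self):
        """Actions allowed from the current cell (no border violations)."""
        r, c = self.state
        return [a for a, (dr, dc) in ACTIONS.items()
                if 0 <= r + dr < GRID_ROWS and 0 <= c + dc < GRID_COLS]

    def step(self, action):
        """Deterministic transition (P = 1); the reward R is collected on arrival."""
        dr, dc = ACTIONS[action]
        r, c = self.state
        self.state = (r + dr, c + dc)
        reward = self.risk_map[self.state]   # reward granted when the UAV enters a grid
        self.risk_map[self.state] = 0.0      # a visited grid yields no further reward
        return self.state, reward
```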
An iterative form of the Bellman optimality equation for maximizing the reward in such an MDP has been successfully applied in [36]. Repeated application of the Bellman equation eventually yields an optimal policy that selects the best action for a given state. For state values, this relation can be written as

$$V^{*}(s) = \max_{a \in A}\Big[R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{*}(s')\Big].$$
This paper builds on Q-learning, as described in the previous section, to solve the defined MDPs without prior knowledge of either the environment or the reward. A common problem with the standard Q-learning structure, however, is its relatively poor performance as the number of actions and states grows, which can make the problem considerably harder and impede the discovery of an optimal policy. To overcome this problem, deep reinforcement learning (DRL) was proposed, incorporating a deep neural network to approximate the elements of a typical RL formulation. In many practical multi-agent applications, a common policy can be obtained by using a single deep network. The main body of a typical DRL model is represented in Figure 3; the size of the output layer determines the dimension of the action space for a given state. In this study, the DQN model is the leading component, and the UAVs are the agents of the DQN model. The states (S) in our Q-learning model represent the positions of the UAVs in three-dimensional space, where each state corresponds to a specific grid location in the monitored area. The state space encompasses all possible positions that a UAV can occupy during its flight, which allows the model to evaluate the environment comprehensively and determine the best trajectories for the UAVs to follow. The actions (A) correspond to the possible movements of the UAVs from one grid location to another, including movements to the north, south, east, and west, and potentially upward and downward in 3D space. The UAVs choose these actions based on the current state and the learned Q-values, aiming to maximize the total reward. Maximizing the covered area yields reward points, subject to a penalty when a UAV flies beyond the communication range or fails to return to the center before exhausting its energy. At each time step, a UAV evaluates its current state (its position on the grid) and selects the action that maximizes its expected reward according to the learned Q-values. In this way, pre-flight trajectory determination for a multi-UAV system is cast as a game whose objective is to finish with as many points as possible.
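The following short sketch, with assumed layer sizes and an assumed six-action encoding (north, south, east, west, up, down), illustrates how a Q-network can replace the tabular Q-function and how an agent can select actions greedily while occasionally exploring; it is an illustration, not the exact architecture used in this study.

```python
import random
import torch
import torch.nn as nn

N_ACTIONS = 6  # assumed action encoding: north, south, east, west, up, down

# The output layer holds one Q-value per action, so its size equals the action-space dimension.
q_net = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),     # input: the UAV's (x, y, z) grid position
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)

def select_action(state_xyz, epsilon=0.1):
    """Epsilon-greedy selection: explore occasionally, otherwise pick the argmax Q-value."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state_xyz, dtype=torch.float32))
    return int(torch.argmax(q_values).item())
```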
DQN may suffer from systematic overestimation, particularly in large or continuous state and action domains. This can cause the agent to over- or under-value particular actions and thereby learn suboptimal policies. The double DQN (DDQN) concept was recently introduced as a useful algorithm to decrease such overestimation [19]. It uses two distinct Q-value estimators, one for action selection and the other for estimating the value of the selected action. To further support an effective and stable learning process for the DDQN algorithm, the experience replay method was proposed to remove correlations in the training data and reduce variance in the data distribution [38]. An agent stores new experiences in a replay memory, and a random minibatch is then sampled from the memory to estimate the loss value. The length of the memory is an important hyperparameter that should be regulated carefully according to the dynamics of the scenario. This random sampling allows an agent to reuse each experience multiple times and extract more information from past experiences. It also improves the stability of the learning process, since training on consecutive, highly similar experiences can easily destabilize learning. The emphasis of this study is placed on the combination of DDQN and experience replay to accelerate the learning process. In addition to experience replay, a separate target network with parameters $\theta^{-}$ is used to calculate the next maximum Q-value. With this arrangement, the overestimation of action values is mitigated using the following loss function [20]:

$$L(\theta) = \mathbb{E}\big[(Y - Q(s, a; \theta))^{2}\big],$$

where $Y$ indicates the target value

$$Y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^{-}\big),$$

and $\theta^{-}$ is the parameter of the target network, which is updated as

$$\theta^{-} \leftarrow \tau\,\theta + (1 - \tau)\,\theta^{-}.$$

Here, $\tau$ is the update parameter specifying the adaptation speed in the range of 0 to 1.
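A compact sketch of this DDQN update with experience replay and a soft target update is given below. It reuses the q_net from the previous sketch, and the memory length, batch size, and tau value are placeholder hyperparameters rather than the settings used in this study.

```python
import copy
import random
from collections import deque
import torch

replay_memory = deque(maxlen=50_000)       # replay memory length is a tunable hyperparameter
target_net = copy.deepcopy(q_net)          # separate target network with parameters theta^-
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def ddqn_update(batch_size=64, gamma=0.95, tau=0.01):
    # Each stored experience is a (state, action, reward, next_state) tuple of tensors.
    batch = random.sample(replay_memory, batch_size)       # random minibatch breaks correlations
    s, a, r, s_next = map(torch.stack, zip(*batch))

    with torch.no_grad():
        # Double DQN target: the online network selects the action,
        # the target network evaluates it.
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)
        y = r.unsqueeze(1) + gamma * target_net(s_next).gather(1, best_a)

    q_sa = q_net(s).gather(1, a.unsqueeze(1))               # Q(s, a; theta)
    loss = torch.nn.functional.mse_loss(q_sa, y)            # (Y - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: theta^- <- tau * theta + (1 - tau) * theta^-
    for p_t, p in zip(target_net.parameters(), q_net.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```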
In this article, special importance is given to designing an appropriate reward function, since it has a direct impact on performance. The reward function in our model is designed to encourage UAVs to cover the maximum area while avoiding areas outside the communication range and ensuring that they return to the center before depleting their energy. The reward is calculated from the coverage achieved and from penalties for undesirable actions, which guides the UAVs toward optimal trajectories. The reward function can be expressed in terms of the following parts:

$$R_{i}(t) = \alpha \sum_{k=1}^{t} F_{i,k} \;-\; P_{\text{range}} \;-\; P_{\text{act}} \;-\; P_{\text{time}} \;-\; P_{\text{move}}.$$

The first term of the sum is the aggregate reward of FRZI points collected from the grids by agent $i$ over the task duration $t$, where $F_{i,k}$ denotes the fire risk (FRZI) value of the grid visited at the $k$th step of agent $i$. This aggregated reward is scaled by a multiplication factor $\alpha$, constituting the shared reward among all agents. $P_{\text{range}}$ is a penalty term that encourages UAVs to stay within the communication limit; it penalizes UAVs that navigate outside the predefined communication range, ensuring that they maintain connectivity. The next term, $P_{\text{act}}$, imposes an exclusive punishment when an action is rejected owing to an undesirable request, such as a border violation or visiting a grid currently occupied by another UAV; it discourages UAVs from taking actions that lead to conflicts or redundancies. The third term, $P_{\text{time}}$, is a punishment for landing outside the designated time; it ensures that UAVs complete their tasks within a specific time frame, promoting efficient task execution. The last term, $P_{\text{move}}$, is a fixed movement punishment that encourages UAVs to shorten their flight durations and give priority to more efficient trajectories; it acts as a general penalty for unnecessary movements, incentivizing the UAVs to find the shortest and most effective paths. To avoid multiple visits to the same grid by different UAVs, the risk point of a visited grid is set to 0. The general structure of the neural network model is inspired by the one successfully implemented in [39], as shown in Figure 4.
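To make the composition of these reward terms concrete, the following small sketch combines them for a single step of one agent. The scaling factor and penalty magnitudes are placeholder values, not the settings used in this paper.

```python
# Placeholder coefficients for the reward terms described above (not the paper's values).
ALPHA = 1.0       # shared-reward scaling factor applied to collected FRZI points
P_RANGE = 5.0     # penalty for navigating outside the communication range
P_ACTION = 2.0    # penalty for a rejected action (border violation, occupied grid)
P_TIME = 10.0     # penalty for landing outside the designated time
P_MOVE = 0.1      # fixed per-movement penalty favouring shorter trajectories

def step_reward(frzi_value, out_of_range, action_rejected, late_landing):
    """Reward of one UAV step: collected fire-risk points minus the applicable penalties."""
    reward = ALPHA * frzi_value
    if out_of_range:
        reward -= P_RANGE
    if action_rejected:
        reward -= P_ACTION
    if late_landing:
        reward -= P_TIME
    return reward - P_MOVE
```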
6. Conclusions and Future Work Directions
In this article, we explore the application of a novel deep reinforcement learning-based algorithm for the trajectory planning of multi-UAV systems tasked with wildfire reconnaissance and monitoring. The framework effectively partitions a target disaster area into grids, optimizing UAV paths based on calculated fire risk levels to ensure efficient and comprehensive surveillance. The use of multiple UAVs facilitates a collaborative approach to covering extensive forested areas, enhancing detection and monitoring capabilities in regions prone to wildfires. The simulations demonstrate that our approach meets these requirements well, particularly in optimizing trajectories under constraints such as limited battery life and communication range. The simulation results reveal substantial improvements in the efficiency and effectiveness of UAV operations, including significant reductions in boundary violations and increases in point collection rates.
Furthermore, the integration of a deep Q-network (DQN) model with double Q-learning ensures that our algorithm adapts to various environmental conditions, continually learning from each interaction to enhance decision-making processes. This adaptability is crucial for managing the dynamic and often unpredictable nature of wildfire spread. The results highlight the potential of advanced machine learning techniques in addressing critical environmental challenges. The successful deployment of such UAV systems could lead to more timely and accurate responses to wildfires, potentially saving vast expanses of forest and wildlife and reducing human and economic losses. In conclusion, this study not only presents a robust technological solution to a pressing environmental issue but also paves the way for further research into the application of machine learning in disaster management and response systems.
It is essential to highlight possible limitations and drawbacks to give insight into potential future work directions. The proposed idea is based on small, inexpensive UAVs, allowing for flexibility in terms of complexity and cost, but the optimum number of UAVs in practical deployments must be arranged carefully. When the application scenario changes, retraining the system may take a long time. Another important drawback of the proposed method is the lack of practical implementations; real deployments are likely to expose new problems that will require considerable effort and time to handle. The future work of this study will therefore focus mainly on practical implementations and on solving the problems encountered there.