Article

An Integrated MADQN–Heuristic Framework for Swarm Robotic Fire Detection and Extinguishing

by Andrei Dutceac 1,* and Constantin I. Vizitiu 2
1 Department of Electronic Systems and Military Equipment, Military Technical Academy "Ferdinand I" (MTA), 050141 Bucharest, Romania
2 AOSR/Military Technical Academy "Ferdinand I" (MTA), 050141 Bucharest, Romania
* Author to whom correspondence should be addressed.
Robotics 2026, 15(1), 5; https://doi.org/10.3390/robotics15010005
Submission received: 17 November 2025 / Revised: 19 December 2025 / Accepted: 25 December 2025 / Published: 27 December 2025
(This article belongs to the Special Issue Multi-Robot Systems for Environmental Monitoring and Intervention)

Abstract

Wildfires pose a growing global threat, demanding rapid, scalable, and autonomous response strategies. This study proposes HG-MADQN (Heuristic-Guided Multi-Agent Deep Q-Network), a hybrid framework that integrates reinforcement learning with biologically inspired pheromone-based heuristics to achieve adaptive fire detection and suppression using drone swarms. The system models a decentralized swarm operating in a grid-based environment, where each drone combines learned policies with heuristic guidance derived from a dual-pheromone mechanism (a fire-attraction field guiding suppression and a coverage-repulsion field promoting exploration). The proposed hybrid approach ensures efficient coordination, minimizes redundant movements, and maintains continuous area coverage without centralized control. Simulation experiments conducted on dynamic wildfire scenarios demonstrate that HG-MADQN significantly outperforms traditional heuristic, Lévy-Flight, and reinforcement learning (MADQN) algorithms. It achieves faster containment, reduced burned area, and lower resource consumption, while exhibiting strong robustness across multiple swarm sizes and fire configurations. The results confirm that hybridizing learned and heuristic decision models enables a balanced exploration–exploitation trade-off, leading to improved scalability and resilience in cooperative fire suppression missions.

1. Introduction

1.1. Wildfire Challenges and the Role of Swarm Robotics

Wildfires represent a significant and increasing threat to ecosystems [1], infrastructure, human lives, and air quality around the world. By mid-2025, the European Forest Fire Information System had reported over one million hectares burned in Europe, making 2025 the worst wildfire season on record for the EU.
Recently, advanced technologies have played a crucial role in improving forest fire management and response capabilities. These include satellite-based remote sensing (see also surveys on AI-UAV platforms [2]), unmanned aerial vehicles (UAVs) and swarm systems for both detection and suppression [3,4], IoT sensor networks and wireless communication systems enabling real-time monitoring [5], and machine-learning and reinforcement-learning algorithms that support predictive modeling of fire spread, resource allocation, and decision-support tools [6]. The convergence of these technologies offers the potential for proactive and scalable wildfire management, shifting from mere reaction to strategic, data-driven mitigation.
Swarm robotic systems, composed of multiple UAVs or UGVs [7], offer a scalable, decentralized means of monitoring and intervening across large and inaccessible terrains [8]. However, ensuring effective coordination, rapid response, and efficient energy use in dynamic fire environments remains a major challenge due to the inherent complexity of natural ecosystems. Training a swarm for specific tasks requires realistic simulation environments, yet it is practically impossible to model nature in its entirety, particularly in wildfire scenarios, which depend on numerous interrelated factors such as wind conditions, vegetation type, and material density.

1.2. Algorithmic Paradigms in Swarm Robotics

Although previous studies have classified swarm robotic algorithms using diverse taxonomies such as behavior-based, optimization-driven, or communication-oriented frameworks [9], this paper adopts a functional categorization better suited to the context of adaptive coordination and fire-suppression missions.
Specifically, the algorithms are grouped into three primary directions: (1) heuristic and bio-inspired approaches, (2) reinforcement learning and multi-agent coordination strategies, and (3) hybrid and integrated frameworks designed for adaptive multi-robot control in dynamic environments.

1.2.1. Heuristic and Bio-Inspired Coordination Strategies

Nature-inspired swarm coordination has been widely studied due to its simplicity, scalability, and decentralized control properties. Alsammak et al. [10] proposed a nature-inspired UAV swarm system for wildfire suppression, emphasizing distributed fire spots, collision avoidance, and energy management. Their framework used stigmergy and random-walk mechanisms, achieving superior coverage compared to PSO [11] and Lévy-flight strategies [12]. Lee et al. [13,14] explored bee swarm-inspired cooperation models for foraging robots, improving task distribution and energy efficiency through behavioral modeling. John and Sundaram [15] introduced a genetic algorithm-based routing and scheduling system for wildfire suppression using multiple UAVs, optimizing path planning under limited fuel and time constraints. In this approach, the genetic algorithm encodes each UAV's route and task schedule into candidate solutions that evolve through iterative selection, crossover, and mutation.
Zhou et al. [16] designed a multi-target search algorithm for swarm robots in a 3D mountain environment, achieving efficient area coverage and multi-source detection. Robots use a heuristic scoring function to decide which cell to explore next, based on proximity to unexplored areas, the probability of target presence, and distance cost. Although heuristic algorithms are computationally lightweight and robust, they typically lack adaptivity and fail to self-optimize in rapidly changing environments such as active wildfire scenarios.

1.2.2. Reinforcement Learning and Multi-Agent Coordination

Deep reinforcement learning (DRL) and its multi-agent extensions have become essential for enabling autonomous decision-making in cooperative robotics. A comprehensive survey [17] provides an overview of Multi-Agent Reinforcement Learning (MARL) methods applied to unmanned aerial vehicle (UAV) control, coordination, and decision-making. The paper reviews state-of-the-art MARL algorithms, including DQN, MADDPG, PPO, and their applications in cooperative and competitive UAV tasks. Haksar and Schwager [18] implemented a distributed MADQN framework for firefighting UAVs, where each UAV acts independently using local observations while sharing a common learned policy. Collignon et al. [19] applied multi-agent DRL to wildfire search and rescue operations, showing cooperative path planning under uncertainty. This paradigm has been extended to multi-UAV pursuit and cooperative search in dynamic 3D environments [20,21], where agents learn navigation and cooperation policies through shared rewards. Huang et al. [22] investigated task allocation and pathfinding through distributed RL, showing improved coordination efficiency.
Despite notable progress, DRL and its multi-agent extensions still face significant challenges when applied to critical and dynamic environments such as wildfire suppression or disaster response. The learning process is highly sensitive to reward shaping and hyperparameter tuning, which makes policy transfer across scenarios difficult. This limitation becomes particularly critical in natural disaster scenarios, where the evolution of the environment is highly unpredictable and nearly impossible to simulate accurately. As a result, agents trained in simplified or static simulations may develop behaviors that do not generalize to real-world conditions, leading to substantial variations in performance. Consequently, the impact of their actions in real operations becomes difficult to predict or control, reducing the reliability of DRL-based decision-making in safety-critical applications.

1.2.3. Hybrid and Integrated Control Frameworks

Recent trends point toward integrating heuristic decision models with learning-based frameworks to achieve a balance between adaptability and computational efficiency [23,24]. Sun et al. [25] propose a Hierarchical Safe Reinforcement Learning (HSRL) framework for multi-robot systems that integrates an RL policy for decision-making and cooperative task allocation with a low-level safety control module based on Uniformly Ultimate Boundedness (UUB) theory, which guarantees that each robot's state and tracking errors remain within safe limits. Although the approach is safe, adapting to new missions or different robot dynamics may require retraining or retuning of the UUB parameters.
Renzaglia and Dibangoye [26] proposed FCAO (Frontier-based Cognitive Adaptive Optimization), an optimization model for multi-robot exploration and 3D coverage, integrating learning-based adaptability. For coverage, they use a Voronoi-based initialization via a Constrained Centroidal Voronoi Tessellation (CVT), and for exploration, they combine CAO (Cognitive-based Adaptive Optimization) with frontier-based navigation. By combining local information-based optimization with global heuristics (Voronoi or frontier guidance), the method achieves efficient, decentralized, and scalable performance for multi-robot systems. However, in completely unknown environments, generating a good initial distribution of robots is difficult. A different hybrid approach is used in [27], integrating reinforcement learning with genetic algorithms to enhance task allocation efficiency. They model the task allocation problem as a Markov game where the Genetic Algorithm is used to generate a diverse population of policy networks (actor-critic networks) while each individual policy network is trained with PPO to optimize performance (reward) given the environment interactions.
Recent research has shown that hybrid control architectures can significantly enhance coordination and adaptability in multi-robot and swarm systems. Learning components can adapt policies or parameters when the environment changes, unlike purely rule-based or model-based methods. Robots can adjust their strategies for new missions, unknown terrains, or varying numbers of agents.
The main contributions of this study can be summarized as follows:
  • Dual-pheromone heuristic design: A novel dual-heuristic mechanism was introduced, combining a fire-attraction pheromone for suppression guidance and an exploration pheromone for area coverage, enabling adaptive balance between exploration and exploitation.
  • Reward function formulation for MADQN: A task-specific reward function was designed to integrate fire intensity, suppression efficiency, and resource optimization, effectively shaping agent behavior toward cooperative and sustainable firefighting actions.
  • Algorithmic fusion of heuristic control and MADQN: The proposed HG-MADQN algorithm fuses heuristic swarm coordination with Multi-Agent Deep Q-Network learning, creating a hybrid decision-making framework capable of decentralized and adaptive fire suppression.
  • Comparative performance analysis: A comprehensive experimental evaluation was conducted across four algorithms: Heuristic, Lévy, Reinforcement Learning (MADQN), and HG-MADQN, demonstrating the superior containment speed, spatial efficiency, and robustness of the proposed approach.
The remainder of this paper is organized as follows. Section 2 introduces the system model and details the proposed HG-MADQN framework. Section 3 describes the simulation setup, presents the experimental results, and discusses the performance analysis. Section 4 concludes the paper with final remarks and directions for future research.

2. Proposed Methodology

2.1. Overall Architecture and Environment

The proposed system models a cooperative swarm of autonomous robots deployed in a discretized environment representing a wildfire scenario [28,29]. Each robot moves within an eight-connected grid, corresponding to the cardinal and diagonal directions, with a fixed step size. Movement is strictly decentralized, and local collision avoidance is ensured by a heuristic scoring function that penalizes proximity between robots and by a global occupancy constraint that prevents multiple agents from occupying the same cell simultaneously. Robots are equipped with local sensors that detect fire sources, environmental pheromone intensity, and nearby agents within a finite radius. Each drone is additionally equipped with a payload module containing water for fire suppression, which can be deployed when an active fire is detected within the sensing range, as shown in Figure 1. A detection radius of three cells allows an agent to perceive active fires and initiate an extinguishing action whenever its payload is available and the cooldown timer permits.
The wildfire propagation model used in this study follows the cellular automata (CA) framework proposed by Wang Xuehua et al. [30], which simulates the spatio-temporal evolution of forest fires through local interaction rules. In this model, the terrain is represented as a two-dimensional grid where each cell corresponds to a discrete land patch with a specific state: unburned, burning, or burned. The transition of a cell from one state to another is governed by both static and dynamic environmental factors. Static factors include terrain slope, elevation, and vegetation type, while dynamic factors comprise wind speed and direction, ambient temperature, and humidity.
The fire-spread probability is determined by an improved K-T rule [31] that combines these influences into a single transition coefficient, expressed as:
$$R_0 = aT + bW + cF, \qquad (1)$$
where T denotes temperature, W represents wind intensity, and F corresponds to the combustibility coefficient of the vegetation. Wind is modeled as an anisotropic multiplier that enhances spread in the downwind direction, while terrain slope introduces an additional gradient-based correction, accelerating propagation uphill and slowing it downhill.
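To make these propagation rules concrete, the following minimal sketch implements one cellular-automata tick under the assumptions above. The coefficient values, the burnout probability, and the 0.5 wind gain are illustrative placeholders, not the calibrated parameters of [30,31]:

```python
import numpy as np

EMPTY, VEG, BURNING, BURNED = 0, 1, 2, 3
A, B, C = 0.03, 0.05, 0.01  # illustrative weights for temperature, wind, fuel

def spread_step(grid, temp, wind, fuel, rng):
    """One CA tick: each burning cell may ignite vegetated Moore neighbors."""
    new_grid = grid.copy()
    h, w = grid.shape
    for y, x in np.argwhere(grid == BURNING):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (dy, dx) == (0, 0) or not (0 <= ny < h and 0 <= nx < w):
                    continue
                if grid[ny, nx] != VEG:
                    continue
                # Linear transition coefficient R0 = aT + bW + cF (Equation (1)).
                r0 = A * temp + B * wind["speed"] + C * fuel[ny, nx]
                # Anisotropic wind multiplier: stronger spread downwind.
                downwind = dy * wind["dir"][0] + dx * wind["dir"][1]
                if rng.random() < min(1.0, r0 * (1.0 + 0.5 * max(0.0, downwind))):
                    new_grid[ny, nx] = BURNING
        if rng.random() < 0.1:  # illustrative burnout probability
            new_grid[y, x] = BURNED
    return new_grid
```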

2.2. Heuristic Function for Fire-Guided Navigation

We propose a Dual-Pheromone Heuristic Navigation scheme that guides local motion and enhances spatial coordination. The design of this heuristic is motivated by natural phenomena observed in collective insect behavior, such as the pheromone-based navigation of ants and the cooperative foraging patterns of honey bees. Similar to how ant colonies deposit pheromones to mark promising paths toward food sources, each drone in the proposed system maintains and updates two virtual pheromone fields: a fire-attraction field ($\rho_{fire}$) and a coverage-repulsion field ($\rho_{cov}$).
The fire-attraction field $\rho_{fire}$ serves as a distributed memory that grows around detected fire regions. When a robot detects a fire cell within its sensing radius, it deposits pheromone in the nearby area, which then diffuses to neighboring cells and gradually evaporates over time. This mechanism increases the probability that other agents in the vicinity will move toward regions of higher pheromone concentration, thus coordinating the swarm's response toward active fire zones. In contrast, the coverage field $\rho_{cov}$ acts as a repulsive signal, accumulating along previously visited areas and encouraging agents to explore unvisited regions. The interplay between attraction to fire pheromones and repulsion from coverage pheromones results in a balanced exploration–exploitation trade-off that prevents redundant motion and improves overall spatial coverage.
At each simulation step, every drone evaluates a discrete set of candidate movement directions within its Moore neighborhood (eight directions). For each candidate cell $c_i$, the heuristic function in Equation (2) computes a score that quantifies how favorable that direction is, based on both environmental cues (pheromone fields) and social constraints (spacing and exploration):
$$S(c_i) = -W_{cov}\,\rho_{cov}(c_i) + W_{fire}\,\rho_{fire}(c_i) - W_{sep}\,f_{sep}(c_i) + H_{new}\,N_{new}(c_i) + H_{old}\,N_{old}(c_i), \qquad (2)$$
where $W_{cov}$ and $W_{fire}$ are the repulsion and attraction weights of the coverage and fire pheromone fields, $f_{sep}(c_i)$ is the local separation penalty of Equation (3), computed from the squared distance to the nearest neighbor, $N_{new}(c_i)$ and $N_{old}(c_i)$ denote the numbers of newly and previously observed cells within a 5 × 5 patch centered on $c_i$, and $H_{new}$ and $H_{old}$ are the corresponding coefficients that encourage exploration of unvisited areas and discourage revisiting explored zones. The direction with the highest score is selected as the heuristic action proposal, which serves as the non-learning baseline for decision-making.
$$f_{sep}(c_i) = \max\left(0,\; 1 - \frac{d_{min}^2(c_i)}{\mathit{near\_safe}}\right) \qquad (3)$$
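A minimal sketch of this candidate scoring follows, assuming the weights of Table 1 are passed through a dictionary `w` (including the `near_safe` threshold), `seen` is a Boolean map of already observed cells, and out-of-bounds candidates are simply skipped. Note that $H_{old}$ is itself negative (Table 1), so it enters the sum with a plus sign:

```python
import numpy as np

MOORE = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def heuristic_action(pos, rho_fire, rho_cov, neighbor_positions, seen, w):
    """Score each Moore-neighborhood candidate with Equation (2) and
    return the best movement direction."""
    best_score, best_dir = -np.inf, (0, 0)
    for dy, dx in MOORE:
        y, x = pos[0] + dy, pos[1] + dx
        if not (0 <= y < rho_fire.shape[0] and 0 <= x < rho_fire.shape[1]):
            continue  # skip out-of-bounds candidates
        # Separation penalty f_sep (Equation (3)) from the nearest teammate.
        d2 = min((((y - ny) ** 2 + (x - nx) ** 2)
                  for ny, nx in neighbor_positions), default=np.inf)
        f_sep = max(0.0, 1.0 - d2 / w["near_safe"])
        # New vs. already-seen cells in the 5x5 patch around the candidate.
        patch = seen[max(0, y - 2):y + 3, max(0, x - 2):x + 3]
        n_new, n_old = int((~patch).sum()), int(patch.sum())
        score = (-w["cov"] * rho_cov[y, x] + w["fire"] * rho_fire[y, x]
                 - w["sep"] * f_sep
                 + w["h_new"] * n_new + w["h_old"] * n_old)  # h_old is negative
        if score > best_score:
            best_score, best_dir = score, (dy, dx)
    return best_dir
```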
At each simulation step, the pheromone fields for fire attraction ($\rho_{fire}$) and coverage repulsion ($\rho_{cov}$) evolve according to a diffusion–evaporation process that governs how local information spreads and fades over time. This process is inspired by the stigmergic communication mechanism observed in ant colonies, where pheromones diffuse through the environment and evaporate gradually, creating a short-term collective memory.
$$\rho_{t+1}(x,y) = \mathit{evap}\left[(1-\mu)\bigl(\rho_t(x,y) + \Delta\rho_t(x,y)\bigr) + \frac{\mu}{8}\sum_{(i,j)\in N_8(x,y)} \rho_t(i,j)\right], \qquad (4)$$
where $\rho_t(x,y)$ is the pheromone concentration at cell $(x,y)$ at time $t$; $\Delta\rho_t(x,y)$ is the pheromone newly deposited at that step (by drones detecting fire or visiting new cells); $N_8(x,y)$ is the Moore neighborhood (the eight adjacent cells); $\mu \in [0, 1]$ is the diffusion rate, controlling how much pheromone spreads to neighbors; and $\mathit{evap} \in (0, 1)$ is the evaporation coefficient, controlling the decay of pheromone over time.
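This update lends itself to a vectorized implementation; the sketch below applies Equation (4) with a convolution over the Moore neighborhood. The values of μ and evap are illustrative, as the paper does not list them in Table 1:

```python
import numpy as np
from scipy.signal import convolve2d

# Moore-neighborhood kernel (center cell excluded).
NEIGHBORS = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]], dtype=float)

def update_pheromone(rho, deposits, mu=0.1, evap=0.97):
    """One diffusion-evaporation step of Equation (4)."""
    neighbor_sum = convolve2d(rho, NEIGHBORS, mode="same", boundary="fill")
    return evap * ((1.0 - mu) * (rho + deposits) + (mu / 8.0) * neighbor_sum)
```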
Although the pheromone fields are implemented as global grid-based maps at the simulation level, each drone accesses and updates only a local portion of these maps, corresponding to its immediate neighborhood. Specifically, agents query pheromone values within a bounded sensing window centered on their current position and deposit pheromones locally based on detected fires or visited cells. The simulation assumes that local pheromone updates are consistently synchronized across agents at each time step, representing an idealized communication model commonly used to study stigmergic coordination in swarm systems.

2.3. Multi-Agent Deep Q-Network (MADQN) Model

The proposed swarm control framework utilizes a Multi-Agent Deep Q-Network (MADQN) architecture to achieve decentralized decision-making while preserving coordinated behavior among the drones, as shown in Figure 2. The model adopts the Centralized Training with Decentralized Execution (CTDE) paradigm, a widely recognized strategy in cooperative multi-agent reinforcement learning. During training, all agents share a common replay buffer and a shared neural network policy, enabling global information integration and stable convergence. In the execution phase, each agent operates independently, making decisions based solely on its locally perceived states.
The MADQN architecture mitigates the credit assignment problem by focusing learning on local agent observations and individual reward signals, thereby avoiding the difficulty of determining how individual agent actions influence a global outcome, such as overall fire containment. Through shared network parameters, agents learn a common action-value representation from their combined local experiences. This encourages the swarm to behave in a consistent and coordinated way without relying on global rewards or centralized supervision. The architecture also improves sample efficiency through experience sharing: a common replay buffer aggregates experiences from all agents, accelerating learning and reducing the variance typically observed when agents learn in isolation.
Each drone is modeled as an autonomous agent $i \in \{1, 2, \ldots, N_a\}$, where $N_a$ denotes the total number of agents, interacting with a partially observable environment. At each discrete time step $t$, the agent receives an observation vector $s_i(t) \in \mathbb{R}^{54}$, which encodes both environmental and internal features. The observation includes the normalized position coordinates ($x_i$, $y_i$), local pheromone patches for fire and coverage ($\rho_{fire}$, $\rho_{cov}$) extracted from a 5 × 5 window centered on the agent, and internal states representing normalized payload and remaining energy. This compact state representation captures spatial, environmental, and operational context, enabling the agent to make context-aware navigation and extinguishing decisions.
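The dimensionality is consistent: 2 position values, two flattened 5 × 5 pheromone patches (25 + 25), and 2 internal states give 54 features. A minimal sketch of the observation builder follows, assuming a `drone` object with hypothetical `pos`, `payload`, and `energy` attributes:

```python
import numpy as np

def build_observation(drone, rho_fire, rho_cov, grid_size):
    """Assemble the 54-dim local state: 2 position values, two flattened
    5x5 pheromone patches (25 + 25), and 2 internal states."""
    y, x = drone.pos

    def patch(field):
        padded = np.pad(field, 2)          # zero-pad so edge windows stay 5x5
        return padded[y:y + 5, x:x + 5].ravel()

    return np.concatenate([
        np.array([y / grid_size, x / grid_size]),     # normalized position
        patch(rho_fire), patch(rho_cov),              # local pheromone context
        np.array([drone.payload / drone.payload_max,  # normalized payload
                  drone.energy / drone.energy_max]),  # remaining energy
    ]).astype(np.float32)
```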
The action space $A = \{a_1, a_2, \ldots, a_8\}$ corresponds to the eight possible movement directions within the Moore neighborhood. At each time step, an agent selects an action $a_i(t) \in A$ according to its learned Q-policy. The action leads to a new position and, if a fire is detected within the sensing radius, may trigger an extinguishing operation depending on the current payload and cooldown status.
The Q-network architecture consists of two fully connected hidden layers with ReLU activations, containing 1024 and 512 neurons, respectively, followed by a linear output layer whose dimension equals the number of discrete actions ($|A| = 8$). The Q-network estimates the expected cumulative discounted reward $Q_\theta(s, a)$ for each possible action given the agent's local state $s_i$, where $\theta$ denotes the network parameters. The learning objective follows the standard DQN temporal difference loss:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s')\sim D}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right], \qquad (5)$$
where $\gamma$ is the discount factor, $D$ is the replay buffer storing experience tuples $(s, a, r, s')$, and $\theta^-$ are the parameters of a target network updated periodically for stable convergence. The target network is synchronized with the online network every 2000 training steps.
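A compact PyTorch sketch of this architecture and loss is given below. The `done` mask is a standard detail we add for episode termination, and the target-network synchronization every 2000 steps is assumed to happen outside this function:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Shared Q-network: 54-dim local state -> Q-values for the 8 directions."""
    def __init__(self, state_dim=54, n_actions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def td_loss(online, target, batch, gamma=0.98):
    """Temporal-difference loss of Equation (5) over a replay minibatch."""
    s, a, r, s_next, done = batch
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next  # bootstrap unless terminal
    return nn.functional.mse_loss(q, y)
```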
Exploration is handled through an adaptive ε-greedy strategy, where the probability of random action selection decreases exponentially from $\varepsilon_{start} = 0.25$ to $\varepsilon_{end} = 0.05$ as training progresses. This mechanism encourages exploration of the environment in the early stages and gradually shifts toward exploitation of the learned policy once stability is achieved.
We propose a reward function that integrates multiple shaping terms to guide both individual and collective behavior. Agents receive positive rewards for discovering new unvisited areas, detecting or extinguishing fire cells, and maintaining safe separation from teammates. Conversely, penalties are applied for redundant exploration, boundary collisions, or excessive proximity to other agents. Formally, the reward for agent i at time t is computed as:
$$R_i(t) = R_{new} + R_{fire} - R_{overlap} - R_{collision} - R_{edge}, \qquad (6)$$
where $R_{new}$ provides a positive reward when an agent discovers previously unvisited cells, encouraging broad area coverage; $R_{fire}$ reinforces mission-related objectives by rewarding the detection or suppression of fire cells; $R_{overlap}$ penalizes redundant exploration of already visited regions; $R_{collision}$ discourages unsafe proximity or collisions with teammates and obstacles; and $R_{edge}$ imposes a penalty for approaching or crossing the operational boundary, ensuring agents remain within the designated exploration area.
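A minimal sketch of this shaped reward follows, with the per-event counts assumed to be computed by the environment and coefficient values taken from Table 1 (dictionary keys are illustrative):

```python
def compute_reward(newly_seen, fires_handled, revisited, collided, near_edge, p):
    """Shaped reward of Equation (6); coefficient values follow Table 1."""
    r = p["r_new"] * newly_seen        # +0.1 per newly observed cell
    r += p["r_fire"] * fires_handled   # +0.1 per fire detected/suppressed
    r -= p["r_overlap"] * revisited    # -0.001 per revisited cell
    if collided:
        r -= p["r_collision"]          # -0.2 for unsafe proximity
    if near_edge:
        r -= p["r_edge"]               # -1 near the operational boundary
    return r
```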
Training proceeds in parallel for all agents using shared experience tuples. The parameter sharing mechanism ensures that each agent learns from the aggregated local experiences collected from all agents, improving generalization and robustness to varying fire configurations. Once trained, the MADQN policy is deployed in a fully decentralized manner, where each agent independently infers its action based on local observations, enabling scalability to large teams without central control.
The overall objective of the MADQN model is to learn a mapping $\pi_i: s_i \mapsto a_i$ that minimizes mission time and residual fire area while maximizing spatial coverage and energy efficiency. By combining global learning through centralized parameter sharing and local autonomy during execution, the system achieves emergent cooperative suppression behavior across dynamic and distributed fire environments.

2.4. Fusion Between Heuristic and MADQN Policies

The proposed HG-MADQN (Heuristic-Guided Multi-Agent Deep Q-Network) framework integrates the adaptability of reinforcement learning with the biologically inspired reasoning of the heuristic model. The fusion mechanism is designed to balance learned policy-driven exploration with reactive, pheromone-based navigation, ensuring both stability and responsiveness in dynamic fire environments.
At each decision step, both subsystems produce independent action proposals. The MADQN policy $\pi_{RL}(s_i)$ outputs a discrete action $a_{RL} \in A$ based on the learned Q-values for the agent's local observation $s_i$. In parallel, the heuristic module computes an action $a_H$ that maximizes the local pheromone-based score function $S(c_i)$, as defined in Equation (2). To combine these decisions, the final control action for agent $i$ at time $t$ is selected through a stochastic switching rule:
$$a_i(t) = \begin{cases} a_{RL}(t), & \text{with probability } \lambda \\ a_H(t), & \text{with probability } 1 - \lambda \end{cases} \qquad (7)$$
where λ ∈ [0, 1] is the fusion coefficient controlling the relative influence of the reinforcement learning and heuristic components.
In the proposed implementation, λ = 0.5, which represents an equal contribution between the two decision sources. This value was empirically determined to provide the best trade-off between adaptivity and convergence stability across multiple simulated environments.
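The switching rule of Equation (7) then reduces to a few lines; the ε-greedy helper below reflects the exploration strategy described in Section 2.3, and the function names are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy choice over the Q-values of the local state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def fused_action(q_values, heuristic_action, lam=0.5, epsilon=0.05):
    """Action-level fusion of Equation (7): follow the RL policy with
    probability lam, the pheromone heuristic otherwise."""
    if random.random() < lam:
        return epsilon_greedy(q_values, epsilon)
    return heuristic_action
```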
The fusion process is implemented at the action-selection level, not the reward level, allowing both modules to operate asynchronously while maintaining compatibility with discrete movement control. The pseudo-code is shown in Algorithm 1.
Algorithm 1: Fire detection and extinguishing framework
1:  Initialize Q-network parameters, pheromone maps, and other parameters
2:  for episode = 1 to $N_{episodes}$ do
3:      Initialize drone positions and environment state
4:      while not end do
5:          for each agent i ∈ {1, …, $N_a$} do
6:              Observe local state s = [x, y, $\rho_{fire}$, $\rho_{cov}$, energy, payload]
7:              Draw a random number p ∈ [0, 1]
8:              if p < λ then
9:                  Select RL action $a_i$ ← ε-greedy($Q_\theta(s, a)$)
10:             else
11:                 Compute heuristic action $a_i$ ← argmax $S(c_i)$ using Equation (2)
12:             end if
13:             Execute $a_i$, move to the new position, obtain reward r and next state s′
14:             if fire detected then
15:                 Deposit Δ$\rho_{fire}$ around (x, y)
16:             end if
17:             Deposit Δ$\rho_{cov}$ in visited cell (x, y)
18:         end for
19:         Update pheromone fields using diffusion–evaporation Equation (4)
20:         Sample minibatch from D and update Q-network using Equation (5)
21:     end while
22: end for
The procedure begins by initializing the Q-network parameters, pheromone maps, and diffusion–evaporation constants. Each drone observes its local state, consisting of position, pheromone intensities, and internal variables. At each step, a random value p ∈ [0, 1] determines whether the agent follows the reinforcement learning (RL) policy or the heuristic navigation strategy. If p < λ, the agent applies the RL policy; otherwise, it applies the heuristic pheromone-based policy. Agents deposit fire and coverage pheromones during exploration, which evolve via diffusion–evaporation dynamics (Equation (4)). Experience tuples are stored in a shared replay buffer for centralized training and decentralized execution. The Q-network parameters are periodically updated using the temporal-difference loss (Equation (5)). Table 1 summarizes the main variables, parameters, and hyperparameters used throughout the proposed HG-MADQN framework, together with their numerical values.

3. Experimental Setup and Results

3.1. Setup

To evaluate the effectiveness of the proposed HG-MADQN (Heuristic-Guided Multi-Agent Deep Q-Network) framework, a series of large-scale simulations was conducted in a custom-designed Python (version 3.10) environment inspired by [12]. The environment models a dynamic wildfire scenario on a two-dimensional grid of size 300 × 300. Each cell stores its fire status (0—empty, 1—vegetation, 2—burning, 3—burned/extinguished), a fire intensity value that governs state transitions, and two auxiliary fields used for swarm coordination: the fire-attraction pheromone $\rho_{fire}$ and the coverage-repulsion pheromone $\rho_{cov}$. At each simulation tick, burning cells may ignite neighboring cells according to the probabilistic fire-spread model. Each drone occupies a single grid cell and perceives only local information within a limited sensing radius. The observation vector includes the drone's position, local patches of the fire and coverage pheromone fields extracted from a 5 × 5 window centered on the agent, and internal states such as remaining payload and available energy. When an active fire is detected within the sensing range and operational constraints are satisfied (e.g., sufficient payload), the drone can execute a suppression action that reduces the local fire intensity or extinguishes burning cells within its suppression radius.
A fixed number of drones is deployed from a circular base located near the lower-left corner of the map. Each drone is equipped with local sensors providing partial observations of its surroundings within a fixed sensing radius $r_{detect}$ (Table 1). Drones can move in one of eight discrete directions (Moore neighborhood) or remain stationary, detect fires, and release extinguishing payloads within a limited radius $r_{ext}$. The onboard resources are constrained by a payload capacity of 40 fire-extinguishing balls. When the remaining battery falls below 50% of its capacity, the drone autonomously returns to the base for recharging and payload replenishment.
The MADQN component was trained using the PyTorch (version 2.7.0) framework under a centralized training and decentralized execution (CTDE) paradigm. All agents shared a common Q-network during training while executing policies independently during simulation. Experience replay and a periodically updated target network were employed to stabilize learning, while exploration followed an ε-greedy strategy with exponential decay.
For the comparative evaluation, we selected the improved Lévy-flight algorithm from [12] as a benchmark. Since our proposed approach integrates two distinct algorithms, each component was also evaluated separately for comparison.
Figure 3 presents the evolution of the cumulative episode reward during MADQN training. Despite the training limit of 50 episodes, the reward curve exhibits a clear and consistent upward trend, stabilizing after approximately 30–40 episodes. This relatively fast convergence is explained by the constrained nature of the decision-making problem. Each drone operates within a limited 5 × 5 local observation zone, and the action space is discrete and small. As a result, the effective state–action space explored by each agent is significantly reduced compared to typical large-scale MARL benchmarks.

3.2. Fire Propagation and Burned Area Analysis

To evaluate the performance of the four algorithms (Heuristic, Lévy, Reinforcement Learning (RL), and HG-MADQN), we analyzed the evolution of the average number of burned cells for every 100 simulation ticks, as illustrated in Figure 4. Each algorithm was tested over 20 independent simulation episodes with randomly generated fire locations. The number of burned cells in each time interval serves as a reliable indicator of how quickly the swarm can extinguish active fires and how efficiently it spreads across the environment to prevent further propagation. The results were recorded for the first 600 simulation ticks, divided into six intervals of 100 ticks each.
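For reproducibility, the interval metric can be computed as in the sketch below, assuming cumulative burned-cell counts are logged at every tick for each episode (array shapes and names are illustrative):

```python
import numpy as np

def mean_burned_per_interval(cumulative_burned, interval=100, horizon=600):
    """Mean newly burned cells per interval, averaged over episodes.
    cumulative_burned: (n_episodes, horizon) cumulative burned-cell counts."""
    marks = np.arange(interval, horizon + 1, interval)  # 100, 200, ..., 600
    n_episodes = cumulative_burned.shape[0]
    checkpoints = np.concatenate(
        [np.zeros((n_episodes, 1)), cumulative_burned[:, marks - 1]], axis=1)
    per_interval = np.diff(checkpoints, axis=1)   # burned within each window
    return per_interval.mean(axis=0)              # one value per interval
```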
The Heuristic and Lévy approaches perform poorly during the initial simulation intervals (0–300 ticks), where a high number of burned cells indicates delayed reaction and inefficient area coverage. These methods gradually improve in the second half of the simulation, as the swarm stabilizes and begins to contain the fire spread, but their overall performance remains limited due to the lack of adaptive coordination. The RL-based strategy shows moderate results initially but fails to maintain consistent coverage of the environment, leading to a steady increase in burned cells up to the mid-simulation phase. Although the system partially recovers towards the end by extinguishing some fires, its overall performance suggests insufficient spatial coordination and delayed global response. In contrast, the HG-MADQN algorithm consistently achieves the lowest number of burned cells across all intervals, demonstrating faster adaptation, better swarm coordination, and superior control of fire propagation.
Figure 4 also shows the mean number of burned cells over multiple episodes for specific values of the parameter λ. Due to the inherently stochastic nature of the environment, individual simulation runs may favor either the heuristic strategy or the reinforcement learning policy. In certain scenarios, the heuristic rules outperform the learned policy, while in others, the reinforcement learning component achieves better fire mitigation. When λ is close to 0, the system is dominated by heuristic rules, resulting in strong but rigid performance; when λ approaches 1, the system relies almost entirely on the learned policy, which may suffer from sparse rewards and coordination difficulties, leading to increased variability in outcomes. Consequently, λ is not selected to maximize performance in a single scenario, but rather to achieve a robust compromise between learning-based adaptability and heuristic reliability under uncertainty.
Table 2 reports the standard deviation of the number of burned cells per 100 simulation ticks. The relatively large deviations observed across all algorithms are expected, given the stochastic nature of wildfire propagation and the emergent behavior of swarm robotic systems. Small variations in initial fire locations and local interactions can lead to significantly different fire evolution patterns, resulting in high inter-episode variability. Despite this inherent variability, the proposed HG-MADQN method consistently exhibits lower standard deviation.

3.3. Payload Consumption Analysis

The payload consumption analysis illustrated in Figure 5 shows that HG-MADQN achieves the best performance among all four algorithms. Throughout the simulation (0–600 ticks), it consistently uses less water per 100 ticks while still managing to extinguish the fire effectively.
This indicates that HG-MADQN is capable of allocating water resources efficiently, applying extinguishing material only where and when it is most needed. The proposed HG-MADQN algorithm leverages a swarm-based cooperative strategy and a grid-based state representation to coordinate the agents efficiently across the environment, minimizing redundant movements and excessive spraying.
In contrast, the Heuristic and Lévy algorithms show higher and more irregular water-usage patterns, reflecting inefficient and reactive behavior that wastes payload without a proportional improvement in suppression. The Reinforcement Learning (RL) algorithm, on the other hand, consumes few resources in the early and mid stages; however, during the final phase (after approximately 500 ticks), it requires a sudden and large quantity of water to control the fire. This surge indicates that the RL agent delays suppression until the fire becomes extensive, which ultimately makes it inefficient, using more payload overall and failing to fully extinguish the fire.

3.4. Success Rate

The mission completion analysis compares how effectively each algorithm suppresses the fire by the end of the simulation (tick 600), considering a mission successful if the number of burned cells remains below a defined limit.
The results in Figure 6 reveal a clear hierarchy in performance. HG-MADQN achieves the highest success rate, completing 13 out of 20 missions (65%), followed by the heuristic algorithm with 10 out of 20 (50%). Both Lévy and RL methods perform significantly worse, completing only 3 (15%) and 2 (10%) missions, respectively. These results confirm that HG-MADQN provides the most consistent and robust fire-suppression strategy. The high completion rate demonstrates that the swarm-based reinforcement learning approach, operating on a grid-structured environment, effectively coordinates agents and allocates extinguishing resources efficiently, preventing large-scale fire spread in most scenarios. It is important to emphasize that the success rates reported in Figure 6 are time-dependent. The evaluation counts only the missions completed within 600 steps, prioritizing rapid and efficient fire suppression. Both the pure RL and Lévy-based approaches are able to complete missions in some scenarios; however, they typically require a longer time horizon. As a result, they fail to meet the completion criterion within the allocated steps.
The Heuristic method, while simpler, achieves moderate performance because of its deterministic response pattern, which helps partially control the fire but lacks adaptability to dynamic conditions. The Lévy and RL algorithms demonstrate unstable behavior: Lévy's random exploration leads to inefficient fire coverage, while RL, despite showing some learning capability, fails to generalize across test conditions, resulting in frequent mission failures.

3.5. Simulation Analysis of HG-MADQN Fire-Suppression Behavior

Figure 7 illustrates the evolution of the fire-suppression process at different simulation steps. At t = 1, all drones start from the base area and begin to explore the environment. The fire pheromone level is initially zero, since no fire sources have yet been detected. The exploration coverage is still minimal, with less than 1% of the total area scanned. At t = 51, several drones have identified multiple fire clusters. The agents begin to deposit fire pheromones around these regions, which guide nearby drones toward the active fires. The coordination between agents becomes visible through clustered trajectories and concentrated coverage zones. The "Seen" percentage increases to ≈18%, indicating that the swarm rapidly expands its search range. At t = 151, the system reaches its peak fire activity ($\rho_{fire,max} = 5.00$). The drones are strongly attracted toward the most intense fire regions, forming cooperative suppression patterns. Several fire areas are already extinguished (turned gray), while new small clusters are still active. The seen coverage rises sharply to ≈78%, showing that the HG-MADQN policy efficiently prioritizes high-risk areas. By t = 351, the pheromone concentration almost disappears, signifying that most fires have been extinguished. The swarm then performs residual coverage to ensure that no re-ignition occurs. The explored surface reaches ≈96%, demonstrating that the agents maintained coordinated exploration even after fire extinction.
Overall, the visualization confirms that HG-MADQN achieves adaptive coordination, as the drones dynamically concentrate around emerging fires, distribute themselves efficiently across the environment, and progressively diminish the fire pheromone field until complete suppression is achieved.

3.6. Impact of Swarm Size on Fire Suppression Performance

This section analyzes how the number of robots in the swarm affects the overall fire suppression dynamics. Simulations were conducted with swarm sizes ranging from 10 to 50 agents under identical environmental and ignition conditions. Figure 8 illustrates the evolution of the number of burned cells over time for different swarm sizes.
The curves show a clear capacity effect: larger swarms burn less area per 100 ticks and suppress spread earlier. With 50 robots, the burned-per-100-ticks metric drops to near zero after roughly 300–400 ticks, indicating sustained containment. Swarms of 20–30 robots keep fires partially under control (plateaus around ticks 300–400), but residual spread persists and rises again by tick 600. With 10 robots, the burned area grows almost linearly, revealing insufficient capacity to counter the ignition rate. Notably, the 40-robot curve improves on the 20- and 30-robot runs early on but rises again after roughly tick 400, suggesting late-stage flare-ups due to payload cycles (return-to-base), pheromone evaporation, or agent congestion around residual clusters. In contrast, 50 robots cross a practical "suppression threshold," maintaining continuous pressure and preventing re-ignition.

3.7. Pheromone Field Evolution and Fire Suppression Dynamics

Figure 9 illustrates the evolution of the fire pheromone field ($\rho_{fire}$) and the coverage pheromone field ($\rho_{cov}$) for the scenario of Figure 8 at three simulation timestamps: 100, 200, and 300 ticks. These maps provide insight into how the swarm self-organizes and adapts its behavior during the fire containment process.
At tick 100, the $\rho_{fire}$ field shows high-intensity pheromone concentrations around the initial fire sources, where several robots have detected active flames and started the suppression process. The corresponding $\rho_{cov}$ field (Figure 9) begins to reveal the initial exploration paths, represented by thin blue traces indicating areas that have been visited or scanned by the robots. By tick 200, the $\rho_{fire}$ intensity spreads over a larger area but becomes more diffuse, indicating that the swarm has partially reduced the main fire clusters. The $\rho_{cov}$ field intensifies and extends in structured linear patterns, reflecting coordinated sweeps of the terrain as robots patrol along previously explored corridors. These elongated pheromone trails suggest that agents avoid already-covered areas while expanding their coverage outward.
At tick 300, the $\rho_{fire}$ field diminishes significantly in both magnitude and spatial spread, confirming that most fires have been extinguished. The remaining hotspots are small, localized regions with low pheromone intensity, representing isolated flare-ups that are quickly neutralized. The $\rho_{cov}$ map exhibits a wide and uniform distribution, demonstrating that the swarm achieved near-complete coverage of the region. This spatial saturation of the repulsive pheromone field ensures minimal redundancy in movement and prevents swarm congestion in previously visited zones.

4. Conclusions

The comparative analysis of the four algorithms (Heuristic, Lévy, Reinforcement Learning (RL), and the proposed HG-MADQN) reveals distinct behavioral patterns in their ability to contain fire propagation. The evaluation metric, the average number of burned cells per 100 ticks, shows how well each method stops the spread of fire over time. Among the tested methods, RL demonstrates adaptive decision-making capabilities, while HG-MADQN achieves superior performance through swarm cooperation and grid-based information sharing, leading to an emergent collective intelligence that suppresses fires early and efficiently. The results presented in Section 3 confirm that HG-MADQN successfully integrates reinforcement learning with heuristic coordination, resulting in efficient and reliable fire suppression.
The incorporation of a dual-pheromone mechanism proved essential for guiding swarm coordination. Specifically, the fire pheromone represents areas of active ignition, attracting nearby drones toward high-risk zones for suppression, while the exploration pheromone encourages dispersion into unexplored regions to detect new fires. The relative magnitudes of the weighting coefficients directly influence the collective behavior of the swarm: higher values of $W_{fire}$ guide agents toward aggressive convergence and faster fire suppression, whereas larger $W_{cov}$ values emphasize exploration by strengthening repulsion from covered regions. By appropriately tuning these weights, the system can shift between suppression-dominant and exploration-dominant behaviors, enabling adaptive coordination that balances rapid fire response with efficient area coverage. This biologically inspired dual-pheromone interaction establishes a self-organizing principle that supports adaptive, decentralized coordination without requiring centralized control.
In pure RL approaches, agent behavior depends entirely on the learned policy, which is highly sensitive to reward shaping, training conditions, and environmental assumptions. As a result, RL agents may struggle to generalize when fire dynamics, swarm density, or environmental constraints differ from those encountered during training. The heuristic component provides immediate, interpretable feedback from the environment and enables agents to respond quickly to newly detected fires, unexplored regions, or local congestion, even in situations where the learned policy is uncertain or suboptimal. The fusion of heuristic and learned actions allows the system to adapt across different fire configurations, swarm sizes, and environmental conditions more robustly than pure RL. When environmental dynamics change, heuristic guidance ensures reasonable baseline behavior.
Despite its strong performance, HG-MADQN also presents several limitations. The dual-pheromone mechanism, while effective in simulation, depends on carefully chosen evaporation and diffusion rates; improper calibration can lead to unstable swarm behavior or excessive clustering. A notable disadvantage of the proposed approach is the reliance on a shared pheromone map, which assumes reliable communication between all robots and the availability of a common reference frame for coordinate positions. This requirement may limit applicability in environments where communication is intermittent, severely constrained, or unavailable, such as remote wildfire regions with signal degradation due to terrain or smoke.
The current simulation does not explicitly model drone dynamics such as inertia, acceleration limits, wind disturbances, or aerodynamic constraints. Drone motion is represented through discrete grid transitions, which simplifies control and enables large-scale experimentation. This abstraction introduces a sim-to-real gap, as policies learned in an idealized environment may not directly transfer to real UAV systems operating under continuous dynamics and environmental uncertainty. Wind effects, actuator delays, payload-induced inertia, and localization noise could alter the effectiveness of both the heuristic guidance and the learned policy. Consequently, additional adaptation or retraining may be required when deploying the system on real hardware.
Future research will focus on extending the learning architecture to continuous action spaces and on testing the algorithm with real drone hardware to assess its robustness and transferability from simulation to real-world fire suppression scenarios.

Author Contributions

Conceptualization, A.D. and C.I.V.; methodology, C.I.V.; software, A.D.; validation, A.D. and C.I.V.; formal analysis, A.D.; investigation, C.I.V.; resources, A.D.; data curation, A.D.; writing—original draft preparation, A.D.; writing—review and editing, A.D. and C.I.V.; visualization, A.D. and C.I.V.; supervision, C.I.V.; project administration, A.D.; funding acquisition, C.I.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, Y.; Stanturf, J.A.; Goodrick, S.L. Trends in global wildfire potential in a changing climate. For. Ecol. Manag. 2010, 259, 685–697. [Google Scholar] [CrossRef]
  2. Nguyen, M.T.; Lee, S.W. Advancing Early Wildfire Detection: Integration of Vision–Language Models and Drone Platforms. Drones 2025, 9, 347. [Google Scholar] [CrossRef]
  3. Dutta, A.; Paul, A.; Chowdhury, A.; Saha, S.; Saha, S.; Kar, A. Drone Swarms in Fire Suppression Activities: A Conceptual Framework. Drones 2021, 5, 17. [Google Scholar] [CrossRef]
  4. Yu, B.; Yu, S.; Zhao, Y.; Wang, J.; Lai, R.; Lv, J.; Zhou, B. Intelligent Firefighting Technology for Drone Swarms with Multi-Sensor Integrated Path Planning: YOLOv8 Algorithm-Driven Fire Source Identification and Precision Deployment Strategy. Drones 2025, 9, 348. [Google Scholar] [CrossRef]
  5. Bushnaq, O.M.; Chaaban, A.; Al-Naffouri, T.Y. The Role of UAV-IoT Networks in Future Wildfire Detection. IEEE Internet Things J. 2021, 8, 16984–16999. [Google Scholar] [CrossRef]
  6. Altamimi, A.; Lagoa, C.; Borges, J.G.; McDill, M.E.; Andriotis, C.P.; Papakonstantinou, K.G. Large-Scale Wildfire Mitigation Through Deep Reinforcement Learning. Front. For. Glob. Change 2022, 5, 734330. [Google Scholar] [CrossRef]
  7. Teow, B.H.A.; Yakimenko, O. Contemplating Urban Operations Involving a UGV Swarm. In Proceedings of the 2018 International Conference on Control and Robots (ICCR), Hong Kong, China, 15–17 September 2018; IEEE: New York, NY, USA, 2018; pp. 35–45. [Google Scholar] [CrossRef]
  8. Wong, C.; Yang, E.; Yan, X.-T.; Gu, D. An overview of robotics and autonomous systems for harsh environments. In Proceedings of the 2017 23rd International Conference on Automation and Computing (ICAC), Huddersfield, UK, 7–8 September 2017; IEEE: New York, NY, USA, 2017; pp. 1–6. [Google Scholar] [CrossRef]
  9. Nguyen, L.V. Swarm Intelligence-Based Multi-Robotics: A Comprehensive Review. Appl. Math 2024, 4, 1192–1210. [Google Scholar] [CrossRef]
  10. Alsammak, A.I.L.H.; Mahmoud, M.A.; Gunasekaran, S.S.; Ahmed, A.N.; AlKilabi, M. Nature-Inspired Drone Swarming for Wildfires Suppression Considering Distributed Fire Spots and Energy Consumption. IEEE Access 2023, 11, 50962–50983. [Google Scholar] [CrossRef]
  11. Eberhart, R.; Kennedy, J. A new optimizer using particle swarm theory. In Proceedings of the MHS’95. Proceedings of the 6th International Symposium on Micro Machine and Human Science, Nagoya, Japan, 4–6 October 1995; IEEE: New York, NY, USA, 1995; pp. 39–43. [Google Scholar]
  12. Zaburdaev, V.; Denisov, S.; Klafter, J. Lévy walks. Rev. Modern Phys. 2015, 87, 483. [Google Scholar] [CrossRef]
  13. Lee, J.-H.; Ahn, C.W. Improving Energy Efficiency in Cooperative Foraging Swarm Robots Using Behavioral Model. In Proceedings of the Sixth International Conference on Bio-Inspired Computing: Theories and Applications, Penang, Malaysia, 27–29 September 2011; IEEE: New York, NY, USA, 2011; pp. 39–44. [Google Scholar] [CrossRef]
  14. Lee, J.-H.; Ahn, C.W.; An, J. A honey bee swarm-inspired cooperation algorithm for foraging swarm robots: An empirical analysis. In Proceedings of the 2013 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Wollongong, Australia, 9–12 July 2013; IEEE: New York, NY, USA, 2013; pp. 489–493. [Google Scholar] [CrossRef]
  15. John, J.; Sundaram, S. Genetic Algorithm-Based Routing and Scheduling for Wildfire Suppression Using a Team of UAVs. arXiv 2024, arXiv:2407.19162. [Google Scholar] [CrossRef]
  16. Zhou, Y.; Zhou, S.; Wang, M.; Chen, A. Multitarget Search Algorithm Using Swarm Robots in an Unknown 3D Mountain Environment. Appl. Sci. 2023, 13, 1969. [Google Scholar] [CrossRef]
  17. Ekechi, C.C.; Elfouly, T.; Alouani, A.; Khattab, T. A Survey on UAV Control with Multi-Agent Reinforcement Learning. Drones 2025, 9, 484. [Google Scholar] [CrossRef]
  18. Haksar, R.N.; Schwager, M. Distributed Deep Reinforcement Learning for Fighting Forest Fires with a Network of Aerial Robots. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: New York, NY, USA, 2018; pp. 1067–1074. [Google Scholar] [CrossRef]
  19. Collignon, M.; Perrusquía, A.; Tsourdos, A.; Guo, W. Search and Rescue Operations in Wildfires Using Unmanned Aerial Vehicles: A Multi-Agent Deep Reinforcement Learning Approach. Neurocomputing 2025, 653, 131211. [Google Scholar] [CrossRef]
  20. Kouzeghar, M.; Song, Y. Multi-Target Pursuit by a Decentralized Heterogeneous UAV Swarm Using Deep Multi-Agent Reinforcement Learning. Drones 2023, 7, 179. [Google Scholar] [CrossRef]
  21. Liu, Y.; Li, X.; Wang, J.; Wei, F.; Yang, J. Reinforcement-Learning-Based Multi-UAV Cooperative Search for Moving Targets in 3D Scenarios. Drones 2024, 8, 378. [Google Scholar] [CrossRef]
  22. Huang, S.; Sun, C.; Gong, J.; Pompili, D. Reinforcement learning–based task allocation and path-finding in multi-robot systems under environment uncertainty. Comput. Aided Civ. Infrastruct. Eng. 2025, 40, 3408–3429. [Google Scholar] [CrossRef]
  23. Li, S.; Li, L.; Lee, G.; Zhang, H. A Hybrid Search Algorithm for Swarm Robots Searching in an Unknown Environment. PLoS ONE 2014, 9, e111970. [Google Scholar] [CrossRef] [PubMed]
  24. Jin, Y.; Zhang, Y.; Yuan, J.; Zhang, X. Efficient Multi-agent Cooperative Navigation in Unknown Environments with Interlaced Deep Reinforcement Learning. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 17 April 2019; IEEE: New York, NY, USA, 2019; pp. 2897–2901. [Google Scholar] [CrossRef]
  25. Sun, H.; Jiang, H.; Zhang, L.; Wu, C.; Qian, S. Multi-robot hierarchical safe reinforcement learning autonomous decision-making strategy based on uniformly ultimate boundedness constraints. Sci. Rep. 2025, 15, 5990. [Google Scholar] [CrossRef]
  26. Renzaglia, A.; Dibangoye, J.; Le Doze, V.; Simonin, O. A Common Optimization Framework for Multi-Robot Exploration and Coverage in 3D Environments. J. Intell. Robot Syst. 2020, 100, 1453–1468. [Google Scholar] [CrossRef]
  27. Fang, Z.; Ma, T.; Huang, J.; Niu, Z.; Yang, F. Efficient Task Allocation in Multi-Agent Systems Using Reinforcement Learning and Genetic Algorithm. Appl. Sci. 2025, 15, 1905. [Google Scholar] [CrossRef]
  28. Aydin, B.; Selvi, E.; Tao, J.; Starek, M.J. Use of Fire-Extinguishing Balls for a Conceptual System of Drone-Assisted Wildfire Fighting. Drones 2019, 3, 17. [Google Scholar] [CrossRef]
  29. Allison, R.S.; Johnston, J.M.; Craig, G.; Jennings, S. Airborne Optical and Thermal Remote Sensing for Wildfire Detection and Monitoring. Sensors 2016, 16, 1310. [Google Scholar] [CrossRef] [PubMed]
  30. Wang, X.; Liu, C.; Liu, J.; Qin, X.; Wang, N.; Zhou, W. A cellular automata model for forest fire spreading simulation. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; IEEE: New York, NY, USA, 2016; pp. 1–6. [Google Scholar] [CrossRef]
  31. Karafyllidis, A.I.; Thanailakis, A. A model for predicting forest fire spreading using cellular automata. Ecol. Modell. 1997, 99, 87–97. [Google Scholar] [CrossRef]
Figure 1. A swarm of firefighting drones detecting multiple wildfire hotspots.
Figure 2. MADQN diagram.
Figure 3. MADQN training reward.
Figure 4. Evolution of the number of burned cells over time for the four tested algorithms (Heuristic, Lévy, RL, and HG-MADQN) and the study of λ.
Figure 5. Payload consumption of the swarm over time.
Figure 6. Number of missions successfully completed by the four algorithms.
Figure 7. (a) Initialization phase (all drones at base, no fires detected); (b) fire detection and pheromone deposition begin; (c) peak fire activity and cooperative suppression; (d) final extinguishing phase and residual coverage.
Figure 8. Evolution of the burned cells for different swarm sizes.
Figure 9. Evolution of the coverage pheromone field at t = 100 (a), t = 200 (c), t = 300 (e) and the fire pheromone field at t = 100 (b), t = 200 (d), and t = 300 (f).
Table 1. Main parameters.

| Symbol/Variable | Description | Value |
|---|---|---|
| N × N | Grid size (environment dimensions) | 300 × 300 |
| $N_a$ | Number of drones (agents) | 50 |
| Batch size | Mini-batch size for training | 512 |
| Learning rate | Optimizer learning rate | 0.001 |
| $W_{cov}$ | Coverage-repulsion weight | 1.0 |
| $W_{sep}$ | Inter-agent separation weight | 10 |
| $W_{fire}$ | Fire-attraction weight | 2.0 |
| $H_{new}$ | Exploration bonus (new cells) | 0.05 |
| $H_{old}$ | Penalty for revisiting explored areas | −0.05 |
| γ | Discount factor (MADQN) | 0.98 |
| $r_{detect}$ | Fire detection radius | 3 cells |
| $r_{ext}$ | Fire suppression radius | 2 cells |
| $R_{new}$ | Reward for discovering new cells in the local 5 × 5 observation window | 0.1 per new cell |
| $R_{fire}$ | Reward for detecting fire | 0.1 |
| $R_{collision}$ | Penalty for proximity-based collisions | 0.2 |
| $R_{overlap}$ | Penalty for visited places | 0.001 per cell |
| $R_{edge}$ | Penalty for proximity to map boundaries | 1 |
Table 2. Standard deviation of burned cells per 100 simulation ticks.

| Interval | Heuristic | HG-MADQN | Lévy | RL |
|---|---|---|---|---|
| 0–100 | 24.9 | 22.7 | 23.9 | 13.8 |
| 100–200 | 625.1 | 202.8 | 343.3 | 261.9 |
| 200–300 | 1014.3 | 360.6 | 642.8 | 665.3 |
| 300–400 | 1116.4 | 468.3 | 675.2 | 996.1 |
| 400–500 | 754.1 | 466.8 | 626.5 | 1103.3 |
| 500–600 | 621.8 | 457.8 | 511.5 | 629.5 |
