Article

Decentralized Q-Learning for Multi-UAV Post-Disaster Communication: A Robotarium-Based Evaluation Across Urban Environments

by Udhaya Mugil Damodarin *,†, Cristian Valenti *,†, Sergio Spanò, Riccardo La Cesa, Luca Di Nunzio and Gian Carlo Cardarilli

Department of Electronic Engineering, Tor Vergata University of Rome, Via del Politecnico 1, 00133 Rome, Italy
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2026, 15(1), 242; https://doi.org/10.3390/electronics15010242
Submission received: 26 November 2025 / Revised: 27 December 2025 / Accepted: 2 January 2026 / Published: 5 January 2026

Abstract

Large-scale disasters such as earthquakes and floods often cause the collapse of terrestrial communication networks, isolating affected communities and disrupting rescue coordination. Unmanned aerial vehicles (UAVs) can serve as rapid-deployment aerial relays to restore connectivity in such emergencies. This work presents a decentralized Q-learning framework in which each UAV operates as an independent agent that learns to maintain reliable two-hop links between mobile ground users. The framework integrates user mobility, UAV–user assignment, multi-UAV coordination, and failure tracking to enhance adaptability under dynamic conditions. The system is implemented and evaluated on the Robotarium platform, with propagation modeled using the Al-Hourani air-to-ground path loss formulation. Experiments conducted across Suburban, Dense Urban, and Highrise Urban environments show throughput gains of up to 20% compared with random placement baselines while maintaining failure rates below 5%. These results demonstrate that decentralized learning offers a scalable and resilient foundation for UAV-assisted emergency communication in environments where conventional infrastructure is unavailable.

1. Introduction

A significant obstacle to emergency management and public safety is the breakdown of major communication systems during large-scale disasters. These terrestrial networks can be instantly disrupted by natural disasters, such as earthquakes, floods, and wildfires, disconnecting entire populations from any means of communication [1,2]. Among other possibilities, the mobility, rapid deployment, and aerial coverage capabilities of unmanned aerial vehicles (UAVs) make them viable platforms for sustained, temporary communication support in affected environments [3,4]. Recent developments in millimeter-wave and 5G communication technologies have further expanded the feasibility of aerial relays and high-capacity UAV links [5].
Advances in autonomous control have enabled UAVs to operate as aerial relays by autonomously determining their positions using embedded intelligence, thereby increasing network performance [6,7]. In this sense, Q-learning [8], a form of Reinforcement Learning (RL), provides a framework for modeling UAVs as autonomous agents that learn through interaction with their environment, enabling decentralized, experience-based decision-making. Applications of RL in UAV communication include resource allocation [7], multi-hop routing [9], path planning [10,11], and spectrum sharing in cooperative networks [12]. UAVs can also be trained to optimize their trajectories, expand coverage, and improve relay quality [13,14]. However, existing methods often rely on simplified communication models or centralized decision-making approaches, which are feasible only in small-scale deployments, limiting their scalability and realism in urban emergency contexts [3].
The proposed system is a decentralized UAV relay framework based on independent Q-learning agents, trained and evaluated using the Robotarium simulation platform [15,16]. This study extends our previous work on UAV-based Q-learning communication networks [17], which focused on a single-relay UAV, by introducing a decentralized multi-UAV architecture capable of cooperative learning in realistic air-to-ground propagation environments. The operational scenario depicts a post-disaster urban environment in which terrestrial infrastructure is assumed to be non-functional and ground users require on-demand, autonomous communication support. Each UAV operates as a distributed Reinforcement Learning agent, maintaining its Q-table to learn an optimal positioning policy. The objective is to maximize end-to-end throughput between dynamically assigned user pairs by optimizing the UAVs’ relay positions. Communication is structured as a two-hop uplink and downlink, where data is transmitted from the source user to the UAV (uplink) and from the UAV to the target user (downlink) [3,18,19,20]. Channel quality is evaluated using the Al-Hourani air-to-ground path loss model, which integrates probabilistic line-of-sight estimation, UAV altitude effects, and urban environment attenuation [18,21]. Unlike centralized coordination schemes, our framework enables decentralized multi-agent learning based solely on local observations and reward feedback, ensuring scalability and robustness in post-disaster conditions [12,13].
In summary, although UAV-assisted communication using Reinforcement Learning has been widely explored, most prior studies overlook crucial factors such as user mobility, realistic propagation environments, and decentralized coordination. To address these limitations, this paper proposes a fully decentralized Q-learning framework in which UAVs act as independent agents capable of scalable operation without central control. The framework incorporates user mobility and evaluates performance under realistic urban propagation conditions using the Al-Hourani air-to-ground path-loss model across Suburban, Dense Urban, and Highrise Urban scenarios. The proposed approach also integrates dynamic UAV–user assignment and decentralized coordination mechanisms to maintain reliable connectivity in rapidly changing post-disaster environments. Finally, this work bridges theoretical modeling and practical deployment by implementing it on the Robotarium platform, contributing to the broader field of Reinforcement Learning-based UAV communication systems discussed in recent comprehensive reviews [22].

2. Materials and Methods

This section describes the methodology and experimental framework used to evaluate the proposed decentralized Q-learning approach for UAV-assisted communication. It is organized into three parts. Section 2.1 introduces the Robotarium platform and explains its suitability for implementing multi-agent learning with safety guarantees. Section 2.2 presents the Reinforcement Learning framework, including the Q-learning algorithm, ϵ-greedy policy, and reward formulation. Section 2.3 details the system design, environmental modeling based on the Al-Hourani air-to-ground channel, and the communication and user-assignment procedures used in the experiments.

2.1. Robotarium Platform

The experiments were conducted using the Robotarium platform developed at Georgia Institute of Technology [15]. The Robotarium is a remotely accessible testbed that supports both simulation and physical multi-agent experiments, and it has recently been extended as a standardized benchmark for multi-agent reinforcement learning to improve reproducibility and comparability [23].
Compared to general-purpose robotics simulators (e.g., Gazebo or AirSim) that primarily emphasize high-fidelity three-dimensional (3D) physics and detailed flight dynamics, the Robotarium is designed for standardized evaluation of multi-agent control and learning algorithms under consistent experimental conditions. This makes it particularly well suited for benchmarking decentralized learning policies in a controlled and reproducible environment. A key feature of the Robotarium is its built-in safety layer based on Safety Barrier Certificates (SBCs), which enforce collision avoidance and boundary compliance during multi-agent operation [24,25]. This safety layer was enabled throughout the simulations to ensure that UAV agents do not overlap or exit the workspace while learning.
In this study, the Robotarium Simulation API was used to emulate UAV behavior in a two-dimensional (2D) aerial plane representing a disaster-affected region. The workspace is bounded within normalized coordinates of [−1.6, 1.6] on the X-axis and [−1.0, 1.0] on the Y-axis, and discretized into a 5 × 8 grid, where each cell corresponds to a potential UAV position. This grid represents an effective real-world domain of approximately 3.2 km × 2.0 km. The learning framework operates at a high decision-making level for relay positioning, while low-level flight control, stabilization, and altitude regulation are assumed to be handled by embedded onboard UAV controllers, as in typical commercial UAV platforms. Accordingly, UAV altitude is treated as a fixed system parameter in the air-to-ground communication model rather than a control variable.
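As a concrete illustration, the following minimal Python sketch shows one way to map between the 5 × 8 grid cells and the normalized workspace coordinates. The row-major indexing, cell-centre convention, and function names are assumptions made for illustration; they are not part of the Robotarium API or the authors' implementation.

```python
import numpy as np

# Workspace bounds and grid resolution as described above (Robotarium normalized units).
X_MIN, X_MAX = -1.6, 1.6   # 8 columns of 0.4 units each (~400 m per cell)
Y_MIN, Y_MAX = -1.0, 1.0   # 5 rows of 0.4 units each
N_ROWS, N_COLS = 5, 8

def cell_to_center(row: int, col: int) -> np.ndarray:
    """Return the (x, y) centre of a grid cell in workspace coordinates."""
    cell_w = (X_MAX - X_MIN) / N_COLS
    cell_h = (Y_MAX - Y_MIN) / N_ROWS
    x = X_MIN + (col + 0.5) * cell_w
    y = Y_MIN + (row + 0.5) * cell_h
    return np.array([x, y])

def point_to_cell(x: float, y: float) -> tuple:
    """Map a workspace point to the (row, col) of the containing cell."""
    col = min(int((x - X_MIN) / ((X_MAX - X_MIN) / N_COLS)), N_COLS - 1)
    row = min(int((y - Y_MIN) / ((Y_MAX - Y_MIN) / N_ROWS)), N_ROWS - 1)
    return row, col
```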
Figure 1 illustrates the Robotarium setup used in this study. Mobile users are randomly distributed across the grid at the start of each episode, while UAV agents are initialized at the centers of individual grid cells.

2.2. Reinforcement Learning Framework

Reinforcement Learning (RL) provides a model-free framework in which agents learn by interacting with their environment and receiving scalar feedback in the form of rewards [8]. In this study, UAVs act as independent agents that autonomously determine their optimal positions to improve network connectivity and throughput. Among RL algorithms, tabular Q-learning was chosen because it is well-suited for discrete state–action spaces, computationally efficient, and easily implementable on the Robotarium platform.

2.2.1. Q-Learning Algorithm

We adopt tabular Q-learning as the underlying learning rule. Each UAV maintains an action-value table Q ( s , a ) that estimates the expected discounted cumulative reward for taking action a in state s and following the current policy thereafter. This choice also aligns with practical deployment constraints, since reinforcement learning training and execution can be computationally intensive and energy-demanding, motivating lightweight implementations and efficiency-oriented design choices [26]. The Q-table is updated according to the standard Bellman equation:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr],$$
where α is the learning rate, γ is the discount factor, r is the immediate reward, and s′ is the next state observed after executing action a. The algorithm iteratively updates Q(s, a) as UAVs interact with the simulated environment, gradually improving their decision-making policies based on experience.
The learning rate α and discount factor γ regulate the update dynamics of tabular Q-learning. While α determines how rapidly new observations influence existing Q-values, γ controls the emphasis placed on future rewards. In the UAV relay positioning context, this discounting directly impacts the trade-off between short-term throughput gains and long-term positioning stability.
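For concreteness, a minimal sketch of the tabular update of Equation (1) is shown below, using the learning rate and discount factor later listed in Table 3 (α = 0.2, γ = 0.8). The array layout and function name are illustrative choices rather than the authors' implementation.

```python
import numpy as np

N_STATES, N_ACTIONS = 40, 5          # 5 x 8 grid cells, 5 moves (up/down/left/right/stay)
ALPHA, GAMMA = 0.2, 0.8              # learning rate and discount factor (Table 3)

Q = np.zeros((N_STATES, N_ACTIONS))  # one table per UAV agent

def q_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int) -> None:
    """Apply the tabular Bellman update of Equation (1) in place."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
```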

2.2.2. State–Action Representation

The environment is discretized into a 5 × 8 grid that defines the possible UAV locations within the 2D simulation workspace. Each grid cell corresponds to a discrete state s. At any given time, a UAV can select one of five possible actions: move up, down, left, right, or stay. This yields a finite state–action space of 40 positions per UAV and ensures efficient tabular learning without excessive computational complexity.

2.2.3. Exploration–Exploitation Policy

To balance exploration and exploitation, a step-based ϵ-greedy policy was implemented. UAVs select random actions with probability ϵ and the best-known action otherwise. The exploration rate follows:
$$\epsilon_t = \begin{cases} 1, & t \le 0.4E, \\ 0, & t > 0.4E, \end{cases}$$
where E = 60 denotes the total training episodes. Thus the first 40% of episodes are purely exploratory, allowing UAVs to freely sample the environment, while the remaining episodes exploit the learned Q-values. This design prevents premature convergence to suboptimal policies and ensures coverage of the state–action space during training.
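A minimal sketch of this step-based ϵ-greedy selection is given below. The random-number handling and function names are illustrative assumptions; only the schedule itself (pure exploration for the first 40% of episodes) follows the description above.

```python
import numpy as np

E_TOTAL = 60                     # total training episodes
rng = np.random.default_rng(0)   # seeded generator for reproducibility (assumption)

def epsilon(episode: int, total_episodes: int = E_TOTAL) -> float:
    """Step schedule: fully exploratory for the first 40% of episodes, greedy afterwards."""
    return 1.0 if episode <= 0.4 * total_episodes else 0.0

def select_action(Q: np.ndarray, s: int, episode: int) -> int:
    """Epsilon-greedy selection over the five movement actions."""
    if rng.random() < epsilon(episode):
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: best-known action
```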

2.2.4. Reward Function

The UAV reward is defined to balance throughput maximization and communication reliability:
$$R = \alpha_r \cdot \frac{C_{\mathrm{avg}}}{C_{\max}} - \beta_r \cdot \frac{N_{\mathrm{fail}}}{N_{\mathrm{assign}}},$$
where C_avg is the average bottleneck throughput, C_max is the upper bound on capacity, N_fail is the number of failed user pairs, and N_assign is the total number of assigned pairs. The weighting parameters (α_r, β_r) = (1.5, 0.5) were determined empirically to emphasize throughput performance while moderately penalizing failed links. This formulation provides a scalar measure directly linking physical-layer outcomes to the reinforcement signal used in Q-learning updates. Although UAV actions are indirectly coupled through dynamic user–UAV assignments, each UAV receives a local reward computed only from its currently assigned user pairs. The reward formulation includes an explicit penalty for communication failures based on a minimum throughput threshold, which provides consistent negative feedback when relay performance degrades and helps stabilize learning under dynamic repositioning. Empirically, the learning curves exhibit stable reward evolution and bounded failure rates across all evaluated scenarios.
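The reward of Equation (2) can be computed per UAV from its currently assigned pairs, as in the sketch below. The handling of a UAV with no assigned pairs (returning zero) and the argument names are assumptions made for illustration.

```python
def uav_reward(capacities_bps: list,
               c_max_bps: float,
               threshold_bps: float = 256e3,
               alpha_r: float = 1.5,
               beta_r: float = 0.5) -> float:
    """Scalar reward of Equation (2) for one UAV's currently assigned user pairs.

    capacities_bps : bottleneck throughputs (bit/s) of the pairs served by this UAV.
    c_max_bps      : upper bound on capacity used for normalization.
    """
    n_assign = len(capacities_bps)
    if n_assign == 0:
        return 0.0                                   # no assigned pairs -> no feedback (assumption)
    c_avg = sum(capacities_bps) / n_assign           # average bottleneck throughput
    n_fail = sum(c < threshold_bps for c in capacities_bps)
    return alpha_r * (c_avg / c_max_bps) - beta_r * (n_fail / n_assign)
```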

2.2.5. Q-Learning Implementation in the Multi-UAV Network

In the decentralized multi-UAV setting, each UAV i maintains its own Q-table Q i ( s , a ) and acts as an autonomous agent. The agents do not share Q-values or control policies, relying solely on local observations and individual reward feedback. The per-agent workflow for each iteration is summarized as:
1. Observe the current discrete state s_i (UAV grid position).
2. Select action a_i using the ϵ-greedy policy described in Section 2.2.3.
3. Execute a_i, observe the next state s_i′, and measure link outcomes for assigned user pairs as described in Section 2.3.3.
4. Aggregate the communication metrics into the scalar immediate reward r_i using Equation (2).
5. Update the corresponding Q_i(s_i, a_i) entry using the Bellman update of Equation (1).
Logged observables.
For analysis and reproducibility, each UAV records per iteration the executed (s_i, a_i) pair, the next state s_i′, the received reward r_i, the uplink and downlink capacities (used to compute bottleneck rates), and a binary indicator for communication failure (1 if R_bottleneck < 256 kbps). Because each UAV independently maintains and updates its own Q-table, the overall system achieves scalable coordination without a centralized controller. Coordination emerges indirectly through shared environmental feedback and reward-driven adaptation.
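Putting the steps together, the sketch below outlines one per-UAV iteration using the hypothetical helpers from the earlier sketches (select_action, uav_reward, q_update). The env object, its move and bottleneck_rates methods, and its c_max attribute are placeholders standing in for the simulation environment; they are assumptions, not an actual API.

```python
def agent_step(Q, s, episode, env):
    """One per-UAV iteration following steps 1-5 above (environment calls are placeholders)."""
    a = select_action(Q, s, episode)                  # step 2: epsilon-greedy action
    s_next = env.move(s, a)                           # step 3: execute action, observe next state
    capacities = env.bottleneck_rates(s_next)         # step 3: link outcomes for assigned pairs
    r = uav_reward(capacities, c_max_bps=env.c_max)   # step 4: scalar reward, Equation (2)
    q_update(Q, s, a, r, s_next)                      # step 5: Bellman update, Equation (1)
    log = {                                           # logged observables for this iteration
        "state": s, "action": a, "next_state": s_next, "reward": r,
        "failure": any(c < 256e3 for c in capacities),
    }
    return s_next, log
```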
Table 1 compares fixed and scheduled discount factor configurations. Fixed γ consistently achieves higher average reward across all scenarios in Suburban and Dense Urban environments. In Highrise Urban scenarios, performance remains limited for both configurations due to severe non-line-of-sight attenuation; however, fixed γ still provides higher mean reward. While γ decay occasionally reduces variability, it does not improve average performance. Based on these observations, fixed γ is adopted for all reported results.

2.3. System Design

The system is designed for environments without terrestrial communication infrastructure, where UAVs act as aerial relays to re-establish connectivity between mobile ground users. The design integrates environmental modeling, channel evaluation, and UAV–user assignment, as described below.

2.3.1. Environment Modeling

Figure 2 illustrates the UAV-assisted two-hop communication setup across three urban morphologies: Highrise Urban, Dense Urban, and Suburban. These configurations correspond to the environments modeled using the path loss formulation.
The communication environment is simulated using the Al-Hourani air-to-ground path loss model [18], which was parameterized for the same three urban categories. Each morphology defines a unique propagation profile derived from the Al-Hourani model. The line-of-sight probability P LoS ( θ ) and excess attenuation factor η vary according to building density and height. For every grid cell, precomputed lookup tables provide the corresponding path loss value based on UAV altitude (h) and horizontal distance (d). During training, these values are retrieved in real time to compute signal strength and reward feedback, enabling environment-aware learning. The total path loss is expressed as:
$$PL(d,h) = P_{\mathrm{LoS}}(\theta)\, PL_{\mathrm{LoS}} + \bigl(1 - P_{\mathrm{LoS}}(\theta)\bigr)\, PL_{\mathrm{NLoS}} + \eta,$$
where θ = arctan(h/d) is the elevation angle, and η is an excess-attenuation factor representing environment-specific losses.
In this study, the parameters (a_1, b_1, a_2, b_2, c, d, η) were taken from [18,21] and adapted to fit the Robotarium grid configuration using precomputed path loss lookup tables generated for each environment type. These tables were accessed during simulation to efficiently model realistic propagation conditions.
Each environment represents a distinct level of signal obstruction and LoS probability:
  • Suburban: Sparse obstruction, high LoS probability, small η .
  • Dense Urban: Moderate obstruction, lower LoS probability, moderate η .
  • Highrise Urban: Severe blockage and reflection, high η , low LoS probability.
This variation ensures that the same UAV movement results in different received powers and learning rewards, allowing UAVs to adapt to propagation challenges in each environment.
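The sketch below illustrates one common way to instantiate such an environment-dependent air-to-ground model: a sigmoid line-of-sight probability combined with LoS/NLoS excess losses added to the free-space path loss described in Section 2.3.2. The numeric parameters are commonly cited illustrative values in the spirit of [18,21]; they are not the fitted values or the lookup tables actually used in this study, and the single-η structure of the paper's formulation is approximated here by per-component excess losses.

```python
import math

C = 3e8      # speed of light (m/s)
F = 2.1e9    # carrier frequency (Hz), see Table 3

# Illustrative (a, b) sigmoid parameters and LoS/NLoS excess losses (dB) per environment.
ENV_PARAMS = {
    "suburban":    {"a": 4.88,  "b": 0.43, "eta_los": 0.1, "eta_nlos": 21.0},
    "dense_urban": {"a": 12.08, "b": 0.11, "eta_los": 1.6, "eta_nlos": 23.0},
    "highrise":    {"a": 27.23, "b": 0.08, "eta_los": 2.3, "eta_nlos": 34.0},
}

def fspl_db(d_slant_m: float) -> float:
    """Free-space path loss over the slant distance (dB)."""
    return 20.0 * math.log10(4.0 * math.pi * F * d_slant_m / C)

def p_los(theta_deg: float, a: float, b: float) -> float:
    """Sigmoid LoS probability as a function of the elevation angle (degrees)."""
    return 1.0 / (1.0 + a * math.exp(-b * (theta_deg - a)))

def path_loss_db(d_ground_m: float, h_m: float, env: str) -> float:
    """Environment-dependent air-to-ground path loss (dB), LoS/NLoS mixture."""
    p = ENV_PARAMS[env]
    d_slant = math.hypot(d_ground_m, h_m)
    theta = math.degrees(math.atan2(h_m, d_ground_m))
    pl = p_los(theta, p["a"], p["b"])
    pl_los = fspl_db(d_slant) + p["eta_los"]     # LoS component
    pl_nlos = fspl_db(d_slant) + p["eta_nlos"]   # NLoS component
    return pl * pl_los + (1.0 - pl) * pl_nlos
```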

2.3.2. Communication Channel Modeling

In the absence of obstacles, the baseline propagation between UAVs and ground users follows the Free-Space Path Loss (FSPL) model:
$$PL_{\mathrm{FSPL}}(d,h) = 20 \log_{10}\!\left(\frac{4\pi f d}{c}\right),$$
where f is the carrier frequency (Hz), d the slant distance between UAV and user (m), and c = 3 × 10⁸ m/s is the speed of light.
The Al-Hourani air-to-ground model extends this formulation by introducing probabilistic Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) components to account for the effects of building density and elevation angle in urban environments [18]. Uplink and downlink path losses are computed for each UAV–user link using the Al-Hourani air-to-ground model [18,21], and the corresponding received signal powers are calculated as:
$$P_{r,\mathrm{UAV}} = P_{t,\mathrm{user}} - PL_{\mathrm{UL}}(d,h), \qquad P_{r,\mathrm{user}} = P_{t,\mathrm{UAV}} - PL_{\mathrm{DL}}(d,h),$$
where P_t,user and P_t,UAV denote the transmit powers of the user and UAV, respectively, and PL_UL and PL_DL represent the uplink and downlink path losses in dB. All power values are expressed in dBm and converted to watts before capacity estimation.
The communication rate is then obtained using the Shannon capacity formula:
$$R = B \log_2\!\left(1 + \frac{P_r}{N_0}\right),$$
where B = 20 MHz is the system bandwidth, k = 1.38 × 10⁻²³ J/K is the Boltzmann constant, and N_0 = kTB is the thermal noise power at temperature T = 290 K.
The overall two-hop link performance is represented by the bottleneck throughput:
$$R_{\mathrm{bottleneck}} = \min(R_{\mathrm{UL}}, R_{\mathrm{DL}}),$$
which determines the end-to-end link capacity between each user pair. Any bottleneck throughput below 256 kbps is classified as a communication failure and contributes to the UAV's negative reward. These dynamic throughput variations provide the feedback signal that guides Q-learning convergence across different urban environments, and the Al-Hourani model parameters ensure that variations in building density directly affect P_r and, consequently, the reward signal used by each UAV during Q-table updates.
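A compact sketch of this link-budget and capacity chain is given below: received power as transmit power minus path loss (in dBm), conversion to watts, the Shannon rate per hop, and the two-hop bottleneck. System parameters follow Table 3; the function names and structure are illustrative rather than the authors' implementation.

```python
import math

B = 20e6                 # system bandwidth (Hz)
K_BOLTZ = 1.38e-23       # Boltzmann constant (J/K)
T = 290.0                # noise temperature (K)
N0_W = K_BOLTZ * T * B   # thermal noise power (W)

P_TX_UAV_DBM = 30.0      # 1 W UAV transmit power
P_TX_USER_DBM = 14.77    # ~0.03 W user transmit power

def dbm_to_watt(p_dbm: float) -> float:
    """Convert a power value from dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

def link_rate_bps(p_tx_dbm: float, path_loss_db: float) -> float:
    """Shannon rate (bit/s) of a single hop given transmit power and path loss."""
    p_rx_w = dbm_to_watt(p_tx_dbm - path_loss_db)   # received power converted to watts
    return B * math.log2(1.0 + p_rx_w / N0_W)

def bottleneck_rate_bps(pl_ul_db: float, pl_dl_db: float) -> float:
    """Two-hop bottleneck R_bottleneck = min(R_UL, R_DL); below 256 kbps counts as a failure."""
    r_ul = link_rate_bps(P_TX_USER_DBM, pl_ul_db)   # uplink: user -> UAV
    r_dl = link_rate_bps(P_TX_UAV_DBM, pl_dl_db)    # downlink: UAV -> destination user
    return min(r_ul, r_dl)
```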

2.3.3. User Pairing and UAV Assignment

At the start of each episode, ground users are randomly paired to represent dynamic and unpredictable post-disaster communication demands, such as message relaying or rescue coordination [27,28]. Each pair forms a two-hop relay link: uplink (source user to UAV) and downlink (UAV to destination user) [3,18,19].
For each user pair, the serving UAV is selected based on the evaluated uplink and downlink link qualities computed from the air-to-ground path-loss model. The combined metric, expressed in decibels, is computed as:
$$PL_{\mathrm{total}} = PL_{\mathrm{UL}} + PL_{\mathrm{DL}},$$
where PL_UL and PL_DL are determined from the Al-Hourani air-to-ground model (Section 2.3.2).
If the resulting bottleneck throughput falls below 256 kbps, the communication is classified as a failure for that iteration. The success or failure outcomes of these assignments directly affect the scalar rewards in Equation (2), linking network performance to Reinforcement Learning.
This decentralized assignment mechanism enables adaptive coverage and scalability, allowing UAVs to cooperatively adjust their positions based on independent learning feedback without explicit inter-agent communication [29,30].
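As an illustration, the sketch below assigns a user pair to the UAV with the smallest combined uplink-plus-downlink path loss. The minimum-PL_total selection rule, the reuse of the path_loss_db helper from the earlier sketch, and the argument names are assumptions consistent with the combined metric described above, not a verbatim description of the authors' assignment procedure.

```python
import math

def assign_uav(pair_positions, uav_positions, env, h_m=100.0):
    """Pick the serving UAV for one user pair by minimizing PL_total = PL_UL + PL_DL.

    pair_positions : ((x, y) of source user, (x, y) of destination user), in metres.
    uav_positions  : list of (x, y) UAV positions, in metres.
    env            : environment key for the path_loss_db helper sketched in Section 2.3.1.
    h_m            : UAV altitude (0.1 km, Table 3).
    """
    src, dst = pair_positions
    best_uav, best_pl = None, float("inf")
    for i, uav in enumerate(uav_positions):
        d_ul = math.dist(src, uav)          # horizontal distance source user -> UAV
        d_dl = math.dist(uav, dst)          # horizontal distance UAV -> destination user
        pl_total = path_loss_db(d_ul, h_m, env) + path_loss_db(d_dl, h_m, env)
        if pl_total < best_pl:
            best_uav, best_pl = i, pl_total
    return best_uav, best_pl
```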

3. Results

This section presents the evaluation of UAVs operating as communication relays to link ground users under different environmental scenarios. The proposed UAV coordination framework was implemented over 60 episodes using a grid-based environment and Q-learning control. Key performance metrics include the per-UAV reward trace, average bottleneck throughput, and user-pair failure rate.
Table 2 summarizes the simulation setup. Each 5 × 8 grid cell represents a 400 m × 400 m area, yielding an effective domain of 3.2 km × 2.0 km. The framework was evaluated across three environment types, namely, Suburban, Dense Urban, and Highrise Urban, and three load scenarios to assess scalability under increasing demand.
Table 3 lists the learning and system parameters used in the Q-learning framework. Each training episode consists of 100 iterations, and 60 episodes are simulated per scenario. These parameters are used consistently across all experiments. UAVs operate at a fixed altitude of 0.1 km and transmit at 1 W, while ground users transmit at 0.03 W. The carrier frequency of 2.1 GHz and bandwidth of 20 MHz correspond to typical LTE/5G mid-band deployments. A minimum throughput threshold of 256 kbps is enforced to represent the lower bound required for reliable voice and basic data services in emergency communication scenarios.

Simulation Results

The performance of the proposed decentralized Q-learning framework was evaluated across three UAV-to-user configurations in Suburban, Dense Urban, and Highrise Urban environments. Each configuration was simulated for 60 episodes, during which the UAVs autonomously adjusted their positions to improve communication quality based on continuous reward feedback. Three key performance metrics were analyzed: (i) the average reward per episode, indicating the rate and stability of learning convergence; (ii) the bottleneck throughput, defined as the minimum two-hop data rate between the UAV and its assigned user pair; and (iii) the failure rate, representing the percentage of user pairs whose bottleneck capacity falls below 256 kbps. These metrics were evaluated across three network scales, S1 (1 UAV/6 users), S2 (3 UAVs/12 users), and S3 (5 UAVs/20 users), to assess the scalability and robustness under varying network loads.
As shown in Figure 3, the UAVs achieved rapid convergence and stable learning behavior in the Suburban environment. The average reward quickly rose and stabilized above 8, indicating efficient exploration and placement. Throughput values remained around 23 Mbps across all scenarios, while failure rates consistently dropped below 2%. These results demonstrate that the framework performs reliably in open environments with minimal signal obstruction.
The Dense Urban environment introduces greater variability due to propagation losses and obstruction, as illustrated in Figure 4. During early training episodes, signal reflection and blockage caused larger reward fluctuations. The UAVs gradually adapted, with average rewards stabilizing between 5 and 7 after 25–30 episodes. Throughput improved to approximately 20 Mbps, and failure rates declined to 3–4%. These results indicate that even under moderate interference, the decentralized Q-learning framework converges effectively through autonomous decision-making.
The Highrise Urban environment (Figure 5) posed the greatest challenge due to severe non-line-of-sight conditions and high attenuation. During early episodes, negative rewards and high failure ratios were observed, particularly in S1 and S2. After roughly 25 episodes, the UAVs began identifying stable communication positions, improving throughput from 13 to 17 Mbps and reducing failure rates from nearly 50% to under 20%. These results indicate that the learning policy remains resilient even under harsh propagation conditions.
Table 4 reports the average throughput, reward, and failure rate obtained in 60 training episodes for all environments and scenarios. In Suburban settings, the framework achieved high throughput of 23 Mbps with minimal failure rates, showing that learning converges quickly in open areas. Performance in Dense Urban environments was lower due to signal obstruction; however, adding more UAVs (S2, S3) improved reliability, reducing failure rates to below 3%. Highrise Urban conditions presented the most challenging case, with throughput limited to around 15 Mbps and slower reward convergence. Nevertheless, multi-UAV configurations again enhanced resilience by lowering failures. Overall, the results demonstrate that the proposed decentralized Q-learning framework maintains stable performance across environments, scales with increasing numbers of UAVs, and adapts effectively to urban complexity.
Across all environments, the proposed Q-learning framework consistently outperformed the random UAV-placement baseline. Compared with random positioning, the framework achieved throughput gains of approximately 18.9% in Suburban, 9.5% in Dense Urban, and 7.4% in Highrise Urban scenarios. In Suburban and Dense Urban environments, these gains were accompanied by low failure rates below 5%, whereas higher failure rates were observed in Highrise Urban scenarios due to severe non-line-of-sight attenuation. These results highlight the effectiveness of decentralized learning for communication-aware UAV positioning across diverse propagation environments.

4. Comparison with State-of-the-Art Methods

Quantitative benchmarking across UAV-assisted communication studies is often difficult to report fairly because existing works differ substantially in system assumptions (e.g., centralized versus decentralized control, static versus mobile users), channel and mobility models, UAV dynamics, learning architectures, and optimization objectives. As a result, direct numerical comparison across heterogeneous experimental setups may be misleading rather than representative of true relative performance. For this reason, we provide a structured system-level comparison in Table 5 and focus our quantitative evaluation on controlled baselines implemented under identical assumptions, channel models, and simulation settings.
As shown in Table 5, existing works address only subsets of the challenges involved in UAV-assisted communication. Cui et al. [7] applied a Multi-Agent Reinforcement Learning (MARL) approach that enabled multi-agent learning among UAVs, but their framework assumed static users and lacked decentralized control, UAV-user assignment, and failure tracking. Sharvari et al. [9] introduced a Q-learning method for UAV routing, yet the design was limited to single-UAV operation without mobility, assignment strategies, or scalability features. Rizvi et al. [10] advanced MARL by employing action masking, which improved training efficiency and supported multi-UAV scenarios, but the framework did not incorporate user mobility or mechanisms to handle connection failures.
In contrast, the framework proposed in this study integrates the key features that earlier works considered separately. It employs independent Q-learning agents to achieve decentralized decision-making, explicitly accounts for user mobility, and integrates UAV–user assignment in a multi-UAV setting. Furthermore, systematic failure tracking is incorporated to quantify reliability under dynamic network conditions. Collectively, these features support scalable and practical post-disaster communication scenarios while keeping the learning mechanism computationally lightweight.

5. Discussion

The simulation results confirm that decentralized tabular Q-learning can provide effective communication-aware positioning for UAV-assisted post-disaster connectivity. Across Suburban, Dense Urban, and Highrise Urban environments, the proposed framework achieves improved bottleneck throughput and reduced link failures compared with baseline placement strategies under user mobility and probabilistic air-to-ground channel conditions. These results indicate that decentralized learning can maintain consistent performance benefits across the evaluated UAV densities without relying on centralized coordination.
A key advantage of the proposed approach is its fully decentralized architecture. Unlike centralized designs that require global state information, reliable backhaul links, and a central controller, each UAV in the proposed system learns positioning decisions based on local observations. This property is particularly relevant in post-disaster scenarios where terrestrial infrastructure may be damaged or unavailable, and centralized coordination can become difficult to sustain or may introduce a single point of failure. The use of a large-scale air-to-ground propagation model enables the evaluation of positioning decisions under environment-dependent path loss conditions. In addition, the Robotarium platform provides a controlled multi-agent environment with built-in safety enforcement through barrier certificates, enabling reproducible evaluation of decentralized learning policies under collision-free operation.
Despite these strengths, several limitations remain. The discretized grid restricts positioning to a finite set of candidate locations, and UAV altitude is treated as a fixed parameter rather than an optimization variable. Moreover, the current system model does not explicitly incorporate inter-UAV interference, detailed multi-user medium-access contention, or explicit energy depletion. These simplifications are adopted to preserve tractability and isolate the core contribution of decentralized, communication-aware positioning, and the results should therefore be interpreted as system-level trends under the adopted assumptions.
Future work could relax the current assumptions by enabling variable altitude, modeling energy depletion, and incorporating interference and medium-access effects. For scenarios that require continuous state–action spaces, function approximation or deep reinforcement learning (DRL) may be considered, although this increases training complexity and may affect stability. Finally, hardware-in-the-loop experiments or field trials would help assess robustness under realistic operational conditions.

Author Contributions

Conceptualization, U.M.D. and C.V.; methodology, U.M.D., C.V. and S.S.; software, U.M.D. and C.V.; validation, U.M.D., C.V., R.L.C. and S.S.; formal analysis, S.S., R.L.C. and L.D.N.; investigation, S.S. and L.D.N.; resources, L.D.N. and G.C.C.; data curation, U.M.D., C.V. and S.S.; writing—original draft preparation, U.M.D. and C.V.; writing—review and editing, U.M.D., C.V. and S.S.; visualization, U.M.D. and C.V.; supervision, L.D.N. and G.C.C.; project administration, G.C.C.; funding acquisition, G.C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by: Project ECS 0000024 Rome Technopole, CUP B83C22002820006, in the frame of the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5—Call for tender No. 3277 of 30 December 2021 of the Italian Ministry of University and Research—funded by the European Union—NextGenerationEU. Spoke 1 “FutureHPC & BigData” of the Italian Research Center on High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR Missione 4 Componente 2 Investimento 1.4: Potenziamento strutture di ricerca e creazione di “campioni nazionali” di R&S (M4C2-19)—Next Generation EU (NGEU).

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned Aerial Vehicle
RL: Reinforcement Learning
Q-Learning: Quality-based Learning
LoS: Line-of-Sight
NLoS: Non-Line-of-Sight
FSPL: Free-Space Path Loss
A2G: Air-to-Ground
MARL: Multi-Agent Reinforcement Learning
SBC: Safety Barrier Certificate
API: Application Programming Interface
2D: Two-Dimensional
3D: Three-Dimensional
GHz: Gigahertz
MHz: Megahertz
kbps: Kilobits per Second
dB: Decibel

References

1. Li, B.; Fei, Z.; Zhang, Y. UAV communications for 5G and beyond: Recent advances and future trends. IEEE Internet Things J. 2018, 6, 2241–2263.
2. Chandran, I.; Vipin, K. Multi-UAV networks for disaster monitoring: Challenges and opportunities from a network perspective. Drone Syst. Appl. 2024, 12, 1–28.
3. Zeng, Y.; Wu, Q.; Zhang, R. Accessing from the sky: A tutorial on UAV communications for 5G and beyond. Proc. IEEE 2019, 107, 2327–2375.
4. Mozaffari, M.; Saad, W.; Bennis, M.; Nam, Y.H.; Debbah, M. A tutorial on UAVs for wireless networks: Applications, challenges, and open problems. IEEE Commun. Surv. Tutor. 2019, 21, 2334–2360.
5. Xiao, Z.; Xia, P.; Xia, X.G. Enabling UAV cellular with millimeter-wave communication: Potentials and approaches. IEEE Commun. Mag. 2016, 54, 66–73.
6. Wu, Q.; Xu, J.; Zeng, Y.; Ng, D.W.K.; Al-Dhahir, N.; Schober, R.; Swindlehurst, A.L. A comprehensive overview on 5G-and-beyond networks with UAVs: From communications to sensing and intelligence. IEEE J. Sel. Areas Commun. 2021, 39, 2912–2945.
7. Cui, J.; Liu, Y.; Nallanathan, A. Multi-agent reinforcement learning-based resource allocation for UAV networks. IEEE Trans. Wirel. Commun. 2019, 19, 729–743.
8. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
9. Sharvari, N.; Das, D.; Bapat, J.; Das, D. Improved Q-learning based Multi-hop Routing for UAV-Assisted Communication. IEEE Trans. Netw. Serv. Manag. 2024, 22, 1330–1344.
10. Rizvi, D.; Boyle, D. Multi-agent reinforcement learning with action masking for UAV-enabled mobile communications. IEEE Trans. Mach. Learn. Commun. Netw. 2024, 3, 117–132.
11. Souto, A.; Alfaia, R.; Cardoso, E.; Araújo, J.; Francês, C. UAV path planning optimization strategy: Considerations of urban morphology, microclimate, and energy efficiency using Q-learning algorithm. Drones 2023, 7, 123.
12. Shamsoshoara, A.; Khaledi, M.; Afghah, F.; Razi, A.; Ashdown, J. Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning. In Proceedings of the 2019 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 11–14 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
13. Lv, Z.; Xiao, L.; Du, Y.; Niu, G.; Xing, C.; Xu, W. Multi-agent reinforcement learning based UAV swarm communications against jamming. IEEE Trans. Wirel. Commun. 2023, 22, 9063–9075.
14. Wang, Y.; Cui, Y.; Yang, Y.; Li, Z.; Cui, X. Multi-UAV Path Planning for Air-Ground Relay Communication Based on Mix-Greedy MAPPO Algorithm. Drones 2024, 8, 706.
15. Wilson, S.; Glotfelter, P.; Wang, L.; Mayya, S.; Notomista, G.; Mote, M.; Egerstedt, M. The robotarium: Globally impactful opportunities, challenges, and lessons learned in remote-access, distributed control of multirobot systems. IEEE Control Syst. Mag. 2020, 40, 26–44.
16. Canese, L.; Cardarilli, G.C.; Dehghan Pir, M.M.; Di Nunzio, L.; Spanò, S. Design and Development of Multi-Agent Reinforcement Learning Intelligence on the Robotarium Platform for Embedded System Applications. Electronics 2024, 13, 1819.
17. Mugil, D.U.; Valenti, C.; Villani, A. A Simulation of a Telecommunications Channel with UAV-Based Q-Learning Network. In Proceedings of the CEUR Workshop Proceedings, Kyiv, Ukraine, 19 December 2024; Volume 3870, pp. 26–31.
18. Al-Hourani, A.; Kandeepan, S.; Jamalipour, A. Modeling air-to-ground path loss for low altitude platforms in urban environments. In Proceedings of the 2014 IEEE Global Communications Conference, Austin, TX, USA, 8–12 December 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2898–2904.
19. Fan, J.; Cui, M.; Zhang, G.; Chen, Y. Throughput improvement for multi-hop UAV relaying. IEEE Access 2019, 7, 147732–147742.
20. Zhang, C.; Liao, X.; Wu, Z.; Qiu, G.; Chen, Z.; Yu, Z. Deep q-learning-based buffer-aided relay selection for reliable and secure communications in two-hop wireless relay networks. Sensors 2023, 23, 4822.
21. Ni, H.; Zhu, Q.; Hua, B.; Mao, K.; Pan, Y.; Ali, F.; Zhong, W.; Chen, X. Path loss and shadowing for UAV-to-ground UWB channels incorporating the effects of built-up areas and airframe. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17066–17077.
22. Arani, A.H.; Hu, P.; Zhu, Y. UAV-assisted space-air-ground integrated networks: A technical review of recent learning algorithms. IEEE Open J. Veh. Technol. 2024, 5, 1004–1023.
23. Torbati, R.J.; Lohiya, S.; Singh, S.; Nigam, M.S.; Ravichandar, H. Marbler: An open platform for standardized evaluation of multi-robot reinforcement learning algorithms. In Proceedings of the 2023 International Symposium on Multi-Robot and Multi-Agent Systems (MRS), Boston, MA, USA, 4–5 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 57–63.
24. Wang, L.; Ames, A.D.; Egerstedt, M. Safety barrier certificates for collisions-free multirobot systems. IEEE Trans. Robot. 2017, 33, 661–674.
25. Emam, Y.; Glotfelter, P.; Egerstedt, M. Robust barrier functions for a fully autonomous, remotely accessible swarm-robotics testbed. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3984–3990.
26. Rothmann, M.; Porrmann, M. A survey of domain-specific architectures for reinforcement learning. IEEE Access 2022, 10, 13753–13767.
27. Xu, Y.; Wei, Y.; Jiang, K.; Wang, D.; Deng, H. Multiple UAVs path planning based on deep reinforcement learning in communication denial environment. Mathematics 2023, 11, 405.
28. Ao, T.; Zhang, K.; Shi, H.; Jin, Z.; Zhou, Y.; Liu, F. Energy-efficient multi-UAVs cooperative trajectory optimization for communication coverage: An MADRL approach. Remote Sens. 2023, 15, 429.
29. Omoniwa, B.; Galkin, B.; Dusparic, I. Communication-enabled deep reinforcement learning to optimise energy-efficiency in UAV-assisted networks. Veh. Commun. 2023, 43, 100640.
30. Ding, Y.; Yang, Z.; Pham, Q.V.; Hu, Y.; Zhang, Z.; Shikh-Bahaei, M. Distributed machine learning for uav swarms: Computing, sensing, and semantics. IEEE Internet Things J. 2023, 11, 7447–7473.
Figure 1. Simulation setup in the Robotarium Environment. Twelve mobile users (U1–U12) are randomly distributed across a 5 × 8 discretized grid, while three UAV agents (D1–D3) are positioned at the grid-cell centers. Each cell represents a 400 m × 400 m real-world area.
Figure 2. UAV-assisted two-hop communication setup across three urban morphologies: (a) Highrise Urban, (b) Dense Urban, and (c) Suburban. Solid lines represent uplink connections (user to UAV), while dashed lines represent downlink connections (UAV to user).
Figure 3. Performance of the proposed framework in the Suburban environment. Panels (a–c) show the average reward for Scenarios 1–3 (S1: 1 UAV/6 users, S2: 3 UAVs/12 users, S3: 5 UAVs/20 users); (d–f) show the corresponding average bottleneck throughput (Mbps); and (g–i) present the failure ratio (%).
Figure 4. Performance of the proposed framework in the Dense Urban environment. Panels (a–c) show the average reward, (d–f) depict the bottleneck throughput (Mbps), and (g–i) present the failure ratio (%).
Figure 5. Performance of the proposed framework in the Highrise Urban environment. Panels (a–c) show the average reward, (d–f) the average bottleneck throughput (Mbps), and (g–i) the failure ratio (%).
Table 1. Average reward (mean ± standard deviation) obtained with fixed and scheduled discount factors across different environments and scenarios.

Environment | Scenario | Fixed γ | γ Decay
Suburban | S1 | 8.52 ± 0.74 | 2.69 ± 8.83
Suburban | S2 | 7.09 ± 0.96 | 5.01 ± 1.20
Suburban | S3 | 6.25 ± 0.77 | 4.84 ± 0.98
Dense Urban | S1 | 6.75 ± 0.63 | 1.37 ± 9.07
Dense Urban | S2 | 5.65 ± 0.83 | 3.80 ± 1.94
Dense Urban | S3 | 4.48 ± 0.91 | 4.19 ± 1.45
Highrise Urban | S1 | 0.62 ± 9.31 | 8.73 ± 8.86
Highrise Urban | S2 | 5.93 ± 5.85 | 6.22 ± 4.86
Highrise Urban | S3 | 2.16 ± 3.26 | 4.30 ± 3.02
Table 2. System configuration.

Parameter | Value/Description
Simulation area | 3.2 km × 2.0 km
Grid resolution | 5 × 8 (40 cells)
Environment types | Suburban, Dense Urban, Highrise Urban
Number of UAVs | 1, 3, or 5
Number of users | 6, 12, or 20
Table 3. Learning parameters.

Parameter | Value
Episodes | 60
Iterations per episode | 100
Learning rate (α) | 0.2
Discount factor (γ) | 0.8
UAV altitude | 0.1 km
UAV transmit power | 1 W
User transmit power | 0.03 W
Carrier frequency | 2.1 GHz
System bandwidth | 20 MHz
Throughput threshold | 256 kbps
Table 4. Summary of simulation results across Suburban, Dense Urban, and Highrise Urban environments. Values represent averages over 60 training episodes for each scenario.

Environment | Scenario | Avg. Throughput (Mbps) | Avg. Reward | Failure Rate
Suburban | S1 | 23 | 8.7 | 0.02
Suburban | S2 | 22.4 | 8.5 | 0.01
Suburban | S3 | 23 | 8.1 | 0.01
Dense Urban | S1 | 19 | 6.5 | 0.03
Dense Urban | S2 | 18.8 | 5.4 | 0.04
Dense Urban | S3 | 18.5 | 4.5 | 0.04
Highrise Urban | S1 | 14.5 | −2 | 0.18
Highrise Urban | S2 | 13.5 | −4 | 0.24
Highrise Urban | S3 | 14.3 | −1 | 0.18
Table 5. System-level comparison with representative UAV-assisted communication studies. A check mark (✓) indicates that the feature is explicitly addressed, while a cross (✗) denotes that it was not considered.

Reference | Learning Method | Decentralized | User Mobility | Multi-UAV | UAV Assignment | Failure Tracking
Cui et al. [7] | MARL | ✗ | ✗ | ✓ | ✗ | ✗
Sharvari et al. [9] | Q-learning | ✗ | ✗ | ✗ | ✗ | ✗
Rizvi et al. [10] | MARL + Masking | ✗ | ✗ | ✓ | ✗ | ✗
This Work | Q-learning | ✓ | ✓ | ✓ | ✓ | ✓
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
