1. Introduction
The rapid expansion of wireless networks and the emergence of next-generation 6G systems have created a demand for highly flexible and intelligent communication infrastructures. Unmanned aerial vehicles (UAVs) have recently attracted significant attention as aerial communication nodes due to their rapid deployment capability, adaptive coverage, and enhanced line-of-sight (LoS) connectivity [1,2,3]. When multiple UAVs cooperate in a network, they can provide large-scale, on-demand coverage for heterogeneous users across wide geographic areas, supporting delay-sensitive and energy-constrained IoT applications [1,4]. However, the performance of such networks is constrained by the trade-offs among energy consumption, end-to-end latency, and inter-UAV connectivity, especially under dynamic channel conditions and co-channel interference [2,3].
Recent advances in the emerging low-altitude economy have further expanded the role of UAV networks beyond traditional aerial communication relays. For example, UAV-enabled Integrated Sensing and Communication (ISAC) systems allow aerial platforms to simultaneously perform wireless communication and environmental sensing, improving spectrum utilization and situational awareness. Joint trajectory and beamforming optimization for UAV-based ISAC has been investigated to enhance communication throughput while satisfying sensing requirements [5]. In addition, UAV-assisted Mobile Edge Computing (MEC) has emerged as a promising paradigm for supporting computation-intensive Internet-of-Things (IoT) applications, where adaptive multi-objective optimization frameworks can improve task completion efficiency, energy utilization, and system scalability in dynamic environments [6]. These emerging applications highlight the growing importance of efficient multi-UAV communication architectures capable of adaptive topology control, interference mitigation, and energy-aware resource allocation.
Traditional optimization frameworks, including convex relaxation, game-theoretic formulations, and heuristic power-control strategies, have been widely used to allocate communication resources such as transmit power and channel assignments for improved energy efficiency [1,2]. Despite their success, these approaches require accurate system models and centralized computation, making them unsuitable for large-scale, real-time multi-UAV environments.
Deep reinforcement learning (DRL) has recently emerged as a powerful tool for addressing high-dimensional and non-convex optimization problems in UAV communication networks. DRL frameworks such as Deep Q-Networks (DQN) and Double DQN (DDQN) have been successfully applied to adaptive power control, trajectory planning, and resource allocation without requiring explicit channel modeling [7,8,9,10,11,12]. However, most existing DRL-based solutions primarily focus on physical-layer parameters, such as transmit power or UAV mobility, while assuming fixed or implicitly defined network topologies.
From a network-level perspective, the inter-UAV graph structure plays a critical role in determining interference coupling, energy consumption, and connectivity robustness. Although several topology-control studies emphasize connectivity preservation and routing stability [13,14], they rarely integrate graph-level metrics with physical-layer resource optimization or end-to-end energy–latency objectives. In particular, graph density—which quantifies the fraction of active inter-UAV links—has a direct impact on both spectral efficiency and energy expenditure, yet remains largely unexplored in learning-based UAV optimization frameworks.
Moreover, standard DQN-based approaches often suffer from slow convergence and overestimation bias when applied to large and highly coupled action spaces, such as those arising from joint power allocation, association control, and topology adaptation. These limitations motivate the adoption of enhanced architectures, including dueling network decomposition, which separates state-value and action-advantage estimation to improve training stability and sample efficiency [15,16,17,18]. This observation further supports the need for a topology-aware and architecturally robust DRL framework for multi-UAV networks.
The interplay between network topology and physical-layer performance introduces a critical trade-off: a dense inter-UAV graph improves connectivity and reliability but increases interference and energy consumption, whereas a sparse topology reduces interference but risks network fragmentation and service disruption. Capturing this trade-off within a reinforcement learning framework requires careful reward design and constraint handling.
To address these challenges, this paper proposes a Dueling Deep Q-Network (Dueling DQN) framework for the joint optimization of transmit power, link association, and graph density in multi-UAV IoT networks. By decomposing the Q-value into a state-value and advantage component, the Dueling DQN improves learning stability, convergence speed, and policy generalization compared to conventional DQN and DDQN methods. Additionally, graph density is incorporated as a controllable parameter, allowing the UAV network to adapt its topology to balance energy efficiency, latency, and connectivity robustness. A reward normalization mechanism is applied to properly scale energy efficiency and latency metrics, addressing differences in units and ensuring meaningful joint optimization.
Unlike prior works that independently address power control, routing, or topology design in UAV networks, the proposed framework integrates these components into a unified reinforcement learning–assisted optimization architecture. The introduction of graph-density–aware topology control together with a two-timescale hybrid learning–analytical optimization distinguishes the proposed approach from existing UAV communication optimization methods.
The key contributions of this work are as follows. First, we formulate a novel joint optimization problem that integrates energy efficiency, latency, and graph density control, while considering realistic SINR, interference, and connectivity constraints. Second, we develop a Dueling DQN reinforcement learning framework capable of jointly learning optimal transmit power levels, link associations, and network topology configurations, improving convergence stability and decision robustness. Third, we investigate the impact of graph density on the energy–latency trade-off and demonstrate how adaptive topology regulation enhances connectivity and overall network performance. Finally, extensive simulations validate the proposed approach against baseline schemes, including conventional DQN, DDQN, heuristic, and equal power allocation strategies, demonstrating consistent improvements in energy efficiency, latency, and link reliability across diverse scenarios.
The remainder of this paper is organized as follows. Section 2 provides a review of related work. Section 3 describes the system model and problem formulation, followed by Section 4, which presents the proposed Dueling DQN-based optimization framework. Section 5 discusses simulation results and performance analysis, and Section 6 concludes the paper with future research directions.
2. Related Work
Energy and latency optimization in UAV-assisted communication systems has been extensively studied, particularly in the context of 6G and IoT networks. Many works focus on power or trajectory control, often neglecting topology-level parameters such as inter-UAV graph density, which directly influence connectivity robustness and interference dynamics.
Pervez et al. [1] formulated a joint communication–computation optimization problem for UAV-assisted MEC networks using iterative and water-filling methods. While demonstrating significant gains in energy and latency, the approach assumes a fixed UAV connectivity graph and centralized information.
Federated and edge learning techniques have been integrated into UAV resource management. Tang et al. [19] used federated learning and Lyapunov-based control to minimize energy and training latency, and Yuan et al. [20] proposed layered task offloading to reduce system energy. However, both methods rely on static inter-UAV links and do not incorporate adaptive topology control.
Multi-agent reinforcement learning (MARL) has been applied for distributed UAV coordination. Betalo et al. [21] employed a Multi-Agent DQN to enhance data freshness and energy harvesting, whereas Wang et al. [22] used DDQN for trajectory planning in obstacle-rich environments. Both studies, however, focus on single-dimensional objectives (scheduling or trajectory) and do not consider joint energy–latency–topology optimization.
Other approaches, including DRL combined with IRS or edge deployment strategies [17,23,24], improve energy efficiency or latency but lack explicit modeling of inter-UAV graph density or connectivity adaptation.
Unlike the above studies, which primarily optimize transmit power, trajectory, or scheduling under fixed connectivity assumptions, the proposed framework explicitly treats graph density as a controllable decision variable within the learning process. This enables true joint optimization of energy efficiency, end-to-end latency, and network topology under dynamic interference conditions. By integrating topology adaptation into the reinforcement learning architecture, the proposed approach captures the coupled relationship between connectivity robustness and physical-layer performance, which remains largely unexplored in existing UAV communication studies.
In summary, existing works either optimize a single performance metric or assume static network topologies. In contrast, the proposed Dueling DQN-based framework explicitly incorporates graph density as a controllable variable, enabling adaptive trade-offs between energy consumption, latency, and inter-UAV connectivity. To the best of our knowledge, this study is among the first to explicitly integrate graph-density adaptation with joint energy–latency optimization within a unified reinforcement learning framework.
3. System Model and Problem Formulation
We consider a multi-UAV communication network consisting of $M$ UAVs deployed over a target area to provide cooperative aerial connectivity, as illustrated in Figure 1. Each UAV operates as an aerial communication node capable of establishing wireless links with neighboring UAVs, forming a dynamically reconfigurable network topology. The UAVs collectively support data transmission and control signaling for the underlying ground network, which may include IoT devices, users, or terrestrial base stations.
The inter-UAV network topology is modeled as an undirected graph $G = (V, E)$, where $V$ denotes the set of UAVs and $E$ represents the set of active inter-UAV communication links [25]. The existence of a link between UAV $i$ and UAV $j$ is represented by a binary association variable [26]:
$$a_{ij} = \begin{cases} 1, & \text{if the link between UAV } i \text{ and UAV } j \text{ is active}, \\ 0, & \text{otherwise}. \end{cases}$$
The overall connectivity of the UAV network is quantified using the graph density metric $\rho$, defined as the ratio of active links to all possible UAV connections:
$$\rho = \frac{2\,|E|}{M(M-1)}.$$
Graph density serves as a system-level control variable that captures the tradeoff between connectivity robustness and interference intensity. A higher density improves routing reliability and cooperation among UAVs, while a lower density reduces interference and energy consumption.
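For illustration, the density metric above can be computed directly from an edge set; this is a minimal sketch, and the function name and edge-set representation are ours rather than the paper's:

```python
def graph_density(num_uavs: int, edges: set) -> float:
    """Density of an undirected graph: active links / all possible links,
    i.e. |E| / (M(M-1)/2)."""
    possible = num_uavs * (num_uavs - 1) / 2
    return len(edges) / possible if possible else 0.0

# Example: 4 UAVs with 3 active links out of 6 possible -> density 0.5
edges = {(0, 1), (1, 2), (2, 3)}
print(graph_density(4, edges))  # 0.5
```

A sparse topology (small `edges` set) lowers interference and energy cost, while a dense one improves routing reliability, matching the trade-off described above.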
Each UAV $i$ is located in three-dimensional space at coordinates $(x_i, y_i, h_i)$. The Euclidean distance between UAV $i$ and UAV $j$ is given by
$$d_{ij} = \sqrt{\Delta_{ij}^{2} + \Delta h_{ij}^{2}},$$
where $\Delta_{ij}$ and $\Delta h_{ij}$ denote the horizontal and vertical separations, respectively. Assuming dominant line-of-sight (LoS) propagation, the path loss between UAV $i$ and UAV $j$ is modeled as [27]
$$L_{ij} = L_0 + 10\,\alpha \log_{10}\!\left(\frac{d_{ij}}{d_0}\right).$$
For analytical clarity and controlled performance evaluation, the channel gain is modeled using a deterministic large-scale path-loss model. This modeling choice allows the impact of topology control, power allocation, and interference management to be clearly analyzed without additional variability introduced by fast fading. Nevertheless, the proposed reinforcement learning framework is not restricted to deterministic channels and can naturally accommodate stochastic channel variations (e.g., Rayleigh or Rician fading) by incorporating instantaneous channel realizations into the state representation.
The received power at UAV $j$ from UAV $i$ is
$$P^{\mathrm{rx}}_{ij} = P_{ij}\, g_{ij},$$
where $P_{ij}$ is the transmit power allocated to link $(i,j)$ and $g_{ij}$ is the linear-scale channel gain obtained from the path loss $L_{ij}$, with $L_0$ the reference path loss at distance $d_0$ and $\alpha$ the path-loss exponent. Note that although path loss and transmit power may be expressed in dB form for modeling convenience, all SINR and interference calculations in the simulations are performed using linear-scale power values after appropriate conversion.
The resulting signal-to-interference-plus-noise ratio (SINR) is expressed as
$$\gamma_{ij} = \frac{P_{ij}\, g_{ij}}{\sum_{(k,l)\in E,\,(k,l)\neq(i,j)} P_{kl}\, g_{kj} + \sigma^{2}},$$
where $\sigma^{2}$ denotes the additive white Gaussian noise power. The achievable data rate for link $(i,j)$ is
$$R_{ij} = B \log_{2}\!\left(1 + \gamma_{ij}\right),$$
with $B$ being the channel bandwidth. The energy efficiency (EE) of link $(i,j)$ is defined as the successfully transmitted data per unit transmission energy:
$$\eta_{ij} = \frac{R_{ij}}{P_{ij}}.$$
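The dB-to-linear conversion noted above, together with the SINR, rate, and EE computations, can be sketched as follows; the parameter names are illustrative, and the dBm-to-watt conversion subtracts 30 dB:

```python
import math

def db_to_linear(db: float) -> float:
    """Convert a dB quantity to linear scale."""
    return 10 ** (db / 10)

def link_metrics(p_tx_dbm, pl_db, interference_w, noise_w, bandwidth_hz):
    """SINR, achievable rate (bps), and EE (bps/W) for one link,
    computed with linear-scale power values as the paper prescribes."""
    p_rx_w = db_to_linear(p_tx_dbm - pl_db - 30)  # received power in watts
    sinr = p_rx_w / (interference_w + noise_w)
    rate = bandwidth_hz * math.log2(1 + sinr)     # Shannon rate
    p_tx_w = db_to_linear(p_tx_dbm - 30)
    ee = rate / p_tx_w                            # energy efficiency
    return sinr, rate, ee
```

For example, a 20 dBm transmitter over a 90 dB path loss yields a received power of $10^{-10}$ W, from which SINR, rate, and EE follow directly.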
The transmission latency for a packet of size $S$ is given by
$$T_{ij} = \frac{S}{R_{ij}} + \tau_{ij},$$
where $\tau_{ij}$ denotes the propagation delay. To ensure a meaningful joint optimization, both metrics are normalized as
$$\bar{\eta}_{ij} = \frac{\eta_{ij}}{\eta_{\max}}, \qquad \bar{T}_{ij} = \frac{T_{ij}}{T_{\max}},$$
where $\eta_{\max}$ and $T_{\max}$ represent reference maximum values. The joint optimization objective is formulated as a weighted, dimensionless utility function:
$$U_{ij} = w_1\,\bar{\eta}_{ij} - w_2\,\bar{T}_{ij},$$
where $w_1$ and $w_2$ control the relative importance of energy efficiency and latency. The network-wide optimization problem is expressed as
$$\max_{\{P_{ij}\},\,\{a_{ij}\},\,\rho}\; \sum_{(i,j)\in E} U_{ij}$$
subject to
$$0 \le P_{ij} \le P_{\max}, \qquad \gamma_{ij} \ge \gamma_{\min}, \qquad I_j \le I_{\max}, \qquad \sum_{j} a_{ij} \le D_{\max}, \qquad |\rho - \rho^{*}| \le \epsilon.$$
Here, $P_{\max}$ denotes the maximum transmit power, $\gamma_{\min}$ is the minimum SINR requirement, $I_{\max}$ represents the maximum tolerable interference at each UAV receiver, $D_{\max}$ limits the maximum number of connections per UAV, $\rho^{*}$ is the target graph density, and $\epsilon$ is a small tolerance parameter.
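As an illustrative aid, the constraint set can be checked for a candidate per-link solution as follows; all names and threshold arguments are ours, not the paper's:

```python
def feasible(p, sinr, interference, degree, rho, *,
             p_max, sinr_min, i_max, d_max, rho_target, eps):
    """Check the power, SINR, interference, degree, and graph-density
    constraints of the network-wide optimization problem."""
    return (0.0 <= p <= p_max          # transmit power budget
            and sinr >= sinr_min       # minimum SINR requirement
            and interference <= i_max  # interference tolerance
            and degree <= d_max        # per-UAV connection limit
            and abs(rho - rho_target) <= eps)  # density tolerance
```

Such a predicate is convenient for rejecting infeasible actions before they are scored by the learning agent.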
The resulting optimization problem is a mixed-integer nonlinear programming (MINLP) problem, characterized by non-convex fractional objectives and combinatorial topology constraints. The exponential growth of the solution space renders conventional optimization methods computationally infeasible for real-time multi-UAV networks, motivating the adoption of a learning-based solution, as described in the following section.
4. Solving the Optimization Problem Using Dueling DQN
The joint energy–latency optimization problem introduced in Section 3 is highly non-convex due to coupled interference, discrete link association variables, and the combinatorial nature of graph-density control. Classical optimization methods, such as convex relaxation or dual decomposition, become computationally intractable in large-scale multi-UAV networks and are unsuitable for real-time deployment. To address these challenges, a Dueling DQN framework is adopted, enabling UAVs to learn optimal policies for transmit power, link association, and network topology directly from interaction with the environment without requiring explicit channel or interference models [17]. As illustrated in Figure 2, the proposed Dueling DQN consists of an online network and a target network, where the action-value function is decomposed into state-value and advantage streams.
4.1. Two-Timescale Hybrid Optimization
To enhance learning efficiency and ensure feasibility, we adopt a two-timescale hybrid optimization approach. At the slower timescale, the Dueling DQN learns the optimal inter-UAV link associations and graph-density configuration. At the faster timescale, the transmit power of each active link is optimized using a Newton–Bisection solver under a fixed topology [28]. Formally, the hierarchical optimization can be expressed as
$$\max_{\mathbf{a},\,\rho}\; U\big(\mathbf{a}, \rho, \mathbf{P}^{*}(\mathbf{a},\rho)\big),$$
where the optimal power vector is obtained from
$$\mathbf{P}^{*}(\mathbf{a},\rho) = \arg\max_{\mathbf{0}\,\le\,\mathbf{P}\,\le\,P_{\max}\mathbf{1}} U(\mathbf{a},\rho,\mathbf{P}).$$
This decomposition reduces the action space for the DQN while preserving near-optimal continuous power allocation. It is important to emphasize that the Dueling DQN does not directly compute the final transmit power values. Instead, it determines high-level discrete decisions, including link association, graph-density configuration, and coarse power adjustment direction. Once these structural decisions are fixed, the Newton–Bisection solver performs continuous per-link power refinement to maximize the utility function under the current interference conditions. This hybrid design allows the reinforcement learning agent to focus on the combinatorial network optimization while the analytical solver efficiently handles continuous power optimization.
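The two-timescale decomposition can be sketched as follows. Here `candidate_topologies`, `solve_power`, and `utility` are hypothetical placeholders: a simplified enumeration stands in for the learned discrete policy, and `solve_power` stands in for the per-link Newton–Bisection refinement:

```python
def two_timescale_step(candidate_topologies, solve_power, utility):
    """Slow timescale: pick among discrete topology/association candidates.
    Fast timescale: refine power analytically for each fixed topology."""
    best = None
    for topo in candidate_topologies:
        power = solve_power(topo)       # inner continuous optimization
        score = utility(topo, power)    # joint energy-latency utility
        if best is None or score > best[0]:
            best = (score, topo, power)
    return best
```

In the actual framework the outer selection is performed by the Dueling DQN rather than exhaustive enumeration, which is what keeps the approach tractable at scale.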
4.2. Per-Link Power Optimization and Interference Awareness
For a given topology and association state, the utility of a link $(i,j)$ is defined as a weighted trade-off between normalized energy efficiency and latency:
$$U_{ij}(P_{ij}) = w_1\,\frac{\eta_{ij}(P_{ij})}{\eta_{\max}} - w_2\,\frac{T_{ij}(P_{ij})}{T_{\max}},$$
where $w_1$ and $w_2$ are weight coefficients, and $\eta_{\max}$, $T_{\max}$ denote the maximum expected values used for normalization to ensure both terms are dimensionless and comparable. With fixed interference $I_j$, the per-link utility reduces to a single-variable function of $P_{ij}$ through the SINR
$$\gamma_{ij} = c_{ij}\,P_{ij},$$
where $c_{ij} = g_{ij}/(I_j + \sigma^{2})$. The optimal per-link transmit power $P_{ij}^{*}$ satisfies the stationarity condition
$$\left.\frac{\partial U_{ij}(P_{ij})}{\partial P_{ij}}\right|_{P_{ij}=P_{ij}^{*}} = 0,$$
and is computed numerically using a Newton–Bisection solver (see Appendix A).
To incorporate interference coupling among UAVs, the interference at UAV $j$ is iteratively updated as
$$I_j = \sum_{(k,l)\in E,\,(k,l)\neq(i,j)} P_{kl}\, g_{kj},$$
allowing the power update to adapt to the current network state.
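A safeguarded Newton–Bisection root-finder for the stationarity condition $\partial U/\partial P = 0$ might look like the following generic sketch; it is not the paper's Appendix A implementation, and the derivative is approximated numerically:

```python
def newton_bisection(g, lo, hi, tol=1e-9, max_iter=100):
    """Find a root of g on [lo, hi], assuming g(lo) and g(hi) have
    opposite signs: take Newton steps while they stay bracketed,
    fall back to bisection otherwise."""
    g_lo = g(lo)
    x = 0.5 * (lo + hi)
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            return x
        # shrink the bracket around the sign change
        if (gx > 0) == (g_lo > 0):
            lo, g_lo = x, gx
        else:
            hi = x
        # numerical Newton step; bisect if it escapes the bracket
        h = 1e-6 * max(1.0, abs(x))
        slope = (g(x + h) - g(x - h)) / (2 * h)
        x_new = x - gx / slope if slope else 0.5 * (lo + hi)
        x = x_new if lo < x_new < hi else 0.5 * (lo + hi)
    return x
```

The bracketing guard gives bisection's robustness when the utility derivative is poorly conditioned, while Newton steps provide fast local convergence near $P_{ij}^{*}$.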
4.3. Constraint Handling via Projection and Topology Pruning
Feasibility of the solution is guaranteed by projecting continuous power actions onto the allowable range:
$$P_{ij} \leftarrow \min\!\big(\max(P_{ij}, 0),\, P_{\max}\big).$$
Graph-density constraints are enforced by pruning the lowest-utility links such that the total number of active links satisfies
$$|E| \le \left\lfloor \rho^{*}\,\frac{M(M-1)}{2} \right\rfloor.$$
This deterministic pruning ensures connectivity while mitigating interference.
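The projection and utility-based pruning steps can be sketched as follows; the names are hypothetical, and the paper's pruning additionally preserves connectivity, which this minimal version omits:

```python
def project_power(p: float, p_max: float) -> float:
    """Project a continuous power action onto the feasible range [0, p_max]."""
    return min(max(p, 0.0), p_max)

def prune_to_density(link_utilities: dict, num_uavs: int, rho_target: float) -> set:
    """Keep only the highest-utility links so that
    |E| <= floor(rho_target * M(M-1)/2)."""
    budget = int(rho_target * num_uavs * (num_uavs - 1) / 2)
    ranked = sorted(link_utilities, key=link_utilities.get, reverse=True)
    return set(ranked[:budget])
```

Sorting by per-link utility makes the pruning deterministic: the same network state always yields the same retained topology.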
4.4. Markov Decision Process Formulation
For conciseness and to avoid repetition of widely known reinforcement learning formulations, only problem-specific elements of the MDP are detailed here, while standard definitions of Markov Decision Processes and DQN training mechanisms are summarized and properly referenced. This condensation enhances clarity without affecting mathematical completeness.
The Dueling DQN is trained on a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$ [29]. The state $s_t \in \mathcal{S}$ encodes the network conditions:
$$s_t = \big\{\gamma_{ij},\, e_i,\, \rho,\, a_{ij}\big\},$$
where $\gamma_{ij}$ is the link SINR, $e_i$ is the residual UAV energy, $\rho$ is the graph density, and $a_{ij}$ is the link association indicator. The action $a_t \in \mathcal{A}$ determines discrete adjustments in power and link associations. The reward function is defined as
$$r_t = \sum_{(i,j)\in E} U_{ij} \;-\; \lambda_1\,\mathbb{1}\{\gamma_{ij} < \gamma_{\min}\} \;-\; \lambda_2\,\mathbb{1}\{I_j > I_{\max}\} \;-\; \lambda_3\,\big|\rho - \rho^{*}\big|,$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are penalty coefficients for SINR violation, excessive interference, and graph-density deviation, respectively.
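The reward with its three penalty terms can be sketched as a minimal single-link illustration; the paper's reward aggregates over all links, and all argument names here are ours:

```python
def reward(utility_sum, sinr, interference, rho, *,
           sinr_min, i_max, rho_target, lam1, lam2, lam3):
    """Utility minus penalties for SINR violation, excess interference,
    and graph-density deviation (lam1..lam3 are penalty coefficients)."""
    penalty = (lam1 * (sinr < sinr_min)        # indicator: SINR violation
               + lam2 * (interference > i_max)  # indicator: interference cap
               + lam3 * abs(rho - rho_target))  # density deviation
    return utility_sum - penalty
```

The indicator penalties are zero whenever the constraints hold, so a feasible policy receives the unmodified utility as its reward.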
4.5. Dueling DQN Architecture and Learning
The following expressions follow the standard Dueling DQN value–advantage decomposition and are included for completeness, with emphasis on their adaptation to the proposed multi-UAV joint optimization framework.
The Dueling DQN approximates the action-value function as
$$Q(s, a; \theta) = V(s; \theta) + \left( A(s, a; \theta) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta) \right),$$
where $V(s;\theta)$ is the state-value function and $A(s,a;\theta)$ is the advantage function. The temporal-difference (TD) target is
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}),$$
and the network parameters are updated by minimizing
$$\mathcal{L}(\theta) = \mathbb{E}\big[(y_t - Q(s_t, a_t; \theta))^{2}\big].$$
The target network parameters are softly updated via
$$\theta^{-} \leftarrow \tau\,\theta + (1 - \tau)\,\theta^{-},$$
with $0 < \tau \ll 1$, and an $\epsilon$-greedy policy maintains exploration.
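The value–advantage aggregation and the soft target update can be illustrated in a few lines. This is a plain-Python sketch on scalar lists; a real implementation applies the same arithmetic to neural-network outputs and parameter tensors:

```python
def dueling_q(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')); subtracting the mean
    advantage makes the value/advantage decomposition identifiable."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def soft_update(theta_target, theta, tau):
    """theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * t + (1 - tau) * tt for t, tt in zip(theta, theta_target)]
```

Note that subtracting the mean advantage shifts all Q-values equally, so the greedy action is unchanged while the decomposition becomes unique.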
The Dueling DQN network consists of two fully connected hidden layers with 256 neurons each, using ReLU activation functions. The learning rate is set to , with discount factor . The replay buffer size is 50,000, and a mini-batch size of 64 is used for stochastic gradient updates. The soft target update factor is , and training is conducted over 1000 episodes to ensure stable convergence.
4.6. Convergence and Performance Metrics
The proposed hybrid Dueling DQN framework operates in a non-convex, high-dimensional action space due to coupled interference, discrete link association, and graph-density control. Under fixed interference conditions, the per-link power optimization admits a unique stationary solution. Furthermore, the Dueling DQN is trained with bounded rewards and a finite action space.
Due to the use of deep reinforcement learning in a non-convex environment, formal optimality guarantees cannot be strictly established. However, under bounded rewards and a finite action space, the training process empirically converges to a stable policy that consistently improves the joint energy–latency performance across different network configurations.
To quantitatively evaluate the performance of the multi-UAV network under the learned policy, standard metrics are adopted. These metrics capture the trade-offs between energy efficiency, latency, and reliability.
These metrics directly correspond to the objectives of the optimization framework: maximizing energy efficiency, minimizing latency, and ensuring robust inter-UAV connectivity. By monitoring the performance metrics listed in Table 1, namely energy efficiency, latency, and link reliability, the proposed Dueling DQN policy can be systematically evaluated and compared against baseline schemes across diverse network scenarios. This framework enables UAVs to autonomously adjust transmit power, link associations, and network topology, thereby achieving a balanced trade-off among energy consumption, latency, and connectivity reliability in real-time multi-UAV IoT networks.
4.7. Algorithm Description and Complexity Analysis
Algorithm 1 summarizes the proposed Dueling DQN-based framework for joint optimization of transmit power, inter-UAV link association, and graph density. At each time step, UAVs observe the network state and select actions using an $\epsilon$-greedy policy. The Dueling DQN decomposes the action-value function into state-value and advantage components for stable learning. For each active link, transmit power is refined using a Newton–Bisection solver and projected onto $[0, P_{\max}]$. Graph-density constraints are enforced by retaining the highest-utility links to satisfy the target density $\rho^{*}$. Experience replay and soft target updates are employed to stabilize training.
The computational complexity consists of three parts: Dueling DQN updates, per-link power optimization, and graph-density pruning. Let $B$ denote the mini-batch size and $|E|$ the number of active links. The Dueling DQN update has complexity $O(B\,N_{\theta})$, where $N_{\theta}$ is the number of network parameters. Power optimization requires $O(|E|\,K)$ operations, where $K$ is the number of Newton–Bisection iterations. Graph-density pruning has complexity $O(|E|\log|E|)$ due to utility-based link sorting. Therefore, the overall complexity per learning step is
$$O\!\big(B\,N_{\theta} + |E|\,K + |E|\log|E|\big).$$
Although the theoretical MDP action space grows with the number of UAVs, the proposed framework incorporates several mechanisms that improve scalability. The two-timescale hybrid design restricts the reinforcement learning agent to discrete topology and association decisions, while continuous power allocation is solved analytically via the Newton–Bisection method. In addition, graph-density control and connectivity-aware pruning limit the number of active links, ensuring that the effective decision space remains structured and computationally manageable even for larger UAV networks.
Algorithm 1: Dueling DQN for Joint Power, Association, and Graph Density Optimization.
5. Simulation and Results Analysis
The simulation environment considers a multi-UAV IoT network deployed over a smart city area. Each UAV operates as a mobile access point serving ground IoT devices distributed according to a Poisson Point Process (PPP). UAV altitudes are constrained within 80–120 m to ensure sufficient coverage overlap and dynamic inter-UAV connectivity. The average network graph density $\rho$ is varied from sparse to fully connected regimes to examine different connectivity conditions. UAV horizontal positions follow a bounded random waypoint mobility model with limited speed. The wireless channel follows a probabilistic LoS model with distance-dependent path loss and log-normal shadowing. UAVs are subject to limited onboard energy, with realistic hovering and propulsion power consumption models summarized in Table 2.
The proposed Dueling DQN framework jointly optimizes energy efficiency and end-to-end latency, including transmission and queuing delays, by adaptively adjusting transmit powers and inter-UAV associations. The reward weights $w_1$ and $w_2$ are selected as 0.6:0.4 to prioritize energy sustainability while maintaining latency fairness.
The reward formulation represents a linear scalarization of the underlying bi-objective optimization problem, where energy efficiency and latency are inherently conflicting objectives. Varying the weight ratio $w_1{:}w_2$ corresponds to selecting different operating points along the Pareto frontier of feasible energy–latency trade-offs. A sensitivity analysis was conducted by varying the weight ratio across (0.3:0.7), (0.5:0.5), and (0.8:0.2), confirming that increasing $w_1$ improves energy efficiency at the cost of a moderate latency increase, while increasing $w_2$ reduces latency with higher energy consumption. The selected configuration (0.6:0.4) lies near a balanced Pareto-efficient region and demonstrates stable convergence behavior.
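The linear scalarization described above is straightforward to express; this is illustrative only, with `ee_norm` and `lat_norm` standing for the normalized energy-efficiency and latency metrics:

```python
def scalarized_utility(ee_norm: float, lat_norm: float,
                       w1: float, w2: float) -> float:
    """Linear scalarization of the bi-objective problem:
    w1 * normalized EE  -  w2 * normalized latency."""
    return w1 * ee_norm - w2 * lat_norm

# Sweeping (w1, w2) selects different operating points on the trade-off:
for w1, w2 in [(0.3, 0.7), (0.5, 0.5), (0.6, 0.4), (0.8, 0.2)]:
    print(w1, w2, scalarized_utility(0.9, 0.3, w1, w2))
```

Each weight pair fixes one point on the energy–latency frontier, mirroring the sensitivity analysis reported above.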
The convergence behavior, stability, and generalization capability of the proposed scheme are evaluated under varying numbers of UAVs, traffic loads, and graph densities.
To demonstrate the effectiveness of the proposed framework, its performance is compared against several baseline schemes under identical simulation settings. The first benchmark is a convolutional DQN, which employs convolutional layers to extract features from the state representation but does not incorporate the dueling value–advantage decomposition [39]. The second baseline is DDQN, which mitigates Q-value overestimation by decoupling action selection and evaluation while maintaining a standard non-dueling network architecture [11]. For non-learning-based approaches, a heuristic power allocation (PA) scheme with fixed UAV associations and iterative water-filling optimization is considered [40], along with an equal power (EP) scheme assuming static connectivity [41]. These baselines enable a comprehensive assessment of convergence speed, stability, and performance gains in terms of energy efficiency, latency, and outage probability.
Figure 3 illustrates the training convergence behavior of the proposed Dueling DQN compared with Conv-DQN and DDQN. The energy efficiency, measured in bps/W, is plotted against the number of training episodes. The proposed Dueling DQN converges significantly faster and attains the highest steady-state performance, approaching bps/W. In contrast, Conv-DQN and DDQN exhibit slower convergence and stabilize at lower energy-efficiency levels. These results confirm that the dueling value–advantage decomposition improves learning stability and accelerates convergence.
Figure 4 compares the energy-efficiency performance under different network densities and UAV scales. Across all evaluated scenarios, the proposed Dueling DQN consistently outperforms the benchmark methods, demonstrating strong robustness to topology variations.
Under sparse connectivity conditions (), the proposed framework achieves an average energy efficiency of approximately bps/W with eight UAVs, yielding performance gains of about 11% and 17% over DDQN and Conv-DQN, respectively. The heuristic power allocation scheme attains only bps/W, highlighting the limitations of static optimization approaches in dynamic interference environments.
As the network density increases, the robustness of the proposed Dueling DQN becomes more pronounced. In fully connected topologies (), the proposed method maintains an energy efficiency of bps/W, corresponding to a marginal degradation of only 1.4% relative to the sparse case. In comparison, DDQN and Conv-DQN experience larger performance reductions due to increased state–action complexity and interference coupling. This resilience is attributed to the ability of the dueling architecture to decouple state values from action advantages, enabling more reliable policy updates in dense network conditions.
Scalability is further evaluated by increasing the number of UAVs under different graph densities. The proposed Dueling DQN sustains near-optimal performance as the network scales, maintaining energy efficiency above bps/W for UAV counts ranging from four to eight at . In contrast, Conv-DQN and DDQN begin to degrade beyond six UAVs, while non-learning baseline schemes exhibit limited scalability.
Figure 5 presents the end-to-end latency performance under sparse, moderate, and dense connectivity scenarios. The proposed Dueling DQN consistently achieves the lowest latency across all evaluated conditions.
Under sparse connectivity (), the proposed approach maintains an average latency below s with eight UAVs, achieving latency reductions of approximately 52% and 67% compared with the uncertainty-based and Double-FQPC schemes, respectively. The random selection strategy suffers from excessive delays exceeding s.
For moderate connectivity (), the proposed framework exhibits graceful scalability, with latency increasing from s to s as the number of UAVs grows from four to eight. Competing methods experience steeper latency growth, indicating inferior interference management under increasing network size. In dense topologies (), the proposed Dueling DQN maintains latency below s with eight UAVs, representing an increase of only 18% compared to sparse conditions.
Figure 6 and Figure 7 illustrate the impact of graph density on energy efficiency and latency for a 30-UAV network. As graph density increases, all schemes experience performance degradation due to intensified interference and coordination overhead. Nevertheless, the proposed Dueling DQN consistently achieves the highest energy efficiency and the lowest latency across all density levels.
Figure 8 depicts the energy–latency trade-off among the evaluated schemes. The proposed Dueling DQN occupies the Pareto-optimal region, achieving both high energy efficiency (approximately bps/W) and low latency (approximately s). In comparison, Conv-DQN and DDQN offer moderate trade-offs, while heuristic and random schemes cluster in the low-efficiency, high-latency region. This confirms the superiority of the proposed approach in jointly optimizing conflicting performance objectives.
Table 3 summarizes the statistical energy-efficiency performance of all evaluated methods. In addition to achieving the highest mean energy efficiency, the proposed Dueling DQN exhibits the lowest coefficient of variation (2.7%), indicating improved learning stability and robustness. Compared with Conv-DQN and DDQN, the reduced variance demonstrates that the dueling architecture not only enhances average performance but also yields more consistent and reliable convergence behavior.
6. Conclusions
This paper presented a Dueling Deep Q-Network (Dueling DQN) framework for joint energy efficiency and latency optimization in multi-UAV communication networks. Unlike conventional DRL-based approaches that optimize transmit power or trajectory in isolation, the proposed framework jointly learns transmit power allocation, inter-UAV link association, and adaptive graph density regulation within a unified learning model. This integrated design enables each UAV to autonomously balance energy consumption, end-to-end latency, and connectivity robustness in dynamic, interference-limited environments.
Extensive simulation results demonstrated that the proposed Dueling DQN consistently outperforms conventional DQN, Double DQN (DDQN), and non-learning heuristic schemes. Specifically, the proposed approach achieves up to 15% improvement in energy efficiency, reduces end-to-end latency by up to 12%, and exhibits significantly enhanced convergence stability across varying network densities and UAV scales. By explicitly incorporating graph density as a controllable decision variable, the framework dynamically adapts the network topology, preserving performance even under dense connectivity conditions. Furthermore, the dueling value–advantage decomposition improves learning efficiency by stabilizing Q-value estimation and accelerating convergence in high-dimensional state–action spaces.
Overall, the results confirm that the proposed Dueling DQN framework provides a robust, scalable, and data-driven solution for real-time resource management in 6G-enabled multi-UAV networks. Future work will extend this framework toward multi-agent coordination and federated reinforcement learning, as well as the integration of digital-twin-assisted environments, to further enhance scalability, security, and energy-aware intelligence in UAV-assisted smart city and emergency response applications.