Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks

Zhu, Xu; He, Junyu; Zhao, Ming

doi:10.3390/info16100849

Open AccessArticle

Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks

by

Xu Zhu

^1,2

,

Junyu He

¹ and

Ming Zhao

^1,*

¹

School of Computer Science, Central South University, Changsha 410083, China

²

School of Information Engineering, Hunan Industrial Vocational and Technical College, Changsha 410208, China

^*

Author to whom correspondence should be addressed.

Information 2025, 16(10), 849; https://doi.org/10.3390/info16100849

Submission received: 22 August 2025 / Revised: 27 September 2025 / Accepted: 29 September 2025 / Published: 1 October 2025

(This article belongs to the Section Internet of Things (IoT))

Download

Browse Figures

Versions Notes

Abstract

This paper studies joint data collection and wireless power transfer in a UAV-assisted IoT network. A rotary-wing UAV follows a fly–hover–communicate cycle. At each hover, it simultaneously receives uplink data in full-duplex mode while delivering radio-frequency energy to nearby devices. Using a realistic propulsion-power model and a nonlinear energy-harvesting model, we formulate trajectory and hover control as a multi-objective optimization problem that maximizes the aggregate data rate and total harvested energy while minimizing the UAV’s energy consumption over the mission. To enable flexible trade-offs among these objectives under time-varying conditions, we propose a dynamic, state-adaptive weighting mechanism that generates environment-conditioned weights online, which is integrated into an enhanced deep deterministic policy gradient (DDPG) framework. The resulting dynamic-weight MODDPG (DW-MODDPG) policy adaptively adjusts the UAV’s trajectory and hover strategy in response to real-time variations in data demand and energy status. Simulation results demonstrate that DW-MODDPG achieves superior overall performance and a more favorable balance among the three objectives. Compared with the fixed-weight baseline, our algorithm increases total harvested energy by up to 13.8% and the sum data rate by up to 5.4% while maintaining comparable or even lower UAV energy consumption.

Keywords:

Internet of Things (IoT); wireless power transfer (WPT); multi-objective optimization (MOO); deep deterministic policy gradient (DDPG)

1. Introduction

The Internet of Things (IoT) has matured into a foundational paradigm that interconnects vast numbers of sensors and actuators to enable pervasive sensing, real-time monitoring, and intelligent control across application domains such as smart homes, industrial automation, healthcare, and environmental monitoring [1,2,3]. As the population of connected devices grows exponentially, it becomes increasingly critical to ensure dependable data acquisition, low end-to-end latency, and sustainable energy provisioning for resource-constrained edge devices.

In many IoT scenarios, especially those involving wide-area deployments such as power grid monitoring [4] and environmental sensing [5], devices generate time-sensitive data streams that must be collected promptly to support intelligent decision-making [6]. However, most IoT devices are energy-constrained and are often deployed in inaccessible environments, making manual battery replacement impractical. To mitigate this power bottleneck, radio-frequency (RF) wireless power transfer (WPT) has been extensively explored to extend node lifetime and sustain long-term operation [7,8]. In addition, Simultaneous Wireless Information-and-Power Transfer (SWIPT) techniques [9,10] co-deliver energy and data over a shared RF spectrum, thereby improving both spectral and energy efficiency.

To enhance scalability and responsiveness, UAVs have been introduced as mobile relays or data ferries in IoT networks [11,12]. Owing to their high mobility, fast deployment, and adaptive coverage capability, UAVs are well-suited for dynamic network scenarios where fixed infrastructure is unavailable or insufficient. In particular, rotary-wing UAVs can operate in a fly–hover–communicate mode, enabling them to visit sensor nodes, collect data, and wirelessly charge nearby devices simultaneously during each hovering stage [13,14].

However, designing control policies for such UAV-assisted WPT-enabled networks is challenging due to the inherent trade-offs between conflicting system objectives. For example, maximizing data throughput often requires the UAV to move rapidly and stay close to target devices, while maximizing harvested energy prefers longer hovering times and proximity to multiple devices. Meanwhile, minimizing UAV propulsion energy favors smooth, low-speed trajectories. These objectives are often mutually conflicting and cannot be simultaneously optimized without compromises [15,16].

Despite the inherent multi-objective nature of UAV-assisted wireless powered IoT networks, many existing studies have primarily adopted single-objective formulations or employed fixed priority weights across multiple objectives [17,18]. Such formulations inherently neglect the temporal variability and dynamic trade-offs that emerge in practical deployments.

In real-world scenarios, system-level priorities are often non-stationary and evolve over time due to changing operational contexts—such as the urgency of data delivery triggered by buffer overflows, critical energy replenishment needs for near-depleted nodes, or rapidly fluctuating channel and mobility conditions. These dynamic factors fundamentally alter the relative importance of competing objectives, making static optimization strategies insufficient.

To address this challenge, we revisit the MOO formulation for UAV-assisted wireless powered IoT networks. We adopt a deep reinforcement learning (DRL) framework based on the DDPG algorithm and propose a dynamic-weight mechanism that enables the UAV agent to adaptively adjust its preference among multiple objectives based on real-time environment observations. This design allows the UAV to coordinate its trajectory, hovering decisions, and service scheduling more flexibly in response to fluctuating task demands.

This work makes the following contributions:

We consider a UAV-assisted wireless powered IoT network in which IoT devices generate real-time data and require sustainable energy supply. A rotary-wing UAV equipped with a full-duplex hybrid access point (HAP) follows a fly–hover–communicate protocol to collect data from target devices and wirelessly charge surrounding nodes within its coverage.
We formulate a dynamic multi-objective optimization problem that maximizes the system sum rate and harvested energy while minimizing UAV propulsion energy consumption. To solve it, we extend the classic DDPG by introducing a multi-dimensional reward and a state-conditioned weight generator (WeightNet). Unlike fixed or manually tuned scalarization, WeightNet adaptively outputs preferences at each step. Trained with the actor–critic, it performs online scalarization of the vector reward, enabling adaptive preference learning in time-varying UAV-assisted IoT networks.
Simulation results confirm that our proposed DW-MODDPG approach achieves better flexibility and coordination across multiple objectives. By adjusting weight outputs in real time, the UAV policy exhibits strong generalization and responsiveness to different task priorities.

The remainder of this paper is organized as follows. Section 2 surveys reinforcement learning for UAVs and pertinent work on multi-objective optimization. Section 3 specifies the system architecture and formalizes the optimization problem. Section 4 summarizes the reinforcement learning preliminaries used in our method. Section 5 develops the DW-MODDPG framework and describes its network architecture. Section 6 presents the experimental setup and analyzes the results. Section 7 concludes this paper.

2. Related Work

UAVs have become a practical means to scale RF-powered Internet-of-Things deployments because of their agility, infrastructure-light rollout, and reach to dispersed nodes. By operating as a mobile platform, a single UAV can collect information while also delivering radio-frequency energy. Building on this capability, recent studies propose co-designed protocols and control policies, thereby mitigating power scarcity and link constraints in low-power devices [8,19].

2.1. UAV-Enabled Wireless Powered IoT Systems

A key obstacle for UAV-assisted IoT is synchronizing RF energy transfer and data collection when both flight time and propulsion/communication energy are constrained. Initial work mainly used the harvest-then-transmit (HTT) protocol [13]—devices harvest first, then transmit—simplifying coordination but frequently wasting airtime and energy. To improve spectral and energy efficiency, recent studies propose using full-duplex HAPs on rotary-wing UAVs to enable SWIPT [14], allowing concurrent energy delivery and data reception during hovering periods.

While RF-based WPT and SWIPT are widely adopted in UAV-assisted IoT networks due to their compatibility with existing communication systems, other WPT technologies have also been explored for specific applications. For instance, laser-based WPT offers high-power density and long-range transmission capabilities, making it suitable for scenarios requiring concentrated energy delivery over distance [20]. However, it requires precise alignment and is susceptible to atmospheric conditions. Microwave power transfer can achieve longer ranges than inductive coupling and is less sensitive to alignment, but it involves larger antennas and stricter regulatory constraints [21]. Additionally, in the domain of magnetic resonance coupling, recent designs such as the star-shaped coil array proposed by Pahlavan et al. [22] demonstrate significant improvements in rotation tolerance for free-moving receivers. This innovative transmitter configuration uses overlapping coil layers to maintain power transfer efficiency even when the receiver rotates up to

90^{\circ}

, addressing a key challenge in dynamic WPT applications. These alternative WPT systems may complement RF-based approaches in hybrid energy delivery architectures, especially in environments with diverse node distributions and energy demands. A comprehensive review of emerging WPT technologies can be found in [23], which discusses their principles, challenges, and applicability in sustainable IoT networks.

2.2. Optimization Objectives and Trade-Offs

Recent work [24] formulates a multi-UAV backscatter network and minimizes the long-term average age of information (AoI) via Lyapunov-guided scheduling and trajectory planning, highlighting the coupling between mobility, access control, and freshness under resource constraints. In wireless powered designs, energy/throughput trade-offs are explicit: [25] investigates a rotary-wing UAV equipped with a full-duplex hybrid access point (HAP) and multiple antennas, deriving flight/hover/communication strategies and optimization formulations—such as throughput maximization and total-time minimization—under FD WPT with concurrent information collection during hovering. Integrated WPT–MEC architectures further co-optimize UAV trajectories, energy transfer, and computation/offloading, typically casting the problem as a joint optimization with multiple, and often competing, metrics [26].

On the learning side, multi-objective control is often handled by actor–critic training with scalarized (weighted) objectives in UAV–MEC settings; for example, Liu et al. model computation offloading as a multi-objective MDP and employ a multi-objective deep RL method to balance delay and energy in dynamic environments [27]. Parallel advances in the MORL community propose policy learning over Pareto trade-offs and preference conditioning, but most domain-specific implementations in UAV–IoT still rely on scalarization to reconcile throughput, harvested energy, and propulsion costs.

2.3. Reinforcement Learning for UAV Control

With the rise of artificial intelligence in wireless communications, DRL has been increasingly employed to handle the complexity and uncertainty of UAV-assisted networks [28,29]. DRL-based methods enable UAVs to learn policies through interactions with the environment, without requiring prior knowledge of full system models. Notably, DDPG [30] and its variants have been applied for continuous trajectory control and dynamic scheduling in edge-enabled or real-time sensing scenarios [31,32].

Despite their success, existing DRL approaches often assume fixed objective preferences or pre-defined reward weights, limiting their ability to respond to mission-level priority changes. For instance, in [33,34], UAV swarms were trained to achieve communication coverage and fairness, yet objective trade-offs were manually specified. Similarly, AoI minimization frameworks consider static urgency levels across the mission.

3. System Description and Problem Formulation

In this section, we consider a UAV-assisted, wireless powered IoT framework that couples downlink wireless power transfer with uplink data acquisition. Building on this model, we pose a unified MOO problem that jointly balances competing goals—enhancing sum data rate, increasing harvested energy, and reducing UAV energy expenditure—under the constraints of the proposed architecture.

3.1. System Model

Following the above overview, we instantiate the network considered in this work. As depicted in Figure 1, a rotary-wing UAV serves a set of single-antenna IoT devices distributed over a bounded task region. The UAV acts as a HAP equipped with two RF antennas. Due to its finite onboard energy, the operation is constrained within a mission horizon

T > 0

. We adopt the fly–hover–communicate procedure: the UAV activates communication only during hovering, remaining silent in flight. At any hover location, the HAP operates in full duplex—one antenna continuously delivers downlink wireless power transfer, while the other simultaneously receives uplink data acquisition from the scheduled devices.

(1) IoT devices: Let the set of IoT devices be denoted by

J = {1, 2, \dots, J}

. The devices are randomly deployed on the ground, with the coordinates of device j specified as

[x_{j}, y_{j}]

. Each device is assumed to monitor certain physical phenomena (e.g., temperature, gas levels, equipment status) and generates sensor data in real time.

Each device maintains a local buffer for temporarily storing the sensed data. Let

l_{j} (t)

denote the buffer length (i.e., amount of untransmitted data) of device j at time t, where

t \in [0, T]

. The buffer is updated at discrete time intervals

Δ t

according to

l_{j} (t + Δ t) = l_{j} (t) + λ_{j} (t) Δ t,

(1)

where

λ_{j} (t)

is the instantaneous data generation rate at time t. We model

λ_{j} (t)

as a Poisson-distributed random variable, with the rate parameter specific to each device to capture heterogeneous sensing characteristics.

The data buffer capacity is hardware-limited and uniformly bounded by

l_{max}

for all devices. When

l_{j} (t)

reaches

l_{max}

, any additional data either overwrites existing content or is dropped, resulting in data loss. Therefore, timely uplink transmission is essential to avoid buffer overflow and ensure reliable data acquisition.

Assuming time division multiple access (TDMA) for uplink scheduling, each device transmits data at a constant power

P_{u}

. Let Q denote the total size corresponding to a full buffer. Then, the amount of data pending transmission for device j at time t is given by

Q_{j} (t) = \frac{l_{j} (t)}{l_{max}} Q .

(2)

Given that both buffer occupancy and data generation rate vary across devices, their urgency for data transmission naturally differs. We define the data upload priority for device j at time t as

q_{j}^{u} (t) = λ_{j} (t) \cdot \frac{l_{j} (t)}{l_{max}} .

(3)

The priority metric integrates both the current buffer utilization and the instantaneous data generation rate, reflecting the urgency of preventing future buffer overflow. Devices with higher data arrival rates and fuller buffers are therefore assigned greater transmission urgency. This priority

q_{j}^{u} (t)

determines the UAV’s hovering targets, which in turn directly affects the total uplink data throughput

R_{sum}

.

(2) UAV model: The UAV operates at a constant altitude

H > 0

throughout its mission. Let

[x_{u} (t), y_{u} (t)]

denote its horizontal coordinates at time t, while its vertical position remains unchanged and is thus omitted from the notation. The UAV adaptively updates its trajectory in real time based on environmental feedback and system objectives. Its motion is characterized by the flight speed

v (t)

and yaw angle

θ (t)

, where

θ (t) \in [- π, π]

and

v (t)

is constrained by a maximum velocity

v_{max}

, typically set to

20 m / s

.

The propulsion energy consumption of the UAV is modeled based on practical rotary-wing dynamics. The instantaneous propulsion power required at speed V is given by [35]

P (V) = P_{0} (1 + \frac{3 V^{2}}{U_{tip}^{2}}) + P_{i} {(\sqrt{1 + \frac{V^{4}}{4 v_{0}^{4}}} - \frac{V^{2}}{2 v_{0}^{2}})}^{\frac{1}{2}} + \frac{1}{2} d_{0} ρ s A V^{3} .

(4)

where the total propulsion power consists of three components: (1) the blade profile power

P_{0}

, (2) the induced power

P_{i}

, and (3) the parasite power due to air resistance. Here,

U_{tip}

denotes the rotor blade tip speed,

v_{0}

is the mean induced velocity in hover,

d_{0}

is the fuselage drag coefficient,

ρ

is the air density, s is the rotor solidity, and A is the rotor disc area. Let

V = 0

, the UAV is hovering, and its power consumption simplifies to

P_{hov} = P_{0} + P_{i} .

(5)

We assume that the UAV’s wireless power transfer and data collection capabilities are limited to a circular region around its location. Due to signal attenuation and power constraints, only devices within a certain range can be effectively served. Let

D_{dc}

and

D_{eh}

denote the maximum effective radii for data collection and energy harvesting, respectively.

At each decision epoch, the UAV selects one target device for data upload. When the target device falls within the radius

D_{dc}

, the UAV hovers at an appropriate location and operates in full-duplex mode: it receives data from the selected device and simultaneously transmits energy to other nearby devices within the

D_{eh}

range. This process continues until the data upload is completed. We denote

P_{d}

as the constant downlink transmission power used by the UAV for wireless energy transfer.

The UAV’s trajectory, determined by

v (t)

and

θ (t)

, affects both the propulsion energy

E_{total}^{c}

and the data collection efficiency

R_{sum}

.

(3) Channel model: We adopt a practical air-to-ground (A2G) communication model that incorporates both line-of-sight (LoS) and non-line-of-sight (NLoS) components. Let

h_{j} (t)

and

g_{j} (t)

denote the downlink and uplink channel power gains between the UAV and IoT device j at time t, respectively.

The path loss between the UAV and ground device j is modeled as

L_{j} (t) = \{\begin{matrix} γ_{0} d_{j} {(t)}^{- \tilde{α}}, & LoS link, \\ μ^{NLoS} γ_{0} d_{j} {(t)}^{- \tilde{α}}, & NLoS link, \end{matrix}

(6)

where

γ_{0} = {(\frac{4 π f_{c}}{c})}^{- 2}

denotes the reference path gain at unit distance (

d_{0} = 1 m

), with

f_{c}

and c representing the carrier frequency and speed of light, respectively. The term

\tilde{α}

is the path loss exponent, and

μ^{NLoS}

accounts for additional attenuation in NLoS conditions.

The LoS probability between the UAV and device j is modeled as a function of the elevation angle

θ_{j} (t)

:

P_{j}^{LoS} (θ_{j} (t)) = \frac{1}{1 + a exp (- b (θ_{j} (t) - a))} .

(7)

where a and b are constants determined by the propagation environment and operating frequency. The elevation angle in degrees is given by

θ_{j} (t) = \frac{180}{π} {sin}^{- 1} (\frac{H}{d_{j} (t)}) .

(8)

with the Euclidean distance

d_{j} (t)

between UAV and device j computed as

d_{j} (t) = \sqrt{H^{2} + {(x_{u} (t) - x_{j})}^{2} + {(y_{u} (t) - y_{j})}^{2}} .

(9)

Accordingly, the NLoS probability is

P_{j}^{NLoS} (t) = 1 - P_{j}^{LoS} (t)

. Assuming channel reciprocity, the effective channel gain between the UAV and device j is approximated as

h_{j} (t) \approx g_{j} (t) = (P_{j}^{LoS} (θ_{j} (t)) + μ^{NLoS} P_{j}^{NLoS} (θ_{j} (t))) γ_{0} d_{j} {(t)}^{- \tilde{α}} .

(10)

The channel gain determines the achievable uplink rate

R_{k}

and thus influences the hovering duration and total throughput

R_{sum}

.

(4) Energy harvesting model: While hovering, the UAV operates in full-duplex mode by receiving uplink data from the selected device and simultaneously transmitting RF energy at a fixed power

P_{d}

to nearby devices. All IoT devices within the UAV’s energy transfer radius

D_{eh}

—excluding the target device—are eligible to harvest energy.

The received RF power at device j is calculated as

P_{j}^{r} (t) = {| h_{j} (t) |}^{2} P_{d}, if Δ d_{j} (t) \leq D_{eh} .

(11)

To reflect circuit-level non-idealities, we adopt a nonlinear energy harvesting (EH) model, which captures the saturation effect at high input power. The harvested DC power at device j is given by

P_{j}^{h} (t) = \frac{P_{limit} e^{c d} - P_{limit} e^{- c (P_{j}^{r} (t) - d)}}{e^{c d} (1 + e^{- c (P_{j}^{r} (t) - d)})} .

(12)

where

P_{limit}

is the maximum achievable output power of the EH circuit, and

c, d

are positive constants determined by the EH circuit’s hardware characteristics. The harvested energy

P_{j}^{h} (t)

accumulates to

E_{total}^{h}

, which is one of the MOO objectives.

3.2. Problem Formulation

In UAV-assisted wireless powered IoT networks, trajectory planning and hovering decisions must jointly account for the service urgency of IoT devices, the avoidance of data loss, and the UAV’s limited energy budget. To address these requirements, we design an adaptive task scheduling strategy in which the UAV dynamically selects the next target based on real-time data urgency.

At each time step t, the UAV determines the target device

\hat{j} (t)

according to the highest data transmission priority:

\hat{j} (t) = arg max_{j} q_{j}^{u} (t),

(13)

where

q_{j}^{u} (t)

is defined in Equation (3). When the UAV approaches within the data collection range

D_{dc}

of the target device (i.e.,

d_{j} (t) \leq D_{dc}

), it hovers to initiate simultaneous uplink data reception and downlink energy transfer.

Let K denote the total number of UAV hovering events within the mission duration T, and let

k \in {1, 2, \dots, K}

index these events. Denote by

j_{k}

the target device associated with the k-th hovering. The instantaneous uplink transmission rate during hovering is given by

R_{k} = W {log}_{2} (1 + \frac{P_{u} {| g_{j_{k}} (t) |}^{2}}{σ_{n}^{2}}) .

(14)

where W is the system bandwidth,

P_{u}

is the transmit power of the IoT device, and

σ_{n}^{2}

is the noise power at the UAV receiver. The required hovering time to transmit the remaining data

Q_{j_{k}} (t)

is computed as

t_{k} = \frac{Q_{j_{k}} (t)}{R_{k}} .

(15)

During each hovering session, the UAV also transfers energy to other nearby devices (within

D_{eh}

) except

j_{k}

. Based on the harvested power

P_{j}^{h} (t)

defined in Equation (12), the energy harvested at device j over the duration

t_{k}

is

E_{j} = P_{j}^{h} (t) \cdot t_{k}, \forall j \neq j_{k}, d_{j} (t) \leq D_{eh} .

(16)

The total harvested energy during the k-th hovering event is

E_{k} = \sum_{\begin{matrix} j \neq j_{k} \\ d_{j} (t) \leq D_{eh} \end{matrix}} E_{j} .

(17)

Over the entire task period, the cumulative data throughput and energy harvested are, respectively,

\begin{matrix} R_{sum} & = \sum_{k = 1}^{K} R_{k}, \end{matrix}

(18)

\begin{matrix} E_{h}^{total} & = \sum_{k = 1}^{K} E_{k} . \end{matrix}

(19)

In contrast, the total propulsion energy consumed by the UAV over T seconds is calculated as

E_{total}^{c} = \int_{0}^{T} P (v (t)) d t .

(20)

where

P (v (t))

is defined in Equation (4). Since the UAV’s transmit power

P_{d}

is assumed to be constant, the energy cost for communication is not included in the optimization objective.

Therefore, the joint trajectory and hovering policy optimization problem can be formulated as a MOO problem:

\begin{matrix} P_{1} : max_{v (t), θ (t)} & (R_{sum}, E_{total}^{h}, - E_{total}^{c}), \end{matrix}

(21)

The optimization variables are the UAV horizontal velocity

v (t)

and yaw angle

θ (t)

over the mission period

t \in [0, T]

, subject to

\begin{matrix} 0 \leq v (t) \leq v_{max}, - π \leq θ (t) \leq π . \end{matrix}

(22)

The search space is continuous in both

v (t)

and

θ (t)

. For numerical implementation, the mission period T is discretized into N time steps, resulting in a

2 N

-dimensional optimization problem.

The UAV’s operational goal involves simultaneously maximizing system throughput, enhancing energy harvesting efficiency, and minimizing propulsion energy consumption. However, these objectives often lead to conflicting decisions, requiring careful trade-off management.

On one hand, increasing the sum data rate

R_{sum}

requires the UAV to visit more devices within the mission duration. This suggests a need for high-speed flight to accommodate a larger number of hovering instances (K), as well as positioning the UAV close to the target device to reduce communication delay and improve link quality. Ideally, hovering directly above the target device offers the best channel conditions for uplink transmission.

On the other hand, to maximize total harvested energy

E_{h}^{total}

, it is preferable for the UAV to maximize the number of devices within its energy transfer range

D_{eh}

during each hovering. This typically requires the UAV to hover at a position that balances proximity to multiple devices, which may not coincide with the optimal position for uplink throughput.

Additionally, minimizing energy consumption

E_{c}^{total}

favors flight at the maximum-endurance (ME) speed

V_{ME}

, where the propulsion power is minimized. However, flying at this energy-optimal speed may reduce the number of devices visited within T and may even result in data overflow at some IoT nodes due to delayed service.

These observations show that the three optimization goals are inherently conflicting. The UAV must make decisions under partial observability, dynamic topology, and heterogeneous device demands, making conventional model-based approaches (e.g., dynamic programming or exhaustive search) computationally prohibitive.

In continuous-state and continuous-action problems like UAV trajectory and hovering, traditional multi-objective optimization methods face severe limitations. Methods requiring discretization of state–action spaces suffer from the curse of dimensionality, leading to exponential growth in computational complexity. They also assume fully known dynamics and static environments, unrealistic in wireless IoT networks with time-varying channels, stochastic energy harvesting, and dynamic device demands. Consequently, these methods either cannot scale to high-dimensional problems or yield suboptimal trajectories.

Model-free DRL offers a flexible alternative. By representing policies with neural networks, DRL handles high-dimensional continuous state and action spaces directly, enabling smooth control of UAV speed and heading. It learns from environment interactions, adapting to partially observable and stochastic conditions, and captures trade-offs among conflicting objectives through vectorized rewards. This alleviates the computational burden of exhaustive search while achieving scalable, adaptive, and near-optimal solutions.

In this work, we use DRL to optimize UAV trajectory and hovering. DDPG is adopted due to its support for continuous actions. UAV control variables—flight speed

v (t)

and yaw angle

θ (t)

—are continuous, making DDPG suitable. Conventional DDPG relies on a scalar reward, which cannot reflect multi-objective trade-offs. We extend DDPG to a multi-objective framework, modeling rewards as vectors and integrating a dynamic weighting mechanism to adjust objective importance based on observations. This enables the UAV to learn adaptive policies that coordinate competing objectives, demonstrating DRL’s advantage in continuous multi-objective optimization.

4. Preliminaries

Reinforcement learning (RL) [36,37] addresses sequential decision-making via trial–feedback interaction between an agent and its environment. Unlike supervised learning, no labeled input–output pairs or full environment model are assumed; learning is instead driven by evaluative scalar rewards that reflect the consequences of chosen actions.

To make this interaction precise and to reason about long-term consequences, RL is commonly formalized within a Markov Decision Process (MDP). In the standard formulation, an MDP is denoted by

〈 S, A, r, p, γ 〉

, where S and A are the state and action spaces,

r (s, a)

is the one-step reward,

p (s^{'} ∣ s, a)

specifies the transition dynamics, and

γ \in (0, 1)

discounts future outcomes.

Under this MDP formalism, the learning goal is to obtain a policy

π (a ∣ s)

that maximizes the expected long-term return. The discounted return at time t is

G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k + 1},

(23)

where

r_{t + k + 1}

is the reward received k steps after time t. The agent seeks an optimal policy

π^{*}

that maximizes the expected return:

π^{*} = arg max_{π} E [G_{t} | π] .

(24)

To facilitate policy learning, RL algorithms often utilize a value function, which estimates the expected return under a given policy. A key variant is the action-value function, or Q-function, denoted as

Q^{π} (s, a)

, which represents the expected return when the agent takes action a in state s and then follows policy

π

:

Q^{π} (s, a) = E_{π} [G_{t} | s_{t} = s, a_{t} = a],

(25)

which quantifies the long-term utility of taking action a at state s and thereafter following

π

. Reliable value estimation underpins systematic policy improvement in complex and uncertain environments.

Early algorithms employed look-up tables or simple function approximators. A representative example is Q-learning [38], which updates a Q-table to favor actions with larger estimated long-term returns; however, tabular methods become infeasible in large or continuous spaces. Deep neural networks alleviate this limitation by serving as expressive approximators. The Deep Q-Network (DQN) [39] parameterizes

Q (\cdot)

with a neural network and minimizes the squared temporal-difference error

L (θ_{Q}) = E [{(y_{t} - Q (s_{t}, a_{t} ∣ θ_{Q}))}^{2}],

(26)

where a common target is

y_{t} = r (s_{t}, a_{t}) + γ max_{a^{'}} \tilde{Q} (s_{t + 1}, a^{'} ∣ θ^{-}),

(27)

with

\tilde{Q} (\cdot ∣ θ^{-})

denoting a slowly updated target network. Despite its success, DQN is tailored to discrete action spaces. For continuous control, policy-gradient and actor–critic methods [40] are typically employed: a critic estimates value signals, and an actor updates the policy using the critic’s feedback. The DDPG algorithm [30] instantiates this paradigm with deep function approximation and deterministic policies, enabling efficient learning in high-dimensional continuous domains and seeing broad adoption in robotics, control, and autonomous systems.

5. Method

To address the multi-objective optimization problem, we cast the joint data acquisition and wireless energy delivery task as an MDP and solve it with a DRL approach. This section specifies the environment—state variables, action set, and reward signal—which together define the agent’s interaction loop.

(1) Environment formulation: We treat the UAV as a decision-making agent operating in a dynamic, partially observable, wireless powered IoT field. At each discrete time t, the agent receives an observation

s_{t}

, selects an action

a_{t}

, and obtains a reward

r_{t}

. The learning objective is to derive a policy that maximizes the long-horizon return while trading off data-collection efficiency, harvested energy, and the UAV’s own energy expenditure.

(2) State space design: In practice, it is unrealistic for the UAV to acquire complete knowledge of the global network due to the scale and dynamics of the IoT environment. To maintain practicality and scalability, we construct a compact and informative state representation based only on local and essential observations. The state at time t is defined as

s_{t} = [Δ x_{\hat{j}} (t), Δ y_{\hat{j}} (t), x_{u} (t), y_{u} (t), N_{out} (t), N_{loss} (t)],

(28)

where

(Δ x_{\hat{j}} (t), Δ y_{\hat{j}} (t))

denotes the horizontal distance between the UAV and the currently targeted IoT device

\hat{j}

in Cartesian coordinates;

(x_{u} (t), y_{u} (t))

is the UAV’s absolute position;

N_{out} (t)

is the number of times the UAV exceeds the operational boundary; and

N_{loss} (t)

indicates the number of IoT devices experiencing buffer overflows. This compact state captures both the spatial information for navigation and critical performance indicators for guiding behavior. It enables the UAV to perceive its environment adequately while reducing observation and computation complexity.

(3) Action space design: The UAV’s movement is modeled with continuous actions to allow fine-grained control. At time t, the action is represented by

a_{t} = [v (t) cos (θ (t)), v (t) sin (θ (t))],

(29)

where

v (t) \in [0, v_{max}]

is the UAV’s speed and

θ (t) \in [- π, π]

is the yaw angle. This vector representation simplifies learning and facilitates smooth updates through gradient-based methods.

(4) Reward function design with adaptive weights: The UAV receives immediate feedback in the form of a 4-dimensional reward vector that corresponds to our multi-objective problem:

r_{t} = [r_{dc} (t), r_{eh} (t), r_{ec} (t), r_{aux} (t)],

(30)

where:

-: $r_{dc} (t)$ encourages high data rate during hovering:

r_{d c} (t) = \{\begin{matrix} 100 \times R^{k}, & at UAV ’ s k^{t h} hovering, \\ 0, & otherwise . \end{matrix}

(31)

-: $r_{eh} (t)$ rewards efficient energy transfer to multiple devices:

r_{e h} (t) = \{\begin{matrix} 100 \times (E^{k} + \sum_{j = 0}^{J} I_{d_{j} (t) \leq D_{e h}}), & at UAV ’ s k^{t h} hovering, \\ 0, & otherwise . \end{matrix}

(32)

-: $r_{ec} (t)$ penalizes propulsion energy consumption:

r_{ec} (t) = \{\begin{matrix} - P_{hov}, & if hovering, \\ - P (v (t)), & otherwise, \end{matrix}

(33)

-: $r_{aux} (t)$ is an auxiliary reward:

r_{aux} (t) = - d_{j}^{x} (t) - d_{j}^{y} (t) - N_{f} (t) - N_{d} (t) .

(34)

It is evident that

r_{aux} (t)

reflects the spatial relationship between the UAV and its designated target. When the UAV is distant from the target device, this value becomes more negative, thereby implicitly guiding the UAV to approach the intended location. Furthermore, if the UAV violates operational boundaries or causes data overflow at IoT terminals due to delayed data acquisition, it will incur an additional penalty. This auxiliary reward component is designed to penalize undesirable flight behaviors, encouraging the UAV to accomplish fundamental operational goals regardless of how the main optimization objectives are weighted.

Finally, these multi-dimensional rewards are aggregated into a scalar reward using dynamically adjusted weights:

R_{t} = ω_{1} (t) \cdot r_{dc} (t) + ω_{2} (t) \cdot r_{eh} (t) + ω_{3} (t) \cdot r_{ec} (t) + r_{aux} (t) .

(35)

The weight vector

ω (t)

is generated by a neural network conditioned on the current state

s_{t}

, enabling the UAV to flexibly shift focus across objectives (e.g., favoring energy harvesting under low battery or throughput under strict latency), thereby improving decision flexibility and adaptability.

Unlike Pareto-front MORL approaches that approximate all non-dominated solutions at high computational cost, our method directly learns a single adaptive policy for dynamic environments. Preference-based MORL usually relies on fixed or user-specified weights, whereas WeightNet adaptively generates scalarization weights from the current state, enabling context-aware trade-offs without manual specification. This makes DW-MODDPG well-suited for UAV-assisted WPT networks, where objective priorities change with time and network conditions.

5.1. DW-MODDPG Algorithm

To address the continuous control problem in the formulated MOO framework, we propose a DW-MODDPG algorithm. This algorithm extends the standard DDPG by incorporating a dynamic weight adaptation module to handle vector-valued rewards, enabling adaptive preference adjustment across multiple objectives in real time. The overall framework of the DW-MODDPG algorithm is illustrated in Figure 2.

5.1.1. Network Architecture and Initialization

The learning agent maintains two primary neural networks, consistent with the actor–critic paradigm:

The actor network $μ (s | θ^{μ})$ maps the current state s to an action a, representing the control policy.
The critic network $Q (s, a | θ^{Q})$ estimates the expected cumulative reward given state–action pairs, providing feedback for policy improvement.

All network parameters are initialized with a truncated normal distribution centered at zero, with standard deviation scaled by the inverse square root of the input dimension. Bias terms are initialized at

0.001

to facilitate early gradient propagation. To stabilize training, target networks

μ^{'}

and

Q^{'}

are maintained as slowly updated copies of the main networks.

5.1.2. Vector-Valued Rewards and Adaptive Weighting

Unlike standard DDPG, which operates on scalar rewards, DW-MODDPG handles vector-valued rewards:

r_{t} = [r_{dc}, r_{eh}, r_{ec}, r_{aux}],

(36)

representing multiple objectives such as data collection, energy harvesting, energy consumption, and auxiliary costs. Instead of using static weights, we introduce a lightweight neural network called WeightNet, parameterized by

θ^{w}

, to dynamically generate a weight vector based on the current state:

ω_{t} = WeightNet (s_{t}; θ^{w}),

(37)

where

ω_{t} = [ω_{1} (t), ω_{2} (t), ω_{3} (t)]

. The scalar reward used for policy optimization is computed as

r_{t}^{total} = ω_{t}^{⊤} \cdot [r_{dc}, r_{eh}, r_{ec}] + r_{aux} .

(38)

This dynamic weighting mechanism allows the agent to adaptively emphasize different objectives depending on the environment, e.g., prioritizing energy saving when battery levels are low or maximizing throughput when data backlog occurs.

5.1.3. Training Procedure

The DW-MODDPG algorithm follows a standard actor–critic learning cycle with adaptive weighting. At each time step, the agent observes state

s_{t}

, generates adaptive weights

ω_{t}

, selects an action

a_{t}

perturbed by Gaussian noise for exploration, executes the action, and observes the next state

s_{t + 1}

and reward vector

r_{t}

. The scalar reward

r_{t}^{total}

is then computed using WeightNet.

Critic training minimizes the mean squared error between the estimated Q-value and the target value from target networks:

y_{i} = r_{i}^{total} + γ Q^{'} (s_{i + 1}, μ^{'} (s_{i + 1} | θ^{μ^{'}}) | θ^{Q^{'}}),

(39)

L_{Q} = E_{i} [{(Q (s_{i}, a_{i} | θ^{Q}) - y_{i})}^{2}] .

(40)

For the actor network, the policy is updated to maximize the expected Q-value:

L_{μ} = - E_{i} [Q (s_{i}, μ (s_{i} | θ^{μ}) | θ^{Q})] .

(41)

Target networks are softly updated to improve stability:

θ^{Q^{'}} \leftarrow τ θ^{Q} + (1 - τ) θ^{Q^{'}}, θ^{μ^{'}} \leftarrow τ θ^{μ} + (1 - τ) θ^{μ^{'}},

(42)

where

τ ≪ 1

is the soft update coefficient. Gaussian noise

N (0, σ^{2})

is applied to actions to encourage exploration, with

σ

decaying over time.

5.1.4. Algorithmic Steps

Algorithm 1 summarizes the DW-MODDPG training procedure, highlighting the dynamic weight adaptation at each step to balance multiple objectives.

Algorithm 1 MODDPG with dynamic WeightNet.

1:: Initialize actor $μ$ , critic Q, target networks $μ^{'}$ , $Q^{'}$ and WeightNet
2:: Initialize replay buffer $D$
3:: for each episode do
4:: Update the environment status and observe the current state s
5:: for each step t do
6:: Generate adaptive weights $ω_{t} = WeightNet (s_{t})$
7:: Select action $a_{t} = μ (s_{t} | θ^{μ}) + N (0, σ^{2})$
8:: Execute $a_{t}$ , observe next state $s_{t + 1}$ and reward vector $r_{t}$
9:: Compute scalar reward $r_{t}^{total} = ω_{t}^{⊤} [r_{dc}, r_{eh}, r_{ec}] + r_{aux}$
10:: Store transition $(s_{t}, a_{t}, r_{t}^{total}, s_{t + 1})$ into replay buffer $D$
11:: if update then
12:: Randomly sample a mini-batch transitions from $D$
13:: Compute target values $y_{i}$ :

$y_{i} = \{\begin{cases} r_{i} \cdot w^{' T}, & if terminal s_{i + 1} \\ r_{i} \cdot w^{' T} + γ Q^{'} (s_{i + 1}, μ^{'} (s_{i + 1} | θ^{μ^{'}}) | θ^{Q^{'}}), & otherwise \end{cases}$
14:: Update critic using loss $L_{Q}$ Equation (40)
15:: Update actor using loss $L_{μ}$ Equation (41)
16:: Soft update target networks Equation (42)
17:: Decay action randomness: $σ^{2} \leftarrow σ^{2} \cdot η$
18:: end if
19:: end for
20:: end for

By integrating dynamic weight adaptation with the DDPG framework, DW-MODDPG achieves flexible and efficient multi-objective optimization in continuous control tasks, overcoming the limitations of conventional scalar-reward DDPG and static multi-objective strategies.

The introduction of the dynamic WeightNet potentially introduces non-stationarity to the learning environment, as the reward function’s scalarization changes over time. However, this challenge is mitigated by two key design choices: (i) the WeightNet is trained jointly with the actor–critic networks, ensuring all components optimize towards a consistent long-term goal; and (ii) the use of target networks with soft updates stabilizes the Q-learning process against such variations. Empirical results in Section 6 demonstrate the stable convergence achieved by our approach.

6. Simulation Results and Discussion

6.1. Simulation Settings

To evaluate the performance of the proposed DW-MODDPG algorithm, we design a simulation environment that emulates a UAV-assisted wireless powered IoT network. The UAV operates within a 400 m × 400 m square area at a fixed altitude of 10 m and maximum flight speed

v_{max} = 20

m/s [41]. The UAV is responsible for collecting data and wirelessly transferring energy to ground IoT devices. The communication and energy transfer coverage radii are set as

D_{d c} = 10

m and

D_{e h} = 30

m, respectively. The transmit power of the UAV is configured to

P_{d} = 40

dBm, while the IoT devices transmit with power

P_{u} = - 20

dBm [42]. The wireless bandwidth W is 1 MHz and the noise power is set to

σ_{n}^{2} = - 90

dBm.

IoT devices are randomly deployed across the area. Their data buffers are updated every second, with data accumulation following a Poisson distribution. The arrival rate is randomly drawn from the set {4, 8, 15, 20} packets per second. The maximum buffer size is limited to

l_{max} = 5000

packets. Each data transmission task corresponds to a packet size of

Q = 10

Mbits. The parameters of the energy harvesting model adopt the nonlinear model configuration from [15].

Both the policy and value networks are multilayer perceptrons with ReLU activations in the hidden layers; the actor’s output layer adopts tanh. In contrast to prior studies that used fixed scalarization coefficients for multi-objective rewards, we employ a lightweight auxiliary network to generate a state-conditioned, time-varying weight vector

ω_{t} = {[ω_{d c}, ω_{e h}, ω_{e c}]}^{⊤}

. This module is trained jointly with the actor–critic so that reward scalarization adapts online to environmental changes and mission priorities. The scalarized reward at time t is

r_{t}^{total} = ω_{t}^{⊤} {[r_{d c}, r_{e h}, r_{e c}]}^{⊤}

. Key simulation settings are reported in Table 1, and detailed architectural and optimizer configurations are provided in Table 2.

To validate the performance and convergence of our proposed DW-MODDPG algorithm, we first examine its training dynamics. As shown in Figure 3, the per-episode accumulated reward increases markedly, with variance diminishing and the moving average stabilizing over time. Moreover, the critic network’s loss initially oscillates but then decreases sharply, demonstrating that the network gradually learns to accurately approximate the action values. These observations jointly indicate steady policy improvement and reliable convergence of the proposed algorithm.

As can be seen from Figure 4A,B, in terms of optimization objectives, both the total data rate and the total harvested energy increase significantly as the number of training episodes grows, while the average energy consumption of the UAV gradually decreases. Specifically, the UAV learns to select target devices more efficiently during flight and to adjust its hovering positions to simultaneously facilitate data collection and energy transfer. Moreover, as the coverage radius

D_{d c}

expands, the UAV is able to serve a larger number of energy-harvesting devices while maintaining a high data rate, thereby substantially enhancing the total harvested energy. The experimental results demonstrate that the proposed algorithm can effectively manage the MOO problem of the UAV.

6.2. Performance Comparison with Benchmark Policies

To assess the effectiveness of the proposed DW-MODDPG-based control policy (denoted as

P_{DW - MODDPG}

), we compare it against two benchmark strategies from [14]:

$P_{V_{max}}$ : The UAV flies at the maximum speed $v_{max} = 20 m / s$ and hovers directly above each target device to collect data.
$P_{V_{ME}}$ : The UAV travels at the maximum endurance speed $v_{ME} = 10.2 m / s$ and also hovers above target devices for data collection.

According to [14],

P_{V_{max}}

prioritizes throughput, whereas

P_{V_{ME}}

minimizes propulsion energy consumption. Figure 5A,B report results for data-collection radii

D_{dc} \in {10, 15, 20, 25, 30} m

. Each point is averaged over 100 independent episodes to ensure statistical robustness.

Hovering directly above the target maximizes instantaneous channel gain and thus often yields high per-link data rates for

P_{V_{max}}

and

P_{V_{ME}}

. However, these fixed-speed heuristics provide no flexibility to jointly optimize EH. In contrast,

P_{DW - MODDPG}

adaptively selects hovering locations to balance multiple objectives. Notably, at

D_{dc} = 10 m

,

P_{DW - MODDPG}

achieves a higher mission-level sum data rate than both baselines by prioritizing high-rate regions and reducing unnecessary repositioning.

Regarding the number of served devices and the total harvested energy,

P_{DW - MODDPG}

consistently outperforms the benchmarks across all

D_{dc}

values. As

D_{dc}

increases, the feasible hover region enlarges, enabling the agent to choose positions that simultaneously satisfy the DC constraint while improving coverage of EH devices and lowering repositioning overhead, which in turn boosts the total harvested energy.

6.3. Effect of Weight Preferences on Control Policies

To assess the impact of different optimization preferences on UAV decision-making, we conduct additional experiments with varied weight configurations, as summarized in Table 3. In these experiments, we comparatively evaluate our proposed dynamic weight adjustment method against four benchmark control schemes. The optimized results are reported in Figure 6A,B by varying

D_{d c}

.

optDC: Prioritizes data collection with $ω_{dc} = 1.0$ and $ω_{eh} = ω_{ec} = 0.0$ .
optEH: Focuses on maximizing harvested energy, setting $ω_{eh} = 1.0$ and others to 0.0.
optEC: Aims to minimize UAV energy consumption with $ω_{ec} = 1.0$ .
optJoint: Considers all objectives equally with $ω_{dc} = ω_{eh} = ω_{ec} = 1.0$ .

It is observed from Figure 6A that as

D_{d c}

increases, the sum data rate decreases while the total harvested energy increases under all policies. The reason can be explained as follows. A larger

D_{d c}

enlarges the average distance between the UAV and the target device during data collection, which reduces the transmission rate. Meanwhile, it offers a wider choice scope of hovering positions, so that more IoT devices can be covered for energy harvesting with better channel conditions. The average flying energy consumption exhibits a mild increase in the large-

D_{d c}

region because additional maneuvers are required to balance coverage and link quality.

Compared with the equal-weight baseline, OPT-DWjoint achieves consistently better multi-objective performance across all

D_{d c}

. As shown in Figure 6A(a–c), OPT-DWjoint yields a higher sum data rate and a higher total harvested energy than optjoint at all tested

D_{d c}

while keeping the average energy consumption lower than optjoint and close to optEC. This indicates that adapting the weights to the environment state can push the operating point towards a more favorable Pareto trade-off than using a fixed equal weighting.

The underlying reasons are illustrated in Figure 6B. Compared with optjoint, OPT-DWjoint increases the total number of DC devices and the average data rate for most

D_{dc}

settings (Figure 6B(a,b)) and simultaneously improves the average number of EH devices and the sum EH rate (Figure 6B(c,d)). This suggests that the policy learned by OPT-DWjoint adapts its preference in a context-aware manner: when communication gain dominates (e.g., strong channels or small

D_{dc}

), it allocates more hovering around DC targets to raise per-link rates; when EH opportunities dominate (e.g., many EH devices or low residual energy), it adjusts hovering geometries to widen EH coverage at a modest cost to DC, thereby increasing both the number of served devices and the total harvested energy.

We benchmark three SOO policies with fixed weights—optDC(

ω_{dc} = 1

), optEH (

ω_{eh} = 1

), and optEC (

ω_{ec} = 1

)—and report results in Figure 6A,B. optDC achieves the highest sum data rate by serving more DC devices and flying closer to targets (Figure 6B(a,b)), but it yields the lowest total harvested energy due to reduced EH coverage and incurs higher propulsion energy than optEC. optEH maximizes total harvested energy via more EH devices and higher EH rate (Figure 6B(c,d)), at the cost of lower throughput and generally higher propulsion than optEC. optEC minimizes average propulsion energy (Figure 6A(c)) by conservative motion but sacrifices both throughput and harvested energy (Figure 6A(a,b)).

SOO policies expose the inherent trade-offs: optimizing one metric pushes the system to that extreme while degrading the others. This motivates multi-objective control; our dynamic-weight policy adapts preferences to context and operates in a more favorable Pareto region.

7. Conclusions

In this paper, we investigated a MOO problem in UAV-assisted wireless powered IoT networks, where the UAV simultaneously performs data collection and energy transfer. To jointly optimize the sum data rate, total harvested energy, and flying energy consumption, we proposed an enhanced DW-MODDPG algorithm with an adaptive weight scheduling mechanism. This dynamic weighting network generates objective weights in real time based on the observed environmental states, enabling the UAV to flexibly balance competing objectives under varying conditions.

We formulated a vector-valued reward function to represent multiple optimization targets and applied a state-dependent weight scheduler to dynamically scalarize the reward. Extensive simulation results validated the effectiveness of the proposed method and demonstrated its capability to adapt to different optimization preferences. Additionally, the algorithm framework supports an arbitrary number of objectives, highlighting its general applicability to broader MOO scenarios.

Despite the promising performance of the proposed DW-MODDPG algorithm, several limitations warrant consideration and pave the way for future research. First, our current model operates under the assumption of perfect channel state information (CSI) and idealized energy harvesting circuits. In practical deployments, uncertainties from channel estimation errors and hardware nonlinearities in EH components could potentially degrade performance. Integrating robust or distributionally robust optimization techniques within the DRL framework would be a valuable extension to enhance resilience against such environmental and model uncertainties.

Second, the present study focuses on a single-UAV scenario. Extending the framework to multi-UAV systems introduces critical challenges such as inter-UAV coordination, interference management, and the need for scalable learning algorithms. To address cooperation, multi-agent deep reinforcement learning architectures would be essential. A promising approach is the Centralized Training with Decentralized Execution framework. During the training phase, a centralized critic could leverage global information (e.g., all UAVs’ states and actions) to learn coordinated strategies, while each UAV’s actor network learns a policy based on its local observations. For interference management, the state space would need to be augmented to include the relative positions and transmission statuses of nearby UAVs. Furthermore, the reward function must be redesigned to explicitly penalize performance degradation caused by co-channel interference, guiding the UAVs to learn implicit interference avoidance behaviors. Finally, to enhance scalability and privacy, decentralized learning paradigms such as federated reinforcement learning could be explored, where UAVs collaboratively learn a global policy by sharing model parameters or gradients instead of raw data without relying on a central controller. Addressing these aspects through targeted algorithmic modifications would constitute a significant advancement towards deploying robust multi-UAV systems in practice.

Author Contributions

Conceptualization, X.Z. and M.Z.; Methodology, X.Z.; Software, J.H.; Validation, J.H.; Formal Analysis, X.Z.; Data Curation, J.H.; Writing—Original Draft, X.Z. and J.H.; Writing—Review and Editing, M.Z.; Supervision, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Hunan Provincial Natural Science Foundation (Grant Nos. 2025JJ90177 and 2024JJ9173) and the Science and Technology Major Special Project Fund of Changsha (Grant No. kh2401010).

Institutional Review Board Statement

Not applicable. The study did not involve humans or animals. We used data from a third party.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Gubbi, J.; Buyya, R.; Marusic, S.; Palaniswami, M. The Internet of Things (IoT): A vision, architectural elements, and future directions. Future Gener. Comput. Syst. 2013, 29, 1645–1660. [Google Scholar] [CrossRef]
Zanella, A.; Bui, N.; Castellani, A.; Vangelista, L.; Zorzi, M. Internet of Things for smart cities. IEEE Internet Things J. 2014, 1, 22–32. [Google Scholar] [CrossRef]
Al-Fuqaha, A.; Guizani, M.; Mohammadi, M.; Aledhari, M.; Ayyash, M. Internet of Things: A survey on enabling technologies, protocols, and applications. IEEE Commun. Surv. Tutor. 2015, 17, 2347–2376. [Google Scholar] [CrossRef]
Kalør, A.E.; Popovski, P. Timely monitoring of dynamic sources with observations from multiple wireless sensors. arXiv 2020, arXiv:2012.12179. [Google Scholar] [CrossRef]
Tsai, C.-W.; Lai, C.-F.; Chiang, M.-C.; Yang, L.T. Data mining for Internet of Things: A survey. IEEE Commun. Surv. Tutor. 2014, 16, 77–97. [Google Scholar] [CrossRef]
Kaul, S.; Yates, R.; Gruteser, M. Real-time status: How often should one update? In Proceedings of the 31st Annual IEEE International Conference on Computer Communications (IEEE INFOCOM 2012), Orlando, FL, USA, 25–30 March 2012; pp. 2731–2735. [Google Scholar]
Clerckx, B.; Zhang, R.; Schober, R.; Ng, D.W.K.; Kim, D.I.; Poor, H.V. Fundamentals of wireless information and power transfer: From RF energy harvester models to signal and system designs. IEEE J. Sel. Areas Commun. 2019, 37, 4–33. [Google Scholar] [CrossRef]
Bi, S.; Zeng, Y.; Zhang, R. Wireless powered communication networks: An overview. IEEE Wirel. Commun. 2016, 23, 10–18. [Google Scholar] [CrossRef]
Zhang, R.; Ho, C.K. MIMO broadcasting for simultaneous wireless information and power transfer. IEEE Trans. Wirel. Commun. 2013, 12, 1989–2001. [Google Scholar] [CrossRef]
Zhou, X.; Zhang, R.; Ho, C.K. Wireless information and power transfer: Architecture design and rate–energy tradeoff. IEEE Trans. Commun. 2013, 61, 4754–4767. [Google Scholar] [CrossRef]
Mozaffari, M.; Saad, W.; Bennis, M.; Nam, Y.H.; Debbah, M. A tutorial on UAVs for wireless networks: Applications, challenges, and open problems. IEEE Commun. Surv. Tutor. 2019, 21, 2334–2360. [Google Scholar] [CrossRef]
Zhan, C.; Zeng, Y. Energy-efficient data collection in UAV-enabled wireless sensor networks. IEEE Wirel. Commun. Lett. 2018, 7, 328–331. [Google Scholar] [CrossRef]
Wu, Q.; Zeng, Y.; Zhang, R. Joint trajectory and communication design for multi-UAV enabled wireless networks. IEEE Trans. Wirel. Commun. 2018, 17, 2109–2121. [Google Scholar] [CrossRef]
Ye, H.-T.; Kang, X.; Joung, J.; Liang, Y.-C. Optimization for full-duplex rotary-wing UAV-enabled wireless-powered IoT networks. IEEE Trans. Wirel. Commun. 2020, 19, 5057–5072. [Google Scholar] [CrossRef]
Zeng, Y.; Xu, J.; Zhang, R. Energy minimization for wireless communication with rotary-wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345. [Google Scholar] [CrossRef]
Xie, L.; Xu, J.; Zeng, Y. Common throughput maximization for UAV-enabled interference channel with wireless powered communications. IEEE Trans. Commun. 2020, 68, 3197–3212. [Google Scholar] [CrossRef]
Yu, Y.; Tang, J.; Huang, J.; Zhang, X.; So, D.K.C.; Wong, K.-K. Multi-objective optimization for UAV-assisted wireless powered IoT networks based on extended DDPG algorithm. IEEE Trans. Commun. 2021, 69, 6361–6373. [Google Scholar] [CrossRef]
Wan, S.; Lu, J.; Fan, P.; Letaief, K.B. Toward big data processing in IoT: Path planning and resource management of UAV base stations in mobile-edge computing systems. IEEE Internet Things J. 2020, 7, 3555–3572. [Google Scholar] [CrossRef]
Zhou, F.; Hu, R.Q.; Qian, Y. Wireless power transfer and data collection in wireless sensor networks. IEEE Internet Things J. 2020, 7, 14915–14927. [Google Scholar]
Jin, K.; Zhou, W. Wireless Laser Power Transmission: A Review of Recent Progress. IEEE Trans. Power Electron. 2019, 34, 3842–3859. [Google Scholar] [CrossRef]
Brown, W.C. The History of Wireless Power Transmission. Sol. Energy 1996, 56, 3–21. [Google Scholar] [CrossRef]
Pahlavan, S.; Shooshtari, M.; Jafarabadi Ashtiani, S. Star-Shaped Coils in the Transmitter Array for Receiver Rotation Tolerance in Free-Moving Wireless Power Transfer Applications. Energies 2022, 15, 8643. [Google Scholar] [CrossRef]
Lu, X.; Wang, P.; Niyato, D.; Kim, D.I.; Han, Z. Wireless Charging Technologies: Fundamentals, Standards, and Network Applications. IEEE Commun. Surv. Tutorials 2015, 18, 1413–1452. [Google Scholar] [CrossRef]
Long, Y.; Zhao, S.; Gong, S.; Gu, B.; Niyato, D.; Shen, X.S. AoI-aware sensing scheduling and trajectory optimization for multi-UAV-assisted wireless backscatter networks. IEEE Trans. Veh. Technol. 2024, 73, 15440–15455. [Google Scholar] [CrossRef]
Fathollahi, L.; Mohassel Feghhi, M.; Atashbar, M. Energy optimization for full-duplex wireless-powered IoT networks using rotary-wing UAV with multiple antennas. Comput. Commun. 2024, 215, 62–73. [Google Scholar] [CrossRef]
Li, Y.; Ding, H.; Yang, Z.; Li, B.; Liang, Z. Integrated trajectory optimization for UAV-enabled wireless powered MEC system with joint energy consumption and AoI minimization. Comput. Netw. 2024, 254, 110842. [Google Scholar] [CrossRef]
Liu, X.; Chai, Z.-Y.; Li, Y.-L.; Cheng, Y.-Y.; Zeng, Y. Multi-objective deep reinforcement learning for computation offloading in UAV-assisted multi-access edge computing. Inf. Sci. 2023, 642, 119154. [Google Scholar] [CrossRef]
Pham, Q.-V.; Le, L.B.; Huynh, V.-N.; Fang, F.; Hwang, W.-J. UAV communications in 5G and beyond: Recent advances and future trends. IEEE Internet Things J. 2021, 8, 11664–11684. [Google Scholar]
Liu, Y.; Zhang, S.; Cheng, N.; Shi, W.; Wang, J. Reinforcement learning-enhanced trajectory planning in UAV-enabled mobile edge computing. IEEE Trans. Veh. Technol. 2022, 71, 8770–8783. [Google Scholar]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Yang, Y.; Zhang, M.; Zhang, D.; Li, K. Deep reinforcement learning for task offloading in UAV-enabled edge computing networks. IEEE Internet Things J. 2020, 7, 11173–11185. [Google Scholar]
Sun, Y.; Sun, H.; Ding, M.; Poor, H.V. Age-optimal trajectory planning for UAV-assisted data collection. IEEE Trans. Commun. 2021, 69, 6570–6585. [Google Scholar]
Qu, Y.; Song, Y.; Zhang, H.; Cao, X.; Niyato, D. Multi-agent RL-based control for UAV swarm deployment in IoT coverage scenarios. Sensors 2021, 21, 460. [Google Scholar]
Li, M.; Lin, Y.; Fu, X.; Zhao, J. Deep reinforcement learning-based trajectory optimization for UAV-aided data collection. IEEE Trans. Mob. Comput. 2021, 20, 2226–2239. [Google Scholar]
Zeng, Y.; Zhang, R. Energy-efficient UAV communication with trajectory optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Konda, V.R.; Tsitsiklis, J.N. Actor–critic algorithms. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 1008–1014. [Google Scholar]
Hu, H.; Xiong, K.; Qu, G.; Ni, Q.; Fan, P.; Ben Letaief, K. AoI-minimal trajectory planning and data collection in UAV-assisted wireless powered IoT networks. IEEE Internet Things J. 2021, 8, 1211–1223. [Google Scholar] [CrossRef]
Wang, L.; Chen, M.; Sun, Y.; Zhang, R. UAV-aided wireless powered communication networks: Trajectory optimization and resource allocation for maximizing minimum throughput. IEEE Access 2019, 7, 134978–134991. [Google Scholar]

Figure 1. System model.

Figure 2. Framework of DW-MODDPG algorithm.

Figure 3. Training metrics comparison. (a) Accumulated reward. (b) Loss.

Figure 4. (A) Training curves tracking optimization objectives: (a) Sum Data Rate; (b) Total Harvested Energy; (c) Average Energy Consumption. (B) Training curves tracking optimization objectives: (a) Total number of DC devices; (b) Average data rate; (c) Average number of EH devices; (d) Average energy harvesting rate.

Figure 5. (A) Optimized objectives under different policies: (a) sum data rate; (b) total harvested energy; (c) average energy consumption. (B) Optimized results under different policies: (a) total number of DC devices; (b) average data rate; (c) average number of EH devices; (d) sum EH rate.

Figure 6. (A) Optimized objectives under different weight parameters: (a) sum data rate; (b) total harvested energy; (c) average flying energy consumption. (B) Optimized results under different weight parameters: (a) total number of DC devices; (b) average data rate; (c) average number of EH devices; (d) sum EH rate.

Table 1. Simulation parameters.

Parameter	Value
Bandwidth (B)	1 MHz
Noise power ( $σ_{n}^{2}$ )	−90 dBm
Reference channel power gain ( $γ_{0}$ )	−30 dB
Attenuation coefficient of NLoS link ( $μ$ )	0.2
Path loss exponent ( $\tilde{α}$ )	2.3
Parameters of LoS probability (a, b)	10, 0.6
Blade profile power ( $P_{0}$ )	79.86
Induced power ( $P_{i}$ )	88.63
Tip speed of rotor blade ( $U_{tip}$ )	120 m/s
Mean rotor induced velocity in hover ( $v_{0}$ )	4.03
Fuselage drag ratio ( $d_{0}$ )	0.6
Air density ( $ρ$ )	1.225 kg/m³
Rotor solidity (s)	0.05
Rotor disc area (A)	0.503 m²
Maximum output DC power ( $P_{limit}$ )	9.079 $μ$ W
Parameters of EH model (c, d)	47,083, 2.9 $μ$ W

Table 2. Network configurations.

Parameters	Values
Actor–Critic Network
Network structure for actor	[400, 300]
Network structure for critic	[400, 300]
Number of training episodes	1600
Learning rate for actor	$10^{- 3}$
Learning rate for critic	$10^{- 3}$
Reward discount factor	0.9
Replay memory size	8000
Batch size	64
Initial exploration variance	2.0
Final exploration variance	0.1
Soft target update parameter	0.001
Weight Scheduler Network
Hidden layer sizes	[64, 64]
Activation function	ReLU
Output dimension (weights)	3
Output activation	Softmax

Table 3. Comparison experiment parameters.

Name	Parameters
opt_joint	$ω_{d c} = 1.0, ω_{e h} = 1.0, ω_{e c} = 1.0$
opt_DC	$ω_{d c} = 1.0, ω_{e h} = 0.0, ω_{e c} = 0.0$
opt_EH	$ω_{d c} = 0.0, ω_{e h} = 1.0, ω_{e c} = 0.0$
opt_EC	$ω_{d c} = 0.0, ω_{e h} = 0.0, ω_{e c} = 1.0$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, X.; He, J.; Zhao, M. Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks. Information 2025, 16, 849. https://doi.org/10.3390/info16100849

AMA Style

Zhu X, He J, Zhao M. Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks. Information. 2025; 16(10):849. https://doi.org/10.3390/info16100849

Chicago/Turabian Style

Zhu, Xu, Junyu He, and Ming Zhao. 2025. "Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks" Information 16, no. 10: 849. https://doi.org/10.3390/info16100849

APA Style

Zhu, X., He, J., & Zhao, M. (2025). Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks. Information, 16(10), 849. https://doi.org/10.3390/info16100849

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Adaptive Multi-Objective Optimization for UAV-Assisted Wireless Powered IoT Networks

Abstract

1. Introduction

2. Related Work

2.1. UAV-Enabled Wireless Powered IoT Systems

2.2. Optimization Objectives and Trade-Offs

2.3. Reinforcement Learning for UAV Control

3. System Description and Problem Formulation

3.1. System Model

3.2. Problem Formulation

4. Preliminaries

5. Method

5.1. DW-MODDPG Algorithm

5.1.1. Network Architecture and Initialization

5.1.2. Vector-Valued Rewards and Adaptive Weighting

5.1.3. Training Procedure

5.1.4. Algorithmic Steps

6. Simulation Results and Discussion

6.1. Simulation Settings

6.2. Performance Comparison with Benchmark Policies

6.3. Effect of Weight Preferences on Control Policies

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI