You are currently viewing a new version of our website. To view the old version click .
Entropy
  • Article
  • Open Access

29 May 2025

Entropy-Based Age-Aware Scheduling Strategy for UAV-Assisted IoT Data Transmission

,
,
and
1
College of Communication Engineering, Army Engineering University of PLA, Nanjing 210000, China
2
Department of Information and Communication, Noncommissioned Officer Academy of PAP, Hangzhou 310000, China
3
Academy of Systems Engineering, Academy of Military Sciences, Beijing 100000, China
*
Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Task-Oriented Communications in Industrial IoT: Age of Information and Beyond

Abstract

This paper investigates data transmission in an Internet of Things (IoT) network, where multiple devices send environmental data to a remote base station through an unmanned aerial vehicle (UAV) relay. The UAV serves as an airborne intermediary that collects status information from distributed IoT devices (e.g., temperature readings in a real-time forest fire monitoring system) and forwards it to the base station. To capture the impact of data staleness, a novel Age of Information (AoI) and entropy-aware system loss is defined in terms of L-conditional cross-entropy, which quantifies the expected penalty caused by state misestimation. The scheduling problem, which aims to minimize the system loss defined by L-conditional cross-entropy, is formulated as a Restless Multi-Armed Bandit (RMAB) problem. By applying Lagrange relaxation, the objective function is decomposed into tractable sub-problems, enabling a low-complexity, gain-index-based scheduling strategy. Numerical simulations validate the effectiveness of the proposed algorithm in reducing the long-term average system loss. In particular, the gain-index-based policy achieves a significant reduction in average penalty compared to random, round-robin, periodic update, and MAX-AoI scheduling strategies, demonstrating its superior performance over these baselines.

1. Introduction

Since its inception, the Internet of Things (IoT) has garnered significant attention and found widespread applications across diverse domains, including industrial manufacturing, intelligent transportation, healthcare, environmental monitoring, and military security [1,2]. Architecturally, an IoT system typically comprises three core components: IoT devices, transmission networks, and base stations (BS). IoT devices, often referred to as sources, are frequently deployed in remote and challenging environments for the purpose of capturing physical attributes of their surroundings, such as temperature, humidity, and pollution levels. However, inherent limitations in energy and transmission power impede these devices’ ability to establish long-range communication. To address this constraint and enhance data transmission efficiency in ground-based IoT systems, Unmanned Aerial Vehicles (UAVs) have been adopted as aerial relays, facilitating the conveyance of sensed data from geographically dispersed IoT devices to centralized ground base stations [3]. At the base station, the received data, which may include parameters like temperature, humidity, or carbon monoxide concentration in the context of a forest fire monitoring system, undergoes processing to extract meaningful insights that inform decision-making processes. Critically, the accuracy and timeliness of such decisions heavily depend on the freshness of the extracted information.
The seminal work by Kaul et al. [4] introduced the Age of Information (AoI) as a metric to quantify the freshness of information in communication networks. Since then, AoI has received extensive attention across various network paradigms, including remote monitoring, edge computing, and time-sensitive applications [5,6,7,8,9,10,11,12,13,14,15]. Several studies [6,9,11] have focused on minimizing the average AoI across all IoT nodes in UAV-assisted data collection systems. These studies have demonstrated that carefully designed UAV trajectory optimization strategies—such as the selection of optimal hovering locations and the determination of efficient access sequences—can significantly reduce the overall AoI in UAV-assisted IoT systems. In scenarios involving unknown traffic patterns, Markov Decision Process (MDP) models have been adopted to guide scheduling policies that reduce long-term AoI [10]. In addition to trajectory optimization, numerous centralized scheduling strategies have been proposed to coordinate UAV-based data collection. For instance, the UAV adopts the MAF-MAD strategy, applying Maximum AoI First for sampling IoT devices (Hop 1) and Maximum AoI Difference for forwarding data to the base station (Hop 2), thereby prioritizing updates with the highest freshness gap [12]. Building upon this, [13] proposed an airborne cooperative relaying and AoI-sensitive data collection (ADC) framework, which iteratively optimizes UAV flight paths using a generalized AoI expectation function. The problem has also been extended to multi-UAV scenarios, where joint optimization of trajectories and scheduling is required to minimize system-wide AoI. For example, [14] formulates a multi-stage stochastic optimization problem that jointly manages UAV data buffers and AoI states through queueing models, and proposes an adaptive, age-aware scheme for multi-UAV deployments. Recent research on UAV-assisted IoT systems has largely focused on minimizing freshness metrics such as the AoI. These efforts reflect a growing recognition that simply reducing AoI is insufficient to capture the true value of the monitored data, especially in UAV-assisted IoT systems where timely and informative updates are critical for accurate decision-making.

3. System Model and Methodology

3.1. UAV-Relayed IoT Remote Monitoring System

This paper considers a pull-based update mechanism, in which the GBS only requests status updates from IoT devices through a UAV relay when it is uncertain about the system state. Under this mechanism, we consider the status updating system illustrated in Figure 1, where a UAV is deployed as a fixed aerial base station, providing end-to-end wireless coverage between the ground base station (GBS) and IoT devices located within the communication range of the UAV relay. Due to communication limitations—such as the restricted transmission power of IoT devices, signal attenuation caused by physical obstacles (e.g., dense vegetation, buildings, and terrain irregularities), and the limited communication range inherent in low-power wireless technologies (such as ZigBee or Wi-Fi)—direct and reliable communication between IoT devices and the ground base station (GBS) is often challenging or infeasible. For example, in forest fire monitoring scenarios, dense vegetation and uneven terrain significantly weaken signals from devices, making it difficult for them to reliably transmit data directly to distant ground stations. Therefore, the UAV relay acts as an intermediary, facilitating communication between the GBS and IoT devices. In the monitoring area, N IoT devices transmit crucial status updates to the UAV through M wireless channels.
Figure 1. UAV-relayed IoT device status updating system.
Let T denote the total observation duration, which is divided into multiple time slots of equal length, denoted by t = 0 , 1 , 2 T 1 . Each device n N generates Markovian state information X n , t X for the remote monitoring system. X n , t may represent information related to forest fire status, such as temperature, humidity, smoke concentration, and atmospheric quality. We use to quantify the hazard level of the fire-monitoring system, for example, Normal, Alert, or Fire, where Y n , t is a function of the system state X n , t . In response to a pull request, each device n generates and submits a time-stamped updating message ( X n , t , t ) to one wireless channel. We assume that transmitting an update message to the receiver requires one time slot. However, the transmission of status updates is unreliable due to the inherent fading characteristics of wireless channels. Let l n denote the probability of successful transmission from device n to the UAV, and let l r denote the probability of successful transmission from the UAV to the GBS, which is independent of the chosen wireless channel.
We focus on the AoI at the GBS, denoted by Δ t , which is defined as the time elapsed since the generation of the most recently received status update. Due to the unreliability of the channel state, the data received by the GBS will be stale. We represent this data by X n , t - Δ t , which was generated Δ t time slots ago. At each time slot t , the information age evolution of the status update of IoT device n is given by the following:
Δ n t + 1 = Δ n t + 1   with   probability   1 - l n × l r   1                                 with   probability   l n × l r ,
Specifically, at each time slot t , if the transmission from IoT device n to the UAV and from the UAV to the GBS are both successful (with a combined probability l n × l r ), the AoI for device n resets to 1. Otherwise, the AoI Δ t increments by 1, indicating that the received information is becoming older. This reflects how communication reliability directly impacts information freshness at the base station.

3.2. Loss Model for UAV-Relayed IoT Remote Monitoring System

Consider the UAV-relayed IoT remote monitoring system shown in Figure 1, which provides real-time hazard level estimations through monitoring data provided by various IoT devices. For example, in a fire detection system, the latest state information of IoT device n is X n , t Δ t , and the system estimates the hazard level as o = ϕ X n , t Δ t , Δ t O , where ϕ is a function of X n , t Δ t and Δ t . Inherently, errors exist in estimating the hazard level associated with the monitored area’s state. These estimation errors are characterized by a loss function L Y n , t , o , which quantifies the difference between the estimated and actual values, providing a measure of model performance.
Building upon the concept of remote monitoring, consider a fire detection system capable of assessing fire risk in specific areas. The system uses sensor data to track the fire status in real-time. However, it is important to note that the estimated state produced by the monitoring system may not always match the actual fire state. This misestimation could have significant implications, such as delayed responses or unnecessary actions based on inaccurate fire detection.
The loss function values, as shown in Table 1, are used to quantify the errors in the system’s state estimations. For instance, if the system estimates the state hazard level as Normal when there is a Fire (i.e., the true state is Fire, and the estimated state is Normal), the loss function L F , N = 200 , indicating a significant misclassification that could lead to disastrous consequences.
Table 1. Example of loss values.
It is important to note that the loss values in Table 1 do not represent actual economic losses. Instead, they serve as theoretical penalty terms defined for modeling purposes, aiming to reflect the relative severity of different misclassification outcomes within the system.
To reflect the real-world costs of different types of misclassifications (e.g., falsely estimating a Fire state as Normal), a penalty structure is designed in which the severity of each error is encoded using asymmetric values. For example, minor misjudgments (e.g., mistaking Alert for Normal) are assigned a penalty of 50, while severe misclassifications (e.g., predicting Fire as Normal) are assigned 300 or more, reflecting at least a 6-fold increase in consequence. These values are not intended to represent exact monetary losses, but to capture the relative severity of different errors.

3.3. Information-Theoretic Metric for UAV-Relayed IoT Remote Monitoring System

In a UAV-relayed IoT remote monitoring system, the estimated loss in system performance resulting from an IoT device’s action in a given state X n , t Δ t , Δ t at time slot t is defined by the following penalty function:
p n x , δ = E L Y n , t , o X n , t Δ t = x , Δ t = δ ,
where o = ϕ X n , t Δ t , Δ t O is the estimated output of the state hazard level, ϕ is any function that maps from X × to O . Consider the optimization task described below:
min o O E L Y n , t , o X n , t Δ t = x , Δ t = δ ,
Let ϕ be the optimal estimator solving the estimation problem in Equation (3). Then, we denote its output using the following
o = ϕ X n , t Δ t , Δ t ,
After substituting this optimal estimator into Equation (2), we obtain the following:
p n x , δ = E L Y n , t , ϕ X n , t Δ t , Δ t X n , t Δ t = x , Δ t = δ ,
This equation represents the minimum achievable expected loss, meaning that for every estimator o O , the penalty cannot be lower than this bound. As stated in [24], the penalty function in Equation (5) is intrinsically linked to the concept of entropy. Let o = Y ^ n , t be the estimation of Y n , t . Subsequently, the L-cross entropy and L-conditional cross-entropy between Y n , t and Y ^ n , t are defined.
H L Y n , t ; Y ^ n , t = E L Y n , t , o ,
Equation (6) defines the L-cross entropy, which means the penalty caused by the difference between Y n , t and Y ^ n , t
H L Y n , t ; Y ^ n , t X n , t Δ t = x , Δ t = δ = E L Y n , t , o X n , t Δ t = x , Δ t = δ ,
Intuitively, the L-conditional cross-entropy in Equation (7) quantifies the expected loss caused by estimation errors under stale information, especially when the loss function is asymmetric and task-specific. Equation (7) formally defines this quantity, measuring the anticipated loss due to the divergence between Y n , t and Y ^ n , t , conditional on the observed variables X n , t Δ t and Δ t . Minimizing H L directly encourages the estimated state to align more closely with the true state, as the optimal estimator forces Y ^ n , t Y n , t .
Consequently, this reduction in penalty due to estimation errors manifests as a lower entropy value, which improves the estimation accuracy of remote monitoring systems by suppressing error propagation caused by suboptimal state estimation.

3.4. L-Conditional Cross-Entropy and Its Lower Bounds

Based on the L-conditional cross-entropy definition, Equation (7) formulates the expected loss measure conditioned on a given estimator o O . Minimizing this quantity is more challenging than directly computing the L-conditional entropy. Using Gibbs’ inequality [25] and Jensen’s Inequality [26], the L-conditional entropy is proven to be the theoretical lower bound, as formally expressed by:
H L Y n , t X n , t Δ t = x , Δ t = δ H L Y n , t ; Y ^ n , t X n , t Δ t = x , Δ t = δ .
To enforce this bound, we reformulate the original optimization problem (3) which minimizes the penalty in Equation (2). By integrating the theoretical insights from Equation (8), the original problem is transformed into the following formulation:
H L Y n , t X n , t Δ t = x , Δ t = δ = min o O H L Y n , t ; Y ^ n , t X n , t Δ t = x , Δ t = δ                                                                                                             = min o O E L Y n , t , o X n , t Δ t = x , Δ t = δ                                                                                                   = E L Y n , t , o X n , t Δ t = x , Δ t = δ ,
From Equations (5) and (9), it is evident that
p n x , δ = H L Y n , t X n , t Δ t = x , Δ t = δ .
For the optimal estimator o , the expected loss is indeed the L-conditional entropy, which is an information-theoretic lower bound of p n x , δ . This bound defines the fundamental performance limit by quantifying the performance degradation caused by incomplete status awareness. Notably, existing metrics proposed in prior studies—such as AoII, UoI and AoS—fail to characterize this information-theoretic limit.

3.5. Penalty Function and AoI

As previously stated, we consider a remote real-time fire monitoring system, where the fire hazard level is classified into three categories: Normal, Alert, or Fire. The monitoring area is divided into grids based on three levels, and N IoT devices are deployed accordingly. In this structure, the grid center at position (3,3) is designated as Fire, the immediate surrounding cells form the Alert region, and the outermost layer constitutes the Normal region. Upon each pull request, device n reports a state vector X n , t , which captures the spatial distribution trend of local temperature, particularly under delayed updates. We model temperature evolution as a stochastic process, where the state remains unchanged with probability 0.7 and transitions (e.g., rising or falling) occur with probability 0.3. The resulting spatial patterns under delayed updates are illustrated in Figure 2a. Our analytical results demonstrate that the penalty function p n x , δ exhibits non-monotonic behavior with respect to the Age of Information with different initial states, as shown in Figure 2b.
Figure 2. (a) Grids based on different fire hazard levels; (b) Penalty vs. AoI for different initial states.
From Figure 2b, we observe that regions closer to the Fire area exhibit a more rapid increase in penalty due to the higher loss value associated with their state transitions in subsequent time slots. Moreover, the penalty curve does not necessarily vary monotonically with increasing AoI. This non-monotonic behavior arises because the penalty reflects the expected loss under information staleness—not just the AoI itself. As such, commonly used metrics like AoI, AoII, and UoI fail to fully capture this behavior.

3.6. Problem Formulation

Due to transmission errors and limited communication resources, device updates may not be timely, leading to misunderstandings of the current situation. To minimize the penalty, we study a pull-based multi-channel transmission scheduling problem, where a central scheduler actively selects which devices to query at each time slot in order to minimize prediction loss. Unlike push-based approaches, where devices autonomously transmit their updates regardless of system needs, the pull-based strategy allows the scheduler to selectively request updates based on each device’s current information age or potential impact on system performance. This helps prioritize critical updates and reduce redundant communication overhead. The scheduling policy is denoted by π = π 1 , π 2 , π N , where π n = u n 0 , u n 1 , determines whether status information from device n is scheduled for transmission at each time slot t . Let Π be the set of all scheduling policies and π Π . Since status updates from the N IoT devices are transmitted over M shared channels, the constraint n = 1 N u n t M must be satisfied.
Our objective is to identify an optimal scheduling policy that minimizes the total expected loss across all N IoT devices over a finite time horizon T . Accordingly, the time-averaged scheduling optimization problem can be formulated as follows:
p ¯ o p t = inf π Π lim sup T n = 1 N E π 1 T t = 0 T 1 ω n p n X n , t Δ t , Δ t ,
s . t . n = 1 N u n t M , u n t 0 , 1 , t ,
where p n X n , t Δ t , Δ t is the penalty incurred by device n at slot t as defined in Equation (2), ω n is the weight value of device n , and p ¯ o p t represents the optimal value of problem (11). Equation (12) specifies the scheduling constraints, ensuring that at most M devices are selected for transmission at each time slot t .

4. The Scheduling Problem and Policy

Due to the channel resource constraint in problem (11)–(12), when N > M , not all devices can transmit updates simultaneously in each time slot. To address this, we develop an efficient scheduling policy that minimizes the time-averaged total expected penalty across all N devices while strictly satisfying constraint (12). The detailed design is presented in the following section.

4.1. RMAB Formulation

The scheduling problem can be formulated as an RMAB. Specifically, each device n is an arm and X n , t Δ t , Δ t is the state of each arm. At every time step, we need to allocate the communication channels to M out of N arms in a way that minimizes the sum expected penalty of all arms. The formal definition of this RMAB problem is presented below. Let u n t 0 , 1 denote the action applied to device n at time t , where u n t = 1 means that n is selected to update in slot t , and u n t = 0 otherwise. The RMAB problem is generally intractable to solve optimally. While the Whittle index policy is known to be asymptotically optimal for many RMAB problems [27], it requires the satisfaction of a complex condition known as indexability. In this paper, we adopt a low-complexity gain-index policy that avoids such requirements.

4.2. Relaxation and Decomposition

Applying the conventional relaxation and Lagrange decomposition approach for RMAB problems [28], we apply the method of Lagrange multipliers to decouple the problem into N single-device sub-problems. The scheduling problem (11)–(12) is then relaxed as follows:
p ¯ o p t = inf π Π lim sup T n = 1 N E π 1 T t = 0 T 1 ω n p n X n , t Δ t , Δ t ,
s . t . lim sup T n = 1 N E π 1 T t = 0 T 1 u n t M ,
Clearly, constraint (14) provides a relaxed form of the original constraint (12), requiring only average compliance over time, in contrast to Equation (12), which must be satisfied at every time slot t . Based on the relaxation in Equation (14), we employ the Lagrange multiplier method [29] to convert the problem into an unconstrained form by introducing a dual variable λ :
sup λ 0 J λ ,
J λ = inf π Π lim sup T n = 1 N E π 1 T t = 0 T 1 ω n p n X n , t Δ t , Δ t + λ u n t λ M .
Equations (15) and (16) define the dual function J λ , derived by relaxing the original constrained problem using the Lagrange method. Here, the constraint in Equation (12) is incorporated into the objective using a dual variable λ , resulting in a penalty term λ M . For a fixed λ , the term λ M is a constant and independent of the scheduling policy π . Thus, the problem in Equation (15) can be decoupled into N independent sub-problems, where each sub-problem corresponding to device n is formulated as follows:
J n λ = inf π n Π n lim sup T E π n 1 T t = 0 T 1 ω n p n X n , t Δ t , Δ t + λ u n t .
where π n = u n 0 , u n 1 , denotes a sub-scheduling policy for devices n , and Π π represents the set of all admissible policies for the sub-problems J n λ .

4.3. The Gain-Index Policy

To address the decoupled sub-problems described in Equation (17), we propose a low-complexity scheduling policy based on gain indices. The derivation of the gain index follows the index policy construction framework introduced in [30]. Each sub-problem can be modeled as an MDP, where the state is defined as X n , t Δ t = x , Δ t = δ . The Bellman optimality equation under the average cost criterion for the MDP in Equation (17) is given by the following:
V n x , δ , λ + J n λ = min Q n x , δ , λ , u ,
where V n x , δ , λ is the relative-value function and Q n x , δ , λ , u is the action value function defined as follows:
Q n x , δ , λ , u = ω n p n x , δ + λ u + E V n x , δ , λ x , δ , u ,
It can also be written as the following
Q n x , δ , λ , u = ω n p n x , δ + V n x , δ + 1 , λ , u = 0 ω n p n x , δ + λ + 1 β n V n x , δ + 1 , λ + β n E V n X n , t 1 , 1 , λ X n , t Δ t = x , Δ t = 1 , u = 1
where β n = l n × l r . In the context that λ is fixed, we may also simply write the value function as V n x , δ .
We propose the following index policy for the original RMAB problem.
Definition 1.
(Gain Index Policy): For each device  n N  , we define a gain index as follows:
g n x , δ , λ = Q n x , δ , λ , 1 Q n x , δ , λ , 0 ,
In each time slot  t  , the gain index  g n  is computed for each device  n . Based on these indices, the scheduling policy selects the top  M  devices with the highest gain values to transmit their status updates.
Substituting the definitions from Equation (20), we obtain the following:
g n x , δ , λ = λ β n V n x , δ + 1 , λ + β n E V n X n , t 1 , 1 , λ X n , t Δ t = x , Δ t = 1 ,
Then, we can obtain the optimal decision for all devices at time t as follows:
u n t = arg max g n x , δ , λ ,
We can use a dual subgradient method to solve this problem. For λ , we define the following iteration:
λ t + 1 = max λ t + α t n = 1 N u n t M , 0 ,
Let λ 0 = 0 , λ be the optimal solution to Equation (15), and let α t be the step size. The iterative process stops when the condition λ t + 1 λ t < ε is satisfied.
Based on Definition 1, we develop a computationally efficient algorithm to solve the relaxed optimization problem defined in Equation (15). The core idea of this approach is the introduction of a gain index value, which serves as a decision metric to guide the selection of scheduling actions u n t for each IoT device. The complete procedure is presented in Algorithm 1 in the form of pseudocode, which outlines the step-by-step execution of the proposed scheduling strategy.
Algorithm 1 The Gain Index Policy
1: Input:
2:          - Optimal dual variable λ *
3:          - Channel count M , devices N , stepsize α t
4:          - Device states ( X n , t Δ n t , Δ n t ) for all n N
5: Output:
6:          - Scheduling policy u n t 0 , 1 for all n N
7: Initialization:
8:          -  λ 0 = 0
9: Online Scheduling (for each time slot  t ):
10: for each device n = 1 to N  do
11:          Update current state ( X n , t Δ n t , Δ n t )
12:          Compute gain index g n x , δ , λ = Q n x , δ , λ , 1 Q n x , δ , λ , 0 use Equations (18)–(22):
13:          Select top M devices with largest positive gain index values use Equation (23)
14:          Set u n t = 1 for selected devices, 0 otherwise>
15:          Update λ t + 1 use Equation (24)
16: end for
In dual-based optimization, it is typically difficult to establish pointwise convergence of the dual iterates or the corresponding function values, especially in the absence of strong convexity or uniqueness. Instead, it is standard to focus on the time-averaged convergence, which can be rigorously derived using projection subgradient analysis. This part of the proof is provided in Appendix A.

5. Results

This section presents simulation results to demonstrate the performance of the proposed gain-index policy. We compare the gain index policy, randomized policy (if M channel resources are available, this policy randomly selects at most M devices.), round-robin policy (schedules the devices in sequence according to certain orders and rules), periodic updating (each sensor generates status updates at every time slot and stores them in a first-in-first-out (FIFO) queue. When a communication channel becomes available, the sensor transmits the earliest pending update from the queue), and MAX-AoI (at each time slot t , the MAX-AoI strategy selects the M sensors with the largest Age of Information (AoI) values for transmission).
In the numerical simulation experiment, we consider the square grid area in Figure 2, where the center of the grid is the fire point and the fire spreads outwards. Deployed IoT devices monitor environmental parameters and generate status updates when requested. During observation horizon T , communications occur through a two-hop network of M = 5 independent and identically distributed erasure channels, where the first hop involves device-to-UAV transmission with success probability l n = 0.95 , and the second hop (UAV-to-GBS) has a transmission success probability l r = 0.95 . In the iterative process of solving λ , we assume that ε = 10 3 , δ max = T .
The performance comparison of the three scheduling policies is presented in Figure 3. The normalized average penalty shown in the figure is computed by dividing the time-averaged system cost by the number of devices. Numerical results indicate that the proposed gain-index-based scheduling policy achieves a notable reduction in average penalty compared to the baseline strategies. In particular, it outperforms the random scheduling policy by 20.33% and the round-robin policy by 1.48%, in terms of long-term average system penalty. Moreover, the gain-index strategy achieves a 17.27% reduction in average penalty compared to the periodic update policy, and a 0.99% reduction compared to the MAX-AoI strategy. These results demonstrate the effectiveness of the proposed method in improving scheduling efficiency under various baseline settings.
Figure 3. (a) Time-Average penalty; (b) Time-Average penalty presented as a bar chart.
As illustrated in Figure 4, the three strategies demonstrate nearly identical performance when N M , where the number of devices matches available channels, eliminating competition and enabling full scheduling capacity. However, as N increases, the performance of the random scheduling policy degrades significantly, since it selects at most M devices randomly for status updates, regardless of their urgency. In contrast, the gain-index-based policy makes more informed scheduling decisions by jointly considering both the freshness of information and the observed environmental state. On average, the gain-index strategy achieves a 10.75% improvement over random scheduling and a 5.71% improvement over round-robin scheduling. Compared to periodic updating, the gain-index strategy reduces the average penalty by 12.58%, and it further achieves a 5.29% reduction compared to the MAX-AoI policy. These results demonstrate the superior scalability and robustness of the proposed gain-index policy.
Figure 4. Penalty with number of devices.
Considering the weight of the devices, the sum of the weights ω n is normalized to 1. Some devices are assigned weights greater than 0, reflecting higher importance, while others may have weights closer to 0. In the weighted gain strategy, each device’s scheduling priority is determined by the product ω n g n x , δ , λ , where g n x , δ , λ is the original gain index value that reflects the expected benefit of updating the device. This approach allows the scheduler to favor high-impact updates and leads to a consistent reduction in overall system penalty.
Figure 5 illustrates the evolution of system penalty over time under two scheduling strategies: weighted gain and unweighted gain, across T = 50 time slots. The weighted gain strategy consistently achieves a lower system penalty than the unweighted gain strategy, with the performance gap especially evident in the early stages. On average, it reduces the penalty by 11.96% by prioritizing updates from high-impact devices, thereby improving information freshness and reducing decision errors in dynamic environments.
Figure 5. Weighted Gain vs. Unweighted Gain.

6. Limitations and Future Work

While the proposed entropy-based scheduling framework demonstrates significant improvements in minimizing system estimation loss, several limitations warrant further exploration. The current model assumes a static UAV position and does not consider trajectory optimization, which could further enhance performance in dynamic environments. In addition, the loss function relies on predefined misclassification costs that may not fully reflect real-world operational risks. The framework also adopts a centralized scheduling scheme with perfect knowledge assumptions, which may limit scalability in large-scale or partially observable IoT systems. Future work will focus on extending the framework to support joint UAV trajectory and scheduling optimization, learning-based estimation of penalties, and scalable distributed implementations. Furthermore, integrating the proposed approach with emerging paradigms such as integrated sensing and communication (ISAC) offers a promising direction for jointly optimizing sensing quality and information freshness under resource constraints.

7. Conclusions

In this paper, we proposed an L-conditional cross-entropy-based, age-aware scheduling strategy for UAV-assisted IoT data transmission. By incorporating the L-conditional cross-entropy between the true and estimated status as the penalty metric, our approach effectively captures the estimation degradation caused by outdated information, especially under asymmetric loss costs. We modeled the scheduling problem using a Restless Multi-Armed Bandit (RMAB) framework and introduced a net-gain-based index policy. By applying Lagrange relaxation and dual gradient updates, we derived an efficient gain function that guides the UAV in selecting the most valuable sources to update at each time step. Unlike traditional Whittle index approaches, the proposed strategy does not require indexability conditions, making it more widely applicable to real-world IoT monitoring tasks with heterogeneous sensor dynamics and cost structures. Simulation results demonstrated that the proposed method significantly outperforms baseline scheduling strategies such as round-robin and random selection, particularly under challenging conditions involving limited bandwidth and critical monitoring tasks such as fire detection, where information freshness is crucial.

Author Contributions

L.J. led the research and was responsible for writing the Introduction, System Model, The Scheduling Problem and Policy, Results and Conclusions sections. H.W., Z.Q. and P.Z. guided the methodology and experimental design, and contributed to revising and polishing this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62171465.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of Convergence of Dual Problem

To solve the relaxed scheduling problem in (15), we apply a dual decomposition approach with a Lagrange multiplier λ and iteratively update it using the following stochastic rule
λ t + 1 = max λ t + α t n = 1 N u n t M , 0 = 0 , λ t + α t n = 1 N u n t M ,
where α t = c t , c is a constant, u n t 0 , 1 is the scheduling decision for device n at time t , and M is the total number of available communication channels. 0 , defines the projection operator onto the non-negative orthant, λ t is the dual variable associated with the long-term average constraint, and λ is the optimal solution to the dual problem (15).
In what follows, we focus on establishing the convergence of the dual subgradient method applied to the dual optimization problem. Specifically, we analyze the sequence of dual variables generated by projected subgradient updates and study the behavior of the corresponding dual function values. By leveraging the concavity of the dual function and the basic iterate inequality from Lemma 2 in [31], we aim to demonstrate that the time-averaged dual function values asymptotically converge to the optimal dual value. As mentioned in the previous text, the dual function is defined as:
J λ = inf π Π L π , λ ,
where
L π , λ = lim sup T n = 1 N E π 1 T t = 0 T 1 ω n p n X n , t Δ t , Δ t + λ u n t λ M ,
This is the pointwise infimum of a family of affine functions in λ . Hence, J λ is concave, and J λ is convex [32,33]. Let
g t = n = 1 N u n t M .
Since this quantity is the derivative of L π , λ with respect to λ , evaluated at a fixed policy π under λ t , we have the following:
E g t J λ t .
That is, the expected value of g t belongs to the subdifferential of the dual function at λ t . Therefore, g t is an unbiased estimator of a subgradient of J λ t . Since each u n t 0 , 1 , and at most M devices can be scheduled, we have the following:
g t = n = 1 N u n t M g t M , N M g t max M , N M G ,
Therefore g t is uniformly bounded, E g t 2 < .
We choose the step-size α t = c t , c = 1 , which satisfies the following:
t = 1 1 t = t = 1 1 t 2 < ,
According to [31], for any λ > 0 , we have the following:
λ t + 1 - λ 2 λ t λ 2 2 α t J λ J λ t + α t 2 G 2 .
Sum the expressions as follows:
t = 0 T 1 λ t + 1 - λ 2 t = 0 T 1 λ t λ 2 t = 0 T 1 2 α t J λ J λ t + t = 0 T 1 α t 2 G 2
Let λ = λ ,
t = 0 T 1 2 α t J λ J λ t λ 0 λ 2 λ T λ 2 + t = 0 T 1 α t 2 G 2 ,
t = 0 T 1 α t J λ J λ t λ 0 λ 2 + G 2 t = 0 T 1 α t 2 2 ,
Divide both sides by T α t ,
1 T t = 0 T 1 J λ t J λ B T T α t ,
where B T T α t = λ 0 λ 2 2 α t T + T G 2 α t 2 2 α T t , as T , B T T α t 0 and we obtain the following:
lim inf T 1 T t = 0 T 1 J λ t J λ ,
According to the dual problem (15), the feasible region of the dual variables is D = 0 , , and the optimal solution is λ = arg max λ D J λ t , therefore, we have J λ : = max λ D J λ t and J λ t J λ . By integrating Equation (A14), we derive the following:
lim T 1 T t = 0 T 1 J λ t = J λ ,
The above result establishes that the time-averaged dual function values generated by the projected subgradient method converge to the optimal dual value.

References

  1. ITU Internet Reports 2005: The Internet of Things. 2005. Available online: http://handle.itu.int/11.1002/pub/800eae6f-en (accessed on 1 November 2022).
  2. Shah, S.K.; Joshi, K.; Khantwal, S.; Bisht, Y.S.; Chander, H.; Gupta, A. IoT and WSN integration for data acquisition and supervisory control. In Proceedings of the 2022 IEEE World Conference on Applied Intelligence and Computing, Sonbhadra, India, 17–19 June 2022; pp. 513–516. [Google Scholar]
  3. Wang, X.; Yi, M.; Liu, J.; Zhang, Y.; Wang, M.; Bai, B. Cooperative Data Collection with Multiple UAVs for Information Freshness in the Internet of Things. IEEE Trans. Commun. 2023, 71, 2740–2755. [Google Scholar] [CrossRef]
  4. Kaul, S.; Yates, R.; Gruteser, M. Real-time status: How often should one update? In Proceedings of the 2012 Proceedings IEEE INFOCOM, Orlando, FL, USA, 25–30 March 2012; pp. 2731–2735. [Google Scholar]
  5. Qian, Z.; Wu, F.; Pan, J.; Srinivasan, K.; Shroff, N.B. Minimizing age of information in multi-channel time-sensitive information update systems. In Proceedings of the IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 446–455. [Google Scholar]
  6. Zhu, B.; Bedeer, E.; Nguyen, H.H.; Barton, R.; Gao, Z. UAV Trajectory Planning for AoI-Minimal Data Collection in UAV-Aided IoT Networks by Transformer. IEEE Trans. Wirel. Commun. 2023, 22, 1343–1358. [Google Scholar] [CrossRef]
  7. Liu, J.; Wang, X.; Bai, B.; Dai, H. Age-optimal trajectory planning for UAV-assisted data collection. In Proceedings of the IEEE INFOCOM 2018—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Honolulu, HI, USA, 15–19 April 2018; pp. 553–558. [Google Scholar]
  8. Mao, C.; Liu, J.; Xie, L. Multi-UAV Aided Data Collection for Age Minimization in Wireless Sensor Networks. In Proceedings of the 2020 International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 21–23 October 2020. [Google Scholar]
  9. Jia, Z.; Qin, X.; Wang, Z.; Liu, B. Age-Based Path Planning and Data Acquisition in UAV-Assisted IoT Networks. In Proceedings of the 2019 IEEE International Conference on Communications Workshops (ICC Workshops), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
  10. Zhang, J.; Kang, K.; Yang, M.; Zhu, H.; Qian, H. AoI-minimization in UAV-assisted IoT Network with Massive Devices. In Proceedings of the 2022 IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022; pp. 1290–1295. [Google Scholar]
  11. Emami, Y.; Li, K.; Niu, Y.; Tovar, E. AoI Minimization Using Multi-Agent Proximal Policy Optimization in UAVs-Assisted Sensor Networks. In Proceedings of the ICC 2023—IEEE International Conference on Communications, Rome, Italy, 28 May–1 June 2023; pp. 228–233. [Google Scholar]
  12. Choudhury, B.; Shah, V.K.; Ferdowsi, A.; Reed, J.H.; Hou, Y.T. AoI-minimizing Scheduling in UAV-relayed IoT Networks. In Proceedings of the 2021 IEEE 18th International Conference on Mobile Ad Hoc and Smart Systems (MASS), Denver, CO, USA, 4–7 October 2021; pp. 117–126. [Google Scholar]
  13. Huang, X.; Fu, X. Fresh Data Collection for UAV-Assisted IoT Based on Aerial Collaborative Relay. IEEE Sens. J. 2023, 23, 8810–8825. [Google Scholar] [CrossRef]
  14. Long, Y.; Zhang, W.; Gong, S.; Luo, X.; Niyato, D. AoI-aware Scheduling and Trajectory Optimization for Multi-UAV-assisted Wireless Networks. In Proceedings of the GLOBECOM 2022—2022 IEEE Global Communications Conference, Rio de Janeiro, Brazil, 4–8 December 2022; pp. 2163–2168. [Google Scholar]
  15. Song, Z.; Yang, T.; Wu, X.; Feng, H.; Hu, B. Regret of Age-of-Information Bandits in Nonstationary Wireless Networks. IEEE Wirel. Commun. Lett. 2022, 11, 2415–2419. [Google Scholar] [CrossRef]
  16. Maatouk, A.; Kriouile, S.; Assaad, M.; Ephremides, A. The Age of Incorrect Information: A New Performance Metric for Status Updates. IEEE/ACM Trans. Netw. 2020, 28, 2215–2228. [Google Scholar] [CrossRef]
  17. Ismail, C.; Ulukus, S.; Akar, N. Minimizing Functions of Age of Incorrect Information for Remote Estimation. arXiv 2025, arXiv:2504.10451. [Google Scholar]
  18. Zhong, J.; Yates, R.D.; Soljanin, E. Two Freshness Metrics for Local Cache Refresh. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 1924–1928. [Google Scholar]
  19. Zheng, X.; Zhou, S.; Niu, Z. Urgency of Information for Context-Aware Timely Status Updates in Remote Control Systems. IEEE Trans. Wirel. Commun. 2020, 19, 7237–7250. [Google Scholar] [CrossRef]
  20. Ornee, T.Z.; Shisher, M.K.C.; Kam, C.; Sun, Y. Context-aware Status Updating: Wireless Scheduling for Maximizing Situational Awareness in Safety-critical Systems. In Proceedings of the MILCOM 2023—2023 IEEE Military Communications Conference (MILCOM), Boston, MA, USA, 30 October–3 November 2023. [Google Scholar]
  21. Luo, J.; Pappas, N. Minimizing the Age of Missed and False Alarms in Remote Estimation of Markov Sources. In Proceedings of the Twenty-Fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc’24), Athens, Greece, 14–17 October 2024; pp. 381–386. [Google Scholar]
  22. Luo, J.; Nikolaos, P. On the Cost of Consecutive Estimation Error: Significance-Aware Non-linear Aging. arXiv 2024, arXiv:2410.03637. [Google Scholar]
  23. Erbayat, E.; Maatouk, A.; Zou, P.; Subramaniam, S. Age of Information Optimization and State Error Analysis for Correlated Multi-Process Multi-Sensor Systems. In Proceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc’24), Athens, Greece, 14–17 October 2024; pp. 331–340. [Google Scholar]
  24. Shisher, M.K.C.; Sun, Y. How does data freshness affect real-time supervised learning? In Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, Seoul, Republic of Korea, 17–20 October 2022; pp. 31–40. [Google Scholar]
  25. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  26. Jensen, J.L.W.V. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math. 1906, 30, 175–193. [Google Scholar] [CrossRef]
  27. Weber, R.R.; Weiss, G. On an index policy for restless bandits. J. Appl. Probab. 1990, 27, 637–648. [Google Scholar] [CrossRef]
  28. Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25, 287–298. [Google Scholar] [CrossRef]
  29. Bertsekas, D.P. Nonlinear programming. J. Oper. Res. Soc. 1997, 48, 334. [Google Scholar] [CrossRef]
  30. Chen, G.; Liew, S.C. An Index Policy for Minimizing the Uncertainty-of-Information of Markov Sources. IEEE Trans. Inf. Theory 2023, 70, 698–721. [Google Scholar] [CrossRef]
  31. Nedic, A.; Ozdaglar, A. Subgradient methods in network resource allocation: Rate analysis. Annu. Conf. Inf. Sci. Syst. 2008, 8, 1189–1194. [Google Scholar]
  32. Bertsekas, D. Convex Optimization Theory; Athena Scientific: Raleigh, NC, USA, 2009; Volume 1. [Google Scholar]
  33. Boyd, S. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.