Article

DRL-Based Scheduling for AoI Minimization in CR Networks with Perfect Sensing

1 School of Computer and Data Engineering, NingboTech University, Ningbo 315104, China
2 Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(8), 855; https://doi.org/10.3390/e27080855
Submission received: 6 May 2025 / Revised: 6 July 2025 / Accepted: 9 July 2025 / Published: 11 August 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Age of Information (AoI) is a recently introduced metric that quantifies the freshness and timeliness of data, playing a crucial role in applications reliant on time-sensitive information. Minimizing AoI through optimal scheduling is challenging, especially in energy-constrained Internet of Things (IoT) networks. In this work, we begin by analyzing a simplified cognitive radio network (CRN) in which a single secondary user (SU) harvests RF energy from the primary user (PU) and transmits status update packets when the PU spectrum is available. Time is divided into equal slots, and in each slot the SU performs energy harvesting, spectrum sensing, or status update transmission. To optimize the AoI within the CRN, we formulate the sequential decision-making process as a partially observable Markov decision process (POMDP) and employ dynamic programming to determine optimal actions. We then extend our investigation to minimizing the long-term average weighted sum of AoIs in a multi-SU CRN. Unlike the single-SU scenario, decisions must be made regarding which SU performs sensing and which SU forwards the status update packets. Given the partially observable nature of the PU spectrum, we propose an enhanced Deep Q-Network (DQN) algorithm. Simulation results demonstrate that the proposed policies significantly outperform the myopic policy. Additionally, we analyze the effect of various parameter settings on system performance.

1. Introduction

Over the past few decades, the burgeoning number of wireless devices and the rising need for wide-band services have caused a considerable depletion of the available licensed spectrum. A report by the Federal Communications Commission (FCC) reveals that the utilization rate of the licensed spectrum is below 10 percent at any specific time and location [1]. This significant spectrum underutilization has spurred the development of cognitive radio (CR), a technology designed to enable secondary users (SUs) to access primary users’ (PUs) licensed bands while guaranteeing quality of service (QoS) for PUs [2,3,4,5]. In cognitive radio networks (CRNs), two primary spectrum sharing strategies are commonly employed: underlay and overlay. In the underlay approach, both the primary and secondary networks simultaneously transmit data over the licensed spectrum, with interference from the secondary network constrained to a predefined threshold to ensure the protection of the primary network [6]. On the other hand, the overlay strategy allows the secondary network to transmit data only when the primary network is not using the spectrum. This approach protects the primary network by enabling the secondary network to dynamically detect an idle spectrum and opportunistically access it without causing interference [4].
Beyond the issue of spectrum scarcity, energy constraints also present a substantial challenge for the evolution of future wireless networks. Recently, the ability to scavenge energy from radio frequency (RF) signals has been recognized as a promising avenue for powering low-power wireless devices. By using RF energy-harvesting technology, wireless powered communication networks (WPCNs) offer a solution to the unpredictability and intermittency of traditional charging methods by harnessing energy from renewable sources [7,8,9,10,11]. With widespread attention to WPCNs, we have previously conducted some related work. In [12], we address the problem of maximizing the aggregate computation rate in a WPCN comprising multiple hybrid access points (HAPs) and IoT devices, where we introduce a deep reinforcement learning (DRL) algorithm for determining near-optimal offloading strategies and develop an efficient Lagrangian duality-based approach to derive the corresponding optimal time allocation. In [13], we investigate the application of physical layer security (PLS) as a mechanism for ensuring the privacy of the secondary network’s communication, where both the SU and jammer can harvest RF energy from the primary network transmissions. In [14], we focus on the problem of jointly optimizing the duration of wireless power transmission, the allocation of transmission time for individual edge devices, and the decision regarding partial task offloading to achieve the maximum sum computation rate.
To address the aforementioned challenges of spectrum scarcity and energy deficiency, RF energy harvesting in CRNs has gained significant attention. This approach enhances both energy and spectral efficiency, enabling battery-free SUs to simultaneously capture energy and spectrum from PUs [15]. Considering an SU that intermittently generates real-time messages with a delivery deadline, the authors in [16] derive the probability of the timely delivery of data packets.
In contrast to traditional CR systems, where the SU maintains time-slot synchronization with the PU, the authors in [17] explore an RF energy-harvesting CRN with an unslotted PU, focusing on sensing intervals to strike a balance between spectrum access and energy harvesting. However, refs. [16,17] examine simple CRNs with a single primary link and a single secondary link. On the other hand, refs. [18,19,20] investigate scenarios involving multiple SUs or multiple PUs. In [18], a multichannel selection strategy is proposed to maximize the average throughput of SUs in RF energy-harvesting CRNs. In [19], the authors focus on a hybrid energy-harvesting SU model, where the SU harvests energy from both solar and ambient RF signals. A convex framework is developed to maximize throughput by optimizing the sensing duration, active probability, and detection threshold for each SU. In [20], multi-hop transmission with a time division multiple access (TDMA) mechanism is employed by the SUs, where the authors address throughput maximization by jointly optimizing transmission power and time.
Existing research primarily focuses on the throughput and delay performance of SUs while considering the QoS requirements of PUs in RF energy-harvesting CRNs [16,17,18,19,20]. However, many emerging CRN applications, such as smart buildings, vehicle-to-vehicle networking, environmental monitoring, and health monitoring [21,22,23], demand the timely delivery of physical process status updates. For example, in environmental monitoring systems, sensor nodes must frequently collect and transmit critical parameters (e.g., temperature, humidity) to ensure accurate real-time tracking, highlighting the necessity of low-latency status updates. Age of Information (AoI), a newly proposed network performance metric, can be used to characterize the timeliness of status updates in RF energy-harvesting CRNs. The AoI metric tracks the temporal freshness of received status updates. It measures the duration from the generation of the latest successfully received update to the current time at the destination [24,25,26,27].
There have been some initial studies on the AoI of CRNs [28,29,30,31,32,33]. Ref. [28] investigates an interference-free interweave CR system, developing an optimized framing and scheduling approach that maximizes energy efficiency while adhering to AoI constraints. In [29], the authors conduct a comprehensive analysis of the average peak AoI, deriving asymptotic closed-form expressions for both underlay and overlay transmission schemes under ideal spectrum sensing conditions. The authors in [30] investigate an overlay CR system where SUs may either transmit their own data or serve as relays for PU transmissions. The study develops a constrained Markov decision process (CMDP) framework to optimize the joint status update and relaying strategy, simultaneously addressing AoI minimization and energy efficiency. The authors in [31] study AoI minimization for an energy-harvesting SU, where the SU can harvest energy from ambient energy sources. The optimal sensing and update decision problems are initially formulated as partially observable Markov decision process (POMDP) frameworks and then solved using dynamic programming. Following [31], the authors in [32] investigate AoI minimization for wireless energy-harvesting SUs in a CRN where multiple SUs and multiple PUs coexist. In each time slot, the SUs either harvest energy from the PU transmissions or deliver status updates when no primary receivers are present in the associated guard zone. The authors first derive the outage probability of the PUs and then propose a greedy policy for the SUs performing status updates. Considering the difficulty for PUs and SUs to achieve perfect time slot synchronization, the authors in [33] study the AoI minimization problem in a CRN with unslotted PUs and derive the closed-form expression for the average AoI by conducting a Markov chain analysis.
To the best of our knowledge, relatively little work has been conducted on AoI minimization in RF energy-harvesting CRNs. Motivated by this, our work focuses on optimizing the average AoI for an SU powered by RF energy harvesting from the PU's transmissions within a defined time frame. The optimization process rigorously considers both the energy causality constraint and the limitations imposed by spectrum availability. In this system model, the SU relies on energy scavenged from the PU to first perform spectrum sensing. Subsequently, if the PU vacates the spectrum, the SU further decides whether to transmit its status update data packet to the common base station (CBS). Additionally, we consider a broader scenario where multiple SUs, powered by the PU, are deployed to monitor various physical processes and send their status update data to a shared CBS. The goal is to minimize the long-term average weighted sum of AoI (sum-AoI), which represents the total of the AoI values of all SUs at the CBS. This research presents a significant advancement in the study of RF energy-harvesting CRNs by, for the first time, simultaneously exploring AoI minimization in both single-SU and multiple-SU scenarios. Furthermore, we innovatively employ a DQN approach to solve the POMDP problem, where only the PU spectrum belief is known.
It is worth noting that our paper significantly differs from [31,32,33,34,35]. In [31], the SUs harvest energy from the ambient environment, and the scenario of multiple SUs is not considered. In contrast, our problem setup involves SUs harvesting energy from PU transmissions, and we consider the presence of multiple SUs. Furthermore, sensing actions are not considered in [32], whereas we incorporate them. The scenario considered in [33] also differs from ours. In [33], the PU’s activity is characterized by its ability to randomly seize the channel and potentially interrupt ongoing SU transmissions mid-slot, leading to SU transmission failures. However, we consider PUs and SUs achieving perfect time slot synchronization. Finally, refs. [31,32,33] do not utilize DRL-based methods, while our approach is based on DRL. Refs. [34,35] use the DDPG algorithm within an overlay CRN, while we employ the DQN. The main contributions of this paper are summarized as follows.
  • We focus on minimizing the average AoI for a single RF energy-harvesting SU with a fixed time duration. The SU obtains its energy supply through harvesting energy from PU transmissions and is enabled to deliver status update data packets to the CBS only when the PU spectrum is identified as being in an idle condition. In each discrete time interval, the SU’s spectrum sensing and status update decisions are adaptively made considering its current energy reserves, AoI, channel link quality, and the availability of the PU spectrum. The decision-making problem under consideration is modeled using a POMDP with discrete state and action sets. The optimal policy for this model is then derived through the application of dynamic programming.
  • We extend the scenario to multiple SUs, where the objective is to minimize the long-term average weighted sum-AoI by making adaptive sensing and update decisions. We model this decision-making problem as a POMDP with finite state and action spaces. However, due to the computational challenges posed by the extreme curse of dimensionality in the state space of the POMDP, we propose an improved DQN approach to learn the optimal policy. This enhanced DQN approach is tailored to handle the POMDP problem, where the partially observable state is modeled as a Markov chain.
  • We validate through extensive simulations that the proposed policies substantially improve system performance compared to the myopic policy, and we analyze the impact of system parameter settings on performance.
The remaining part of this paper is organized as follows. Section 2 describes the system model for the RF energy-harvesting CRN with one SU. Section 3 formulates the finite-horizon AoI minimization problem for the single-SU scenario as a POMDP framework and solves it through dynamic programming. Section 4 presents the system model for RF energy-harvesting CRN with multiple SUs. Section 5 formulates the infinite-horizon AoI minimization problem for the multiple-SU scenario as a POMDP framework and solves it using a new DQN approach. Section 6 presents simulation results. Finally, Section 7 concludes the paper.

2. System Model for RF Energy-Harvesting CRN with One SU

Figure 1 illustrates the CRN under investigation, which comprises one SU, one PU, and a CBS receiving status update packets sent by the SU. The SU, equipped with a sensor, monitors a physical process and transmits status updates regarding its observations to the CBS. Lacking an internal power supply, it operates by scavenging energy from the PU's transmissions. Additionally, spectrum access is granted to it when the PU vacates the channel, enabling opportunistic operation. We adopt a discrete-time framework, dividing the operational period into T time slots, indexed by $t = 0, 1, \dots, T-1$. To simplify the analysis, we assume a unit duration of one second for each time slot.

2.1. Primary User Model

The PU is granted preferential access to the spectrum, and its channel occupancy dynamics are characterized by a two-state Markov chain comprising active (A) and idle (I) states [36,37]. During each time slot, the PU alternates between data transmission in the active state and silence in the idle state. We define the transition probabilities of the Markov chain as $p_{ii}$ and $p_{ai}$, with $p_{ii}$ indicating the persistence of the idle state and $p_{ai}$ quantifying the likelihood of a transition from the active to the idle state. For $t = 0, 1, \dots, T-1$, we have

$$p_{ii} = P(q_{t+1} = I \mid q_t = I),$$
$$p_{ai} = P(q_{t+1} = I \mid q_t = A).$$
The SU possesses prior knowledge of the transition probabilities, acquired through extended measurement periods.
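As a concrete illustration, the Python sketch below simulates this two-state occupancy chain; the transition probabilities match the values used later in Section 6, and the function names are ours.

```python
import random

# Transition probabilities of the PU occupancy chain; these match the
# values used in the simulations of Section 6.
P_II = 0.8  # P(idle -> idle)
P_AI = 0.5  # P(active -> idle)

def next_pu_state(current: str) -> str:
    """Sample the PU state q_{t+1} given the state q_t in {"A", "I"}."""
    p_idle = P_II if current == "I" else P_AI
    return "I" if random.random() < p_idle else "A"

# Simulate T slots of PU activity starting from the active state.
T = 20
state, trace = "A", []
for _ in range(T):
    state = next_pu_state(state)
    trace.append(state)
print("".join(trace))  # e.g., "AIAAIIIAII..."
```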

2.2. Secondary User Model

The SU maintains time-slot synchronization with the PU. At the start of each time slot, the SU decides whether to sense the spectrum. If the SU opts not to perform spectrum sensing, the entire time slot is dedicated to harvesting energy from the PU's transmissions, a process that is enabled only while the PU is active and ceases when the PU is idle. If the SU opts to perform spectrum sensing, a predetermined fraction of the time slot is assigned to this task. We assume perfect sensing outcomes for the SU [31,33]. This assumption is primarily driven by the need to isolate the core contributions of our proposed DQN algorithm regarding AoI minimization: the ideal scenario establishes a performance upper bound and allows us to evaluate the fundamental effectiveness of our DQN framework without the added complexity introduced by sensing errors. If the spectrum is sensed to be occupied by the PU, the SU harvests energy for the remainder of the time slot; otherwise, the SU further decides whether to update once the sensing action ends. The SU aims to minimize the average AoI by employing optimal spectrum sensing and status update policies throughout the entire operational period. The action executed in time slot t is denoted by $x_t = (\theta_t, \phi_t)$, where $\theta_t \in \{0\ (\text{not sense}), 1\ (\text{sense})\}$ denotes the SU's spectrum sensing decision and $\phi_t \in \{0\ (\text{not update}), 1\ (\text{update})\}$ denotes its status update decision. The SU's decisions are influenced by its state and the statistical knowledge of the PU's activity, as described in the following.
(1) Belief model: The SU tracks the availability of the PU's spectrum by adaptively detecting and opportunistically accessing the idle spectrum. Leveraging the SU's action and observation history, a belief state concerning the PU's spectrum activity is formulated. Specifically, at the beginning of each time slot, the SU constructs the belief state $\varrho_t$, which represents the conditional probability of the PU being in the idle state, given the SU's historical actions and observations.
(2) Channel model: The channel power gains, denoted by $h_t$ (SU-to-CBS) and $g_t$ (PU-to-SU) for time slot t, are modeled as independent and identically distributed (i.i.d.) random variables over time. The channels are modeled using the Rayleigh fading model and are assumed to experience quasi-static flat fading. That is, they exhibit temporal stability within individual time slots but undergo variations across successive time slots. Adhering to a well-established practice within the wireless communication community, the channel power gains during the current time slot are assumed to be perfectly known [38].
(3) Energy-harvesting model: The SU can harvest energy from the PU transmissions and employs the harvest-then-transmit protocol [39]: it first harvests energy from PU transmissions and subsequently uses the acquired energy to transmit status update packets to the CBS. In general, there are two scenarios for energy harvesting: (1) a non-sensing decision coincides with the PU's active transmission, and (2) a sensing operation confirms the PU's occupancy. Let τ represent the time allocated for energy harvesting, η the energy conversion efficiency, and P the PU's transmission power. The amount of energy acquired by the SU through harvesting is expressed as

$$E_{H,m}^t = \tau \eta P g_t, \quad t = 0, 1, \dots, T-1,\ m = 1, 2.$$
The two distinct values of m correspond to the two energy-harvesting scenarios. The harvested energy is used for both spectrum sensing operations and the subsequent transmission of status update packets via the wireless channel. We define δ and $\tau_s$ as the energy and time used for sensing, respectively. Meanwhile, we define $E_T^t$ as the energy consumption and $\tau_t$ as the time required for data transmission within time slot t. Let $\sigma^2$ and W represent the noise power at the CBS and the bandwidth, respectively. The noise is modeled as Gaussian white noise. According to Shannon's formula, $E_T^t$ is expressed as

$$E_T^t = \frac{\sigma^2 \tau_t}{h_t}\left(2^{\frac{S}{\tau_t W}} - 1\right),$$

where S is the size of the status update data packet (defined below).
The SU's battery capacity is $B_{\max}$. We define $b_t$ as the battery's state of charge in time slot t, which evolves as

$$b_{t+1} = \min\{b_t + E_{H,m}^t - \theta_t \delta - \phi_t E_T^t,\ B_{\max}\}, \quad t = 0, 1, \dots, T-1.$$
Here, $E_{H,m}^t$ is the energy captured by the SU, $\theta_t \delta$ the energy consumed for spectrum sensing, and $\phi_t E_T^t$ the energy consumed for data transmission, all within time slot t. Note that if the PU spectrum is occupied, the SU performs energy harvesting and the battery state increases by $E_{H,m}^t$; conversely, if the PU spectrum is idle and the SU decides to send a status update packet, the battery state decreases by $\phi_t E_T^t$. Therefore, the energy causality constraint is ensured by

$$\theta_t \delta + \phi_t E_T^t \le b_t, \quad t = 0, 1, \dots, T-1.$$
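The following sketch ties Equations (3)–(6) together in Python; all numeric constants are illustrative assumptions, not the paper's parameter values.

```python
# Illustrative constants (assumptions, not the paper's exact values):
ETA = 0.6      # energy conversion efficiency eta
P_PU = 2.0     # PU transmit power P (Watts)
SIGMA2 = 1e-9  # noise power sigma^2 at the CBS (Watts)
W = 1e6        # bandwidth (Hz)
S = 1e5        # status update packet size (bits)
DELTA = 1e-4   # sensing energy delta (Joules)
B_MAX = 5e-4   # battery capacity B_max (Joules)

def harvested_energy(tau: float, g: float) -> float:
    """Eq. (3): E_H = tau * eta * P * g for a harvesting duration tau."""
    return tau * ETA * P_PU * g

def transmit_energy(tau_t: float, h: float) -> float:
    """Eq. (4): energy needed to deliver S bits in time tau_t over gain h."""
    return (SIGMA2 * tau_t / h) * (2.0 ** (S / (tau_t * W)) - 1.0)

def battery_step(b: float, e_h: float, theta: int, phi: int, e_t: float) -> float:
    """Eq. (5): battery evolution, clipped at the capacity B_MAX."""
    # Eq. (6): the chosen action must respect energy causality.
    assert theta * DELTA + phi * e_t <= b, "energy causality violated"
    return min(b + e_h - theta * DELTA - phi * e_t, B_MAX)

# One slot in which the SU senses (theta=1) and then updates (phi=1).
e_t = transmit_energy(0.5, h=1e-3)
print(battery_step(2e-4, 0.0, theta=1, phi=1, e_t=e_t))
```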
(4) Age of information: Let $a_t$ denote the AoI in time slot t. The upper bound of the AoI is $A_{\max} = a_0 + T$, so that $a_t \in \mathcal{A} \triangleq \{1, 2, \dots, A_{\max}\}$. In every time slot in which an update decision is made, the SU generates and dispatches one status update packet, following the generate-at-will scheme [28,29,30,31,32,33]. The data packet size S is small enough that the packet can be generated immediately after the update decision and received by the end of the time slot. Following the successful reception of the update at the CBS, the AoI is reset to 1; otherwise, it increases by 1. The AoI evolves over time slots as

$$a_{t+1} = \begin{cases} 1, & \text{if } x_t = (1, 1), \\ a_t + 1, & \text{otherwise}. \end{cases}$$
Equation (7) posits an error-free channel, which ensures the successful reception of status update data packets at the CBS upon an update decision. The average AoI across T time slots is computed as

$$\bar{A} = \frac{1}{T} \sum_{t=0}^{T-1} a_t.$$
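A minimal sketch of the AoI dynamics in Equations (7) and (8), with an arbitrary illustrative action sequence:

```python
def aoi_step(a: int, theta: int, phi: int) -> int:
    """Eq. (7): the AoI resets to 1 on a successful update x_t = (1, 1)
    and grows by one slot otherwise (error-free channel assumed)."""
    return 1 if (theta, phi) == (1, 1) else a + 1

# Average AoI over T slots (Eq. (8)) for an illustrative action sequence
# in which the SU updates every fourth slot.
T, a, ages = 12, 1, []
for t in range(T):
    ages.append(a)
    action = (1, 1) if t % 4 == 3 else (0, 0)
    a = aoi_step(a, *action)
print(sum(ages) / T)
```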

3. Finite POMDP Formulation for RF Energy-Harvesting CRN with One SU

3.1. POMDP Formulation

To address the SU’s AoI minimization problem, the optimal sensing and update strategies are modeled using a POMDP. The elements of this POMDP are detailed below.
  • Actions: Initially, the SU determines whether to perform spectrum sensing. If it does not sense the spectrum, it harvests energy from the PU transmissions and does not deliver a status update data packet, i.e., $x_t = (0, 0)$. If it senses the spectrum and finds that the spectrum is occupied by the PU, it also cannot perform an update, i.e., $x_t = (1, 0)$. If it senses the spectrum and finds that the spectrum has been vacated by the PU, it further decides whether to deliver the status update data packet based on its AoI, the channel state from it to the CBS, the channel state from the PU to it, and its energy availability, i.e., $x_t = (1, 0)$ or $x_t = (1, 1)$. Consequently, the action in each time slot is defined as $x_t = (\theta_t, \phi_t) \in \mathcal{X} \triangleq \{(0,0), (1,0), (1,1) : b_t \ge \theta_t \delta + \phi_t E_T^t\}$, where $\theta_t \in \Gamma_\theta \triangleq \{0, 1 : b_t \ge \theta_t \delta\}$ and $\phi_t \in \Gamma_\phi \triangleq \{0, 1 : b_t \ge \delta + \phi_t E_T^t\}$.
  • Observations and beliefs: The PU's state is observed as $\hat{q}_t \in \{A, I\}$, while the belief $\varrho_t \in [0, 1]$ signifies the probability of spectrum availability. This belief is dynamically updated based on the sequence of past actions and observations, according to the transition function $\varrho_{t+1} = \Lambda(\varrho_t)$, as follows (a Python sketch of this update appears after this list):

    $$\varrho_{t+1} = \begin{cases} \Lambda_0(\varrho_t) = \varrho_t p_{ii} + (1 - \varrho_t) p_{ai}, & \text{if } \theta_t = 0,\ b_{t+1} = b_t, \\ \Lambda_A(\varrho_t) = p_{ai}, & \text{if } (\theta_t = 0,\ b_{t+1} \neq b_t) \text{ or } (\theta_t = 1,\ \hat{q}_t = A), \\ \Lambda_I(\varrho_t) = p_{ii}, & \text{if } \theta_t = 1,\ \hat{q}_t = I. \end{cases}$$

    Specifically, if the SU chooses not to perform spectrum sensing, the subsequent belief update is contingent upon two possible scenarios: (1) if the battery state remains unchanged, the belief is updated based solely on the PU state Markov chain; (2) if the battery energy increases, the PU channel in time slot t must have been busy, so $\varrho_{t+1} = p_{ai}$. When the SU performs spectrum sensing, the outcome reflects the actual occupancy status of the spectrum. Equation (9) shows that, from any belief, the SU can transition to only three possible successor beliefs, implying a finite belief space within the T-slot horizon. Consequently, given a finite duration of T time slots, the belief space $\Gamma$ constitutes a finite set.
  • States: The discrete battery energy level of the SU at the start of time slot t is denoted by $b_t \in \mathcal{B} \triangleq \{0, 1, \dots, b_{\max}\}$, where $b_{\max}$ signifies the SU's maximum battery energy level. Consequently, the energy associated with each quantum is $\frac{B_{\max}}{b_{\max}}$ Joules, and the continuous available battery energy $b'_t$ is discretized according to $b_t = \left\lfloor \frac{b'_t b_{\max}}{B_{\max}} \right\rfloor$. The floor function under-reports the available energy, ensuring that any policy feasible in the discretized model is also feasible for the continuous system. Similarly, the continuous channel power gains are quantized into a finite number of levels according to the fading probability density function (PDF). These discrete channel power gain levels are represented by $h_t \in \mathcal{H} \triangleq \{0, 1, 2, \dots, h_{\max}\}$ and $g_t \in \mathcal{G} \triangleq \{0, 1, 2, \dots, g_{\max}\}$, where $h_{\max}$ and $g_{\max}$ denote the peak channel power gain levels for the SU-to-CBS and PU-to-SU links, respectively. The fully observable state in each time slot comprises the AoI state, the SU-to-CBS channel state, the PU-to-SU channel state, and the battery state, represented by $s_t \triangleq (a_t, h_t, g_t, b_t)$; the state space $\mathcal{S} \triangleq (\mathcal{A} \times \mathcal{H} \times \mathcal{G} \times \mathcal{B})$ is finite. Furthermore, the PU spectrum state is partially observable and characterized by the belief $\varrho_t$. Consequently, for $t = 0, 1, \dots, T-1$, the entire system state is represented by $(s_t, \varrho_t)$. Given the finite nature of both $\mathcal{S}$ and $\Gamma$, the SU can only encounter a limited number of possible system states $(s_t, \varrho_t) \in \mathcal{S} \times \Gamma$.
  • Transition probabilities: Given the current state $s_t = (a_t, h_t, g_t, b_t)$ and action $x_t = (\theta_t, \phi_t)$, the probability of transitioning to the next state $s_{t+1} = (a_{t+1}, h_{t+1}, g_{t+1}, b_{t+1})$ is expressed as $p_{x_t}(s_{t+1} \mid s_t)$. Since the harvested energy and the channel power gains are i.i.d., we have

    $$p_{x_t}(s_{t+1} \mid s_t) = P(a_{t+1} \mid a_t, x_t)\, P(b_{t+1} \mid b_t, h_t, g_t, x_t)\, P(h_{t+1})\, P(g_{t+1}),$$

    where

    $$P(a_{t+1} \mid a_t, x_t) = \begin{cases} 1, & \text{if } a_{t+1} = (1 - \phi_t)\, a_t + 1, \\ 0, & \text{otherwise}, \end{cases}$$

    $$P(b_{t+1} \mid b_t, h_t, g_t, x_t) = \begin{cases} 1, & \text{if } \theta_t = 0,\ b_{t+1} = \min\{b_t + E_{H,1}^t, B_{\max}\}, \\ 1, & \text{if } \theta_t = 0,\ b_{t+1} = b_t, \\ 1, & \text{if } \theta_t = 1,\ \phi_t = 0,\ b_{t+1} = \min\{b_t - \delta + E_{H,2}^t, B_{\max}\}, \\ 1, & \text{if } \theta_t = 1,\ \phi_t = 0,\ b_{t+1} = b_t - \delta, \\ 1, & \text{if } \theta_t = 1,\ \phi_t = 1,\ b_{t+1} = b_t - \delta - E_T^t, \\ 0, & \text{otherwise}. \end{cases}$$

    Equation (12) states that the battery state transition probability is 1 if the battery's state changes according to the action actually taken, and 0 otherwise.
  • Cost: In the state $s_t$, the immediate cost is $C(s_t)$, the AoI at time t:

    $$C(s_t) = a_t, \quad t = 0, 1, \dots, T-1.$$
  • Policy: The policy π is defined as a sequence of deterministic decision rules $\{\nu_0, \nu_1, \dots, \nu_{T-1}\}$, where each rule $\nu_t$ maps the system state $(s_t, \varrho_t) \in \mathcal{S} \times \Gamma$ to an action $x_t \in \mathcal{X}$, i.e., $x_t = \nu_t(s_t, \varrho_t)$. In this paper, let Π represent the set of all deterministic decision policies.
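The belief update of Equation (9) can be written as a small Python function; the argument names are ours.

```python
def update_belief(rho: float, theta: int, harvested: bool,
                  sensed_idle: bool, p_ii: float, p_ai: float) -> float:
    """Belief transition of Eq. (9) for the PU-idle probability rho.

    theta       -- 1 if the SU sensed in this slot, else 0
    harvested   -- True if the battery level increased (PU was active)
    sensed_idle -- sensing outcome (only meaningful when theta == 1)
    """
    if theta == 0 and not harvested:
        # No observation: propagate through the PU Markov chain (Lambda_0).
        return rho * p_ii + (1.0 - rho) * p_ai
    if (theta == 0 and harvested) or (theta == 1 and not sensed_idle):
        # The PU was active in slot t (Lambda_A).
        return p_ai
    # Sensing revealed an idle spectrum (Lambda_I).
    return p_ii
```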
Given the SU's initial state $s_0$ and belief $\varrho_0$, the finite-horizon average AoI achieved by following policy π is

$$\bar{A}^\pi(s_0, \varrho_0) = \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} C(s_t) \,\middle|\, s_0, \varrho_0\right],$$

where the expectation is taken with respect to the policy π. Based on the above analysis, determining the optimal sensing and updating policy for minimizing the average AoI of the SU is equivalent to solving

$$\min_{\pi \in \Pi} \bar{A}^\pi(s_0, \varrho_0).$$

Equation (15) represents a finite-state, finite-horizon total-cost MDP for a given T.

3.2. Dynamic Programming-Based POMDP Solution

To solve the finite-horizon total-cost minimization problem given by Equation (15), we apply dynamic programming [40]. The state-value function $V_t(s_t, \varrho_t)$ is defined as

$$V_t(s_t, \varrho_t) \triangleq \min_{\{x_k\}_{k=t}^{T-1}} \mathbb{E}\left[\sum_{k=t}^{T-1} C(s_k) \,\middle|\, s_t, \varrho_t\right].$$

It represents the minimum expected cost accumulated from time slot t to T−1 given the state $(s_t, \varrho_t)$; hence, the minimum average AoI in Equation (15) is $\bar{A}^* = V_0(s_0, \varrho_0)/T$. Similarly, given $(s_t, \varrho_t)$ and a sensing action $\theta_t$, let $Q_t^{\theta_t}(s_t, \varrho_t)$ denote the Q-function (action-value function), which corresponds to the minimum expected cost of taking sensing action $\theta_t$ in state $(s_t, \varrho_t)$. The Q-function consists of two parts: the immediate cost of the action in the current state and the expected sum of state-value functions from the next time slot onward. The finite-horizon MDP problem can be solved recursively via dynamic programming as follows. For $t = 0, 1, \dots, T-1$,
$$V_t(s_t, \varrho_t) = \min_{\theta_t \in \Gamma_\theta} Q_t^{\theta_t}(s_t, \varrho_t),$$

where for $t = T-1$,

$$Q_{T-1}^0(s_{T-1}, \varrho_{T-1}) = C(s_{T-1}) + C(s_T),$$

$$Q_{T-1}^1(s_{T-1}, \varrho_{T-1}) = (1 - \varrho_{T-1})\left[C(s_{T-1}) + C(s_T)\right] + \varrho_{T-1} \min_{\phi_{T-1} \in \Gamma_\phi}\left[C(s_{T-1}) + C(s_T)\right],$$

and for $t = 0, 1, \dots, T-2$,

$$Q_t^0(s_t, \varrho_t) = C(s_t) + \sum_{s_{t+1}} p_{00}(s_{t+1} \mid s_t)\, V_{t+1}(s_{t+1}, \Lambda_0(\varrho_t)),$$

$$Q_t^1(s_t, \varrho_t) = (1 - \varrho_t)\, Q_t^{1A}(s_t, \varrho_t) + \varrho_t \min_{\phi_t \in \Gamma_\phi} Q_t^{1\phi_t}(s_t, \varrho_t),$$

$$Q_t^{1A}(s_t, \varrho_t) = C(s_t) + \sum_{s_{t+1}} p_{10}(s_{t+1} \mid s_t)\, V_{t+1}(s_{t+1}, \Lambda_A(\varrho_t)),$$

$$Q_t^{10}(s_t, \varrho_t) = C(s_t) + \sum_{s_{t+1}} p_{10}(s_{t+1} \mid s_t)\, V_{t+1}(s_{t+1}, \Lambda_I(\varrho_t)),$$

$$Q_t^{11}(s_t, \varrho_t) = C(s_t) + \sum_{s_{t+1}} p_{11}(s_{t+1} \mid s_t)\, V_{t+1}(s_{t+1}, \Lambda_I(\varrho_t)).$$
Specifically, when the sensing action $\theta_t = 1$ is taken and yields the result $\hat{q}_t = A$, the minimum expected cost is $Q_t^{1A}(s_t, \varrho_t)$ in Equation (22), i.e., $x_t = (1, 0)$. Given the sensing action $\theta_t = 1$ and the sensing result $\hat{q}_t = I$ in Equations (23) and (24), $Q_t^{10}(s_t, \varrho_t)$ and $Q_t^{11}(s_t, \varrho_t)$ denote the minimum expected costs of the update actions $\phi_t = 0$ and $\phi_t = 1$, respectively. Then, via the recursion in (17)–(24), the optimal sensing and update policies are obtained as

$$\theta_t^*(s_t, \varrho_t) \in \operatorname*{argmin}_{\theta_t \in \Gamma_\theta} Q_t^{\theta_t}(s_t, \varrho_t),$$

$$\phi_t^*(s_t, \varrho_t) \in \operatorname*{argmin}_{\phi_t \in \Gamma_\phi} Q_t^{1\phi_t}(s_t, \varrho_t).$$
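To make the recursion concrete, the sketch below solves a deliberately stripped-down instance in Python: the channel gains are frozen so the state reduces to (AoI, battery), energies are in integer quanta, the terminal cost is dropped, and all numeric values are illustrative assumptions rather than the paper's settings.

```python
from functools import lru_cache

T, B_MAX, DELTA, E_T = 10, 5, 1, 2   # horizon, battery cap, sense/tx quanta
P_II, P_AI, E_H = 0.8, 0.5, 1        # PU chain and energy quanta per harvest

def lam0(rho):
    """Lambda_0: propagate the belief when nothing is observed."""
    return rho * P_II + (1 - rho) * P_AI

@lru_cache(maxsize=None)
def V(t, a, b, rho):
    """Minimum expected total AoI from slot t onward (simplified Eq. (16))."""
    if t == T:
        return 0.0
    # theta = 0: with probability (1 - rho) the PU is active and the SU
    # harvests (belief -> p_ai); otherwise nothing changes (belief -> Lambda_0).
    q0 = a + (1 - rho) * V(t + 1, a + 1, min(b + E_H, B_MAX), P_AI) \
           + rho * V(t + 1, a + 1, b, lam0(rho))
    best = q0
    if b >= DELTA:  # theta = 1 is feasible
        q1a = a + V(t + 1, a + 1, min(b - DELTA + E_H, B_MAX), P_AI)
        q10 = a + V(t + 1, a + 1, b - DELTA, P_II)
        q11 = a + V(t + 1, 1, b - DELTA - E_T, P_II) \
              if b >= DELTA + E_T else float("inf")
        best = min(best, (1 - rho) * q1a + rho * min(q10, q11))
    return best

print("minimum average AoI:", V(0, 1, B_MAX, 0.5) / T)
```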

4. System Model for RF Energy-Harvesting CRN with Multiple SUs

We further study the long-term average weighted sum-AoI minimization in an RF energy-harvesting CRN with multiple SUs, as shown in Figure 2. During time slot t, either all SUs concurrently capture energy from the PU transmissions, or a single SU is designated for spectrum sensing while another (possibly the same) SU is selected for accessing the available spectrum. Given the similarity to the single-SU case, we do not repeat the PU model and belief model in this section and only briefly describe the SU model as follows.

Secondary Users Model

The central controller (CC) makes the sensing and update decisions for the SUs and broadcasts them at the beginning of each time slot t. A decision that none of the SUs sense the spectrum means that all SUs capture energy from the PU transmissions over the entire time slot. A decision to sense the spectrum specifies which SU is selected for sensing; while the selected SU senses, the other SUs harvest energy. We assume perfect sensing outcomes for the SUs. If the sensing result indicates that the spectrum is occupied by the PU, all SUs harvest energy for the remainder of the time slot. If, on the other hand, the spectrum is sensed to be vacated by the PU, the CC further decides whether to assign one SU to deliver a status update data packet, and if so, which one. By making optimal decisions sequentially across time slots, the CC endeavors to minimize the long-term average weighted sum-AoI of the SUs. The decision taken in time slot t is denoted by $x_t = (\theta_t, \phi_t)$, where $\theta_t \in \{0, 1, \dots, N\}$ and $\phi_t \in \{0, 1, \dots, N\}$. In particular, $\theta_t = 0$ means that all SUs capture energy from the PU transmissions, while $\theta_t = n \in \{1, 2, \dots, N\}$ means that SU n is selected to sense the spectrum. Likewise, $\phi_t = 0$ means that no SU is designated to deliver a status update data packet, whereas $\phi_t = n \in \{1, 2, \dots, N\}$ means that SU n is selected to deliver it. Decisions are made based on the individual state of each SU and the SUs' shared statistical knowledge of the PU activity, as discussed below.
(1) Energy-harvesting model: Energy harvesting can occur in two scenarios: (1) at the beginning of the time slot, no sensing action is performed and the PU remains active; (2) a sensing decision is made at the beginning of the time slot, and the sensing result indicates that the spectrum is busy. In the second case, all SUs start to harvest energy when the sensing action ends, and the non-sensing SUs additionally harvest energy while the sensing SU is engaged in sensing. Let $g_{n,t}$ represent the channel power gain from the PU to the n-th SU. The energy captured by the n-th SU is given by

$$E_{H_m,n}^t = \tau \eta P g_{n,t}, \quad t = 0, 1, \dots, T-1,\ m = 1, 2,$$

where the two values of m correspond to the two energy-harvesting cases described above. Let $\delta_n$ and $\tau_{s,n}$ denote the energy and time consumed by sensing, respectively, for the n-th SU. Additionally, let $E_{T,n}^t$ and $\tau_{t,n}$ represent the energy and time consumed by transmission in time slot t for the n-th SU, respectively. According to Shannon's formula, $E_{T,n}^t$ is expressed as

$$E_{T,n}^t = \frac{\sigma^2 \tau_{t,n}}{h_{t,n}}\left(2^{\frac{S}{\tau_{t,n} W}} - 1\right),$$
where $h_{t,n}$ represents the channel power gain from the n-th SU to the CBS. The n-th SU possesses a battery capacity of $B_{\max,n}$ Joules.
The battery state in time slot t for the n-th SU is denoted by $b_{t,n}$, which evolves as

$$b_{t+1,n} = \min\{b_{t,n} + E_{H_m,n}^t - \mathbb{1}(\theta_t = n)\,\delta_n - \mathbb{1}(\phi_t = n)\,E_{T,n}^t,\ B_{\max,n}\}, \quad t = 0, 1, \dots, T-1,$$

where $\mathbb{1}(\cdot)$ denotes the indicator function. The energy causality constraint for the n-th SU is satisfied by

$$\mathbb{1}(\theta_t = n)\,\delta_n + \mathbb{1}(\phi_t = n)\,E_{T,n}^t \le b_{t,n}, \quad t = 0, 1, \dots, T-1.$$
(2) Age of information: Let $a_{t,n}$ denote the AoI in time slot t for the n-th SU's observed process n. We assume that the upper bound of $a_{t,n}$ is $A_{\max,n}$, which can be selected as an arbitrarily large value, i.e., $a_{t,n} \in \mathcal{A}_n \triangleq \{1, 2, \dots, A_{\max,n}\}$. Note that when $a_{t,n}$ reaches $A_{\max,n}$, the information available at the CBS regarding process n is considered excessively outdated, rendering further tracking unnecessary. The n-th SU's AoI evolves over time slots as

$$a_{t+1,n} = \begin{cases} 1, & \text{if } \phi_t = n, \\ a_{t,n} + 1, & \text{otherwise}. \end{cases}$$

5. Infinite POMDP Formulation for RF Energy-Harvesting CRN with Multiple SUs

5.1. POMDP Formulation

To achieve the minimization of the SUs' long-term weighted sum-AoI, the optimal sensing and update decisions are modeled as a POMDP. The constituent elements of this POMDP are detailed below. Note that the policy description and some other details are omitted here for brevity; please refer to Section 3 for a more detailed explanation.
  • Actions: The CC decides the sensing SU and the updating SU. The action implemented within each time slot is $x_t = (\theta_t, \phi_t) \in \mathcal{X} \triangleq \{(0,0), (1,0), (1,1), \dots, (N,N) : b_{t,n} \ge \mathbb{1}(\theta_t = n)\,\delta_n + \mathbb{1}(\phi_t = n)\,E_{T,n}^t\}$, where $\theta_t \in \Gamma_\theta \triangleq \{0, 1, \dots, N : b_{t,n} \ge \mathbb{1}(\theta_t = n)\,\delta_n\}$ and $\phi_t \in \Gamma_\phi \triangleq \{0, 1, \dots, N : b_{t,n} \ge \delta_n + \mathbb{1}(\phi_t = n)\,E_{T,n}^t\}$.
  • Observations and beliefs: Let $\varrho_t \in \mathcal{R} \triangleq \{0, 1, 2, \dots, \varrho_{\max}\}$ denote the discrete belief level at the beginning of time slot t, where $\varrho_{\max}$ represents the maximum belief level. The continuous belief $\varrho'_t$ is converted into the discrete belief level according to $\varrho_t = \left\lfloor \frac{\varrho'_t}{1/\varrho_{\max}} \right\rfloor$.
  • States: The discrete battery energy level of the n-th SU at the start of time slot t is denoted by $b_{t,n} \in \mathcal{B}_n \triangleq \{0, 1, \dots, b_{\max,n}\}$, where $b_{\max,n}$ represents the maximum energy level of the n-th SU's battery. Thus, each energy quantum of the n-th SU's battery corresponds to $\frac{B_{\max,n}}{b_{\max,n}}$ Joules, and the n-th SU's continuous battery energy $b'_{t,n}$ is converted into the discrete battery energy level according to $b_{t,n} = \left\lfloor \frac{b'_{t,n} b_{\max,n}}{B_{\max,n}} \right\rfloor$. Likewise, the continuous channel power gains for the links between the n-th SU and the CBS, and between the PU and the n-th SU, are mapped to discrete levels $h_{t,n} \in \mathcal{H}_n \triangleq \{0, 1, 2, \dots, h_{\max,n}\}$ and $g_{t,n} \in \mathcal{G}_n \triangleq \{0, 1, 2, \dots, g_{\max,n}\}$, where $h_{\max,n}$ and $g_{\max,n}$ signify the upper bounds of the channel power gain levels from the n-th SU to the CBS and from the PU to the n-th SU, respectively. The completely observable state of the n-th SU in any time slot t comprises its AoI value, the channel condition between it and the CBS, the channel condition from the PU to it, and its residual battery energy, denoted by $s_{t,n} \triangleq (a_{t,n}, h_{t,n}, g_{t,n}, b_{t,n})$. The state of all the SUs in time slot t is represented by $s_t \in \mathcal{S} = \{s_{t,n}\}_{n \in \mathcal{N}}$. Integrating the PU spectrum belief, the complete system state is denoted by $(s_t, \varrho_t)$.
  • Transition probabilities: For the n-th SU, the transition probability from the current state $s_{t,n} = (a_{t,n}, h_{t,n}, g_{t,n}, b_{t,n})$ to the next state $s_{t+1,n} = (a_{t+1,n}, h_{t+1,n}, g_{t+1,n}, b_{t+1,n})$ under the action $x_t = (\theta_t, \phi_t)$ is given by

    $$p_{x_t}(s_{t+1,n} \mid s_{t,n}) = P(a_{t+1,n} \mid a_{t,n}, x_t)\, P(b_{t+1,n} \mid b_{t,n}, h_{t,n}, g_{t,n}, x_t)\, P(h_{t+1,n})\, P(g_{t+1,n}),$$

    where

    $$P(a_{t+1,n} \mid a_{t,n}, x_t) = \begin{cases} 1, & \text{if } a_{t+1,n} = (1 - \mathbb{1}(\phi_t = n))\, a_{t,n} + 1, \\ 0, & \text{otherwise}, \end{cases}$$

    $$P(b_{t+1,n} \mid b_{t,n}, h_{t,n}, g_{t,n}, x_t) = \begin{cases} 1, & \text{if } \theta_t = 0,\ b_{t+1,n} = \min\{b_{t,n} + E_{H_1,n}^t, B_{\max,n}\}, \\ 1, & \text{if } \theta_t = 0,\ b_{t+1,n} = b_{t,n}, \\ 1, & \text{if } \theta_t = n,\ \phi_t = 0,\ b_{t+1,n} = \min\{b_{t,n} - \delta_n + E_{H_2,n}^t, B_{\max,n}\}, \\ 1, & \text{if } \theta_t \neq n,\ \phi_t = 0,\ b_{t+1,n} = \min\{b_{t,n} + E_{H_1,n}^t, B_{\max,n}\}, \\ 1, & \text{if } \theta_t = n,\ \phi_t = 0,\ b_{t+1,n} = b_{t,n} - \delta_n, \\ 1, & \text{if } \theta_t = n,\ \phi_t = n,\ b_{t+1,n} = b_{t,n} - \delta_n - E_{T,n}^t, \\ 1, & \text{if } \theta_t = n,\ \phi_t \neq n,\ b_{t+1,n} = b_{t,n} - \delta_n, \\ 1, & \text{if } \theta_t \neq n,\ \phi_t = n,\ b_{t+1,n} = b_{t,n} - E_{T,n}^t, \\ 1, & \text{if } \theta_t \neq n,\ \theta_t \neq 0,\ \phi_t \neq n,\ \phi_t \neq 0,\ b_{t+1,n} = b_{t,n}, \\ 0, & \text{otherwise}. \end{cases}$$

    Since the SUs' states evolve independently given the action, the overall transition probability is given by

    $$P_{x_t}(s_{t+1} \mid s_t) = \prod_{n \in \mathcal{N}} p_{x_t}(s_{t+1,n} \mid s_{t,n}).$$
  • Cost: Let the immediate cost incurred in time slot t in state $s_t$ be denoted by $C(s_t)$, quantifying the weighted sum-AoI at that time instant:

    $$C(s_t) = \sum_{n=1}^{N} \beta_n a_{t,n}, \quad t = 0, 1, \dots, T-1,$$

    where $\beta_n \ge 0$ and $\sum_{n=1}^{N} \beta_n = 1$. Here, $\beta_n$ signifies a weighting parameter that modulates the importance of the physical process n as observed at the CBS.
Then, the optimal policy can be derived by solving the Bellman equations presented below [40]:

$$\bar{A}^* + V(s_t, \varrho_t) = \min_{x_t \in \mathcal{A}(s_t, \varrho_t)} Q(s_t, \varrho_t, x_t), \quad \forall (s_t, \varrho_t) \in \mathcal{S} \times \mathcal{R},$$

where $\bar{A}^*$ signifies the optimal average weighted sum-AoI, which does not depend on the initial state $(s_0, \varrho_0)$; $V(s_t, \varrho_t)$ is the state-value function; $\mathcal{A}(s_t, \varrho_t)$ is the set of feasible actions in state $(s_t, \varrho_t)$; and $Q(s_t, \varrho_t, x_t)$ is the Q-function, given by

$$Q(s_t, \varrho_t, x_t) = \sum_{n=1}^{N} \beta_n a_{t,n} + \sum_{s_{t+1} \in \mathcal{S}} P_{x_t}(s_{t+1} \mid s_t)\, V(s_{t+1}, \varrho_{t+1}).$$

Therefore, given all the SUs' states $s_t$ and the belief $\varrho_t$, the optimal action is given by

$$\pi^*(s_t, \varrho_t) = \operatorname*{argmin}_{x_t \in \mathcal{A}(s_t, \varrho_t)} Q(s_t, \varrho_t, x_t).$$
Given the intractably large state space that grows exponentially with the number of SU nodes and the granularity of state discretization, we propose a DRL approach in the subsequent subsection to determine the optimal policy.

5.2. DRL-Based POMDP Solution

A widely used RL algorithm is Q-learning, which has been extensively applied to network resource optimization. In the Q-learning algorithm, the Q-function value of the current state–action pair is updated at the beginning of each time slot, based on the action taken and the resulting next state [41]. Specifically, the Q-learning update for the present problem, performed at the onset of time slot t+1, is given by

$$Q_{t+1}(s_t, \varrho_t, x_t) = Q_t(s_t, \varrho_t, x_t) + \alpha(t)\Big(C(s_t) + \min_{\bar{x} \in \mathcal{A}(s_{t+1}, \varrho_{t+1})} Q_t(s_{t+1}, \varrho_{t+1}, \bar{x}) - \min_{\bar{x} \in \mathcal{A}(\bar{s}, \bar{\varrho})} Q_t(\bar{s}, \bar{\varrho}, \bar{x}) - Q_t(s_t, \varrho_t, x_t)\Big),$$

where $\alpha(t)$ denotes the learning rate in time slot t, $\mathcal{A}(s_{t+1}, \varrho_{t+1}) \subseteq \mathcal{X}$ and $\mathcal{A}(\bar{s}, \bar{\varrho}) \subseteq \mathcal{X}$ are the feasible action sets in the states $(s_{t+1}, \varrho_{t+1})$ and $(\bar{s}, \bar{\varrho})$, and $(\bar{s}, \bar{\varrho})$ denotes a time-invariant reference state, fixed across all iterations and chosen arbitrarily; subtracting the reference state's Q-value keeps the Q-function bounded under the average-cost criterion. Based on (40), the system leverages the learned values by choosing the action that yields the minimum Q-function value in the current state. However, for the algorithm to converge, exhaustive exploration of the state–action space is required. Consequently, the ϵ-greedy policy is adopted, whereby a randomized action is chosen in the present state with probability $0 < \epsilon < 1$; this allows the system to explore the environment rather than solely exploiting the learned knowledge.
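A minimal tabular sketch of this relative Q-learning update, with placeholder states and actions and hypothetical helper names (`greedy`, `q_update`), might look as follows.

```python
import random
from collections import defaultdict

# Q[(state, action)] holds the relative Q-values; missing entries are 0.
Q = defaultdict(float)
REF_STATE = "ref"        # arbitrary, time-invariant reference state
ALPHA, EPS = 0.1, 0.1    # learning rate alpha(t) and exploration rate

def greedy(state, actions):
    """Action with the minimum Q-value (costs are minimized)."""
    return min(actions, key=lambda x: Q[(state, x)])

def q_update(s, x, cost, s_next, next_actions, ref_actions):
    """One relative Q-learning step per Eq. (40)."""
    target = cost + Q[(s_next, greedy(s_next, next_actions))] \
                  - Q[(REF_STATE, greedy(REF_STATE, ref_actions))]
    Q[(s, x)] += ALPHA * (target - Q[(s, x)])

def choose_action(state, actions):
    """Epsilon-greedy exploration over the feasible action set."""
    if random.random() < EPS:
        return random.choice(list(actions))
    return greedy(state, actions)
```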
Employing the Q-learning algorithm in isolation proves effective when the cardinality of the system's state space is limited. However, in scenarios involving an exceedingly large number of states, as encountered in our problem, maintaining the Q-function values for all state–action combinations becomes infeasible, and ensuring comprehensive visitation of all such pairs is also challenging, thus impeding convergence. To overcome this challenge, the DQN approach is utilized. While DQN retains the fundamental learning steps of Q-learning, it employs a DNN, $Q(s, \varrho, x \mid \xi)$, to approximate the Q-function, with ξ signifying the vector of the DNN's parameters. To ensure that the Q-function approximated by the DNN is as close as possible to the optimal Q-function, the optimal values of ξ must be found. To this end, a loss function for any tuple $(s_t, \varrho_t, x_t, C(s_t), s_{t+1}, \varrho_{t+1})$ is defined as

$$L(\xi_{t+1}) = \Big(C(s_t) + \min_{\bar{x} \in \mathcal{A}(s_{t+1}, \varrho_{t+1})} Q(s_{t+1}, \varrho_{t+1}, \bar{x} \mid \xi_t) - \min_{\bar{x} \in \mathcal{A}(\bar{s}, \bar{\varrho})} Q(\bar{s}, \bar{\varrho}, \bar{x} \mid \xi_t) - Q(s_t, \varrho_t, x_t \mid \xi_{t+1})\Big)^2.$$
Moreover, a replay memory is leveraged to store historical experiences, each comprising the current state, the action performed, the immediate cost, and the resultant next state. After each time slot, a random batch of past experiences of a finite size is sampled from the replay memory, and the gradient of the loss with respect to the DNN's weights is computed as

$$\nabla_{\xi_{t+1}} L(\xi_{t+1}) = -2\Big(C(s_t) + \min_{\bar{x} \in \mathcal{A}(s_{t+1}, \varrho_{t+1})} Q(s_{t+1}, \varrho_{t+1}, \bar{x} \mid \xi_t) - \min_{\bar{x} \in \mathcal{A}(\bar{s}, \bar{\varrho})} Q(\bar{s}, \bar{\varrho}, \bar{x} \mid \xi_t) - Q(s_t, \varrho_t, x_t \mid \xi_{t+1})\Big)\, \nabla_{\xi_{t+1}} Q(s_t, \varrho_t, x_t \mid \xi_{t+1}).$$

This gradient is then used to update the weights of the DNN.
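As an illustration of Equations (41) and (42), the following PyTorch sketch performs one mini-batch training step with the relative target; the network sizes, hyperparameters, and batch layout are our assumptions, and the action-correction and penalty logic of Algorithm 1 is omitted here.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 13, 7   # illustrative sizes for 3 SUs plus belief
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
ref_state = torch.zeros(1, STATE_DIM)  # arbitrary fixed reference state

def train_step(batch):
    """One mini-batch step; batch holds (state, action, cost, next_state)."""
    s = torch.stack([b[0] for b in batch])
    x = torch.tensor([b[1] for b in batch])
    c = torch.tensor([b[2] for b in batch])
    s1 = torch.stack([b[3] for b in batch])
    with torch.no_grad():  # the relative target of Eq. (41)
        target = c + q_net(s1).min(dim=1).values - q_net(ref_state).min()
    q_sx = q_net(s).gather(1, x.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sx, target)
    optimizer.zero_grad()
    loss.backward()        # gradient of Eq. (42), computed by autograd
    optimizer.step()
    return loss.item()
```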
Due to the partial observability of the PU spectrum, our problem cannot be solved by the traditional DQN. The actions generated by the DQN must belong to the environment's action space, and the next state is determined by the environment dynamics. On the one hand, since our input contains only the belief of PU spectrum availability, the DQN may generate an infeasible action; in such cases, we must correct the DQN's output action based on the actual sensed state of the PU spectrum. On the other hand, in our implementation, we correct the PU spectrum state under the current state based on the action output by the DQN in each time slot; as training proceeds, the DQN predicts the PU spectrum status increasingly accurately. For example, for the action $x_t = (1, 2)$, if the PU is in the active state, the second SU cannot perform the update. Thus, we make some adjustments to the traditional DQN to adapt it to our problem. When the sensing action ends, we correct the pre-selected action according to the real state of the PU spectrum. For the action $x_t = (1, 2)$ just mentioned, the first SU first senses the spectrum; given the sensing result, we decide whether to adjust the pre-selected action. If the sensing result is that the PU is idle, the second SU delivers the status update data packet during the remaining duration of this time slot. Otherwise, the pre-selected action is corrected to one in which all the SUs capture energy during the remaining duration of this time slot, i.e., the corresponding real action is $\hat{x}_t = (1, 0)$.
Overall, there are two cases in which the pre-selected action is corrected: (1) the pre-selected action assumes a vacant PU spectrum, but the spectrum is actually busy; (2) the pre-selected action assumes an active PU spectrum, but the spectrum is actually idle.
In the first case, after the sensing action, we correct the pre-selected action to one in which all the SUs harvest energy for the remainder of the time slot. In the second case, after the sensing action, we correct the pre-selected action so that the SU with the highest weight performs the update in the remainder of the time slot. Thus, the real action may be the pre-selected one or a different one. Specifically, when the pre-selected action does not meet the spectrum constraint, we assign the pre-selected action a penalty p instead of the immediate cost $C(s_t)$ to discourage it from being chosen, and we store the pre-selected action together with its corresponding real action as an experience in the replay memory. Similarly, when the real action does not meet the energy causality constraint, in addition to assigning a penalty p instead of the immediate cost $C(s_t)$, the pre-selected action is reset to $x_t = (0, 0)$ to avoid it being chosen, and we store the real action before the reset together with the real action after the reset as an experience in the replay memory.
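A sketch of this correction rule is given below; the helper names and the penalty value are hypothetical, and the replay-memory bookkeeping is omitted.

```python
PENALTY = 50.0  # penalty cost p assigned when a pre-selected action is corrected

def correct_action(theta, phi, pu_idle, weights, batteries, e_tx):
    """Map a pre-selected DQN action (theta, phi) to the real action.

    Returns (real_action, penalized): penalized is True when the
    pre-selected action contradicted the sensed PU spectrum state.
    """
    if theta == 0:                       # no sensing: all SUs harvest
        return (0, 0), False
    if not pu_idle and phi != 0:
        # Case 1: an update was pre-selected but the spectrum is busy,
        # so all SUs harvest for the rest of the slot.
        return (theta, 0), True
    if pu_idle and phi == 0:
        # Case 2: the spectrum is idle but no updater was pre-selected:
        # pick the highest-weight SU whose battery covers the transmission.
        feasible = [n for n in range(1, len(weights) + 1)
                    if batteries[n - 1] >= e_tx[n - 1]]
        if feasible:
            best = max(feasible, key=lambda n: weights[n - 1])
            return (theta, best), True
        return (theta, 0), True
    return (theta, phi), False           # pre-selected action was consistent
```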
Through simulations, we observe that as the number of training cycles increases, the chosen actions gradually become consistent with the steady-state probabilities of the PU spectrum state, and actions that violate the energy causality constraint are almost never chosen. In particular, the new DQN approach we propose is adapted to the POMDP problem with the partially observable state modeled as a Markov chain. The algorithmic steps of the proposed DQN approach are summarized in Algorithm 1. Here, lines 13 and 17 specify the operations required when the DQN generates an infeasible action (i.e., when the battery level would fall below zero). Lines 19 to 21 indicate the operations performed when the pre-selected action and the actual action do not match. Lines 22 to 23 indicate the operations performed when the pre-selected action and the actual action are equal. Lines 26 to 27 indicate that a batch of samples is randomly drawn from the replay memory for training the DQN network. Although the use of the DQN helps address complexity, the proposed solution may still face scalability challenges as network size and environmental dynamics grow [42,43].
Algorithm 1: The new DQN for average weighted sum-AoI minimization

6. Numerical Results and Discussions

6.1. Single-SU Finite-Horizon AoI Evaluation

This section evaluates the performance of the single-SU scenario via numerical results. The state transition probabilities of the PU are set as $p_{ii} = 0.8$ and $p_{ai} = 0.5$. The channel power gains between the SU and the CBS, and between the PU and the SU, are modeled as $h = \Upsilon \Psi^2 d_1^{-\kappa}$ and $g = \Upsilon \Psi^2 d_2^{-\kappa}$ [44]. Here, $d_1$ and $d_2$ are the distances from the SU to the CBS and from the PU to the SU, respectively; $\Upsilon$ signifies the signal power gain at a reference distance of one meter; $\Psi \sim \exp(1)$ follows an exponential distribution with a mean of 1, representing the small-scale fading gain; and $d_1^{-\kappa}$ and $d_2^{-\kappa}$ represent the conventional power-law path loss with path loss exponent κ. The proposed policy is compared to the myopic policy [31], a common benchmark. Under the myopic policy, provided that the SU possesses adequate energy for spectrum sensing, it senses; otherwise, it harvests energy from the PU's transmissions. When spectrum sensing indicates that the channel is occupied, the SU harvests energy for the remainder of the time slot; otherwise, it transmits the status update data packet, contingent upon the residual energy being sufficient for an update. The parameter values used in the simulations are listed in Table 1.
Figure 3 illustrates a representative trajectory of the AoI achieved by employing the optimal policy. The PU-SU link distance is set to 5 m, and the SU-CBS link distance is configured at 25 m. The transmission power of the PU is set to 35 dBm, the battery has a capacity of 0.5 mJoules, and performing the sensing action requires an energy expenditure of one quantum. The corresponding system states and actions are listed in Table 2. The plot visually demonstrates the evolving trend of AoI across the time slots. Furthermore, it is observed that despite having adequate energy resources, the SU does not engage in spectrum sensing, which underscores the foresight inherent in our proposed strategy when compared to the myopic policy.
Figure 4 presents the relationship between battery capacity and AoI, with the PU’s transmit power set at 35 dBm, the status update data packet size at 15 Mbits, and the energy expenditure for sensing fixed at one energy quantum, based on a maximum battery capacity of 0.5 mJoules. The SU is located 5 m away from the PU. The presented results distinctly evidence the advantageous nature of the optimal policy over the myopic policy. We note that the average AoI exhibits a decreasing trend with an increase in battery capacity and a reduction in the distance between the SU and the CBS. This is attributed to the capacity of a larger battery to accommodate greater energy reserves, coupled with the fact that a diminished distance between the SU and the CBS curtails the energy expenditure necessary for the transmission of the status update data packet. Consequently, this enhances the likelihood of the SU possessing sufficient energy for transmitting the status update data packet, thereby leading to a reduction in its observed AoI.
Figure 5 illustrates the correlation between the status update data packet size and AoI, with the PU’s transmit power set at 35 dBm, the battery capacity at 0.2 mJoules, the energy expenditure for sensing at 0.125 mJoules, and the distance from the SU to the CBS fixed at 25 m. We note that the AoI exhibits an increasing trend with an increase in the size of the status update data packet and the distance between the PU and SU. This phenomenon arises from the fact that transmitting larger status update data packets necessitates greater energy expenditure, and an increased separation between the PU and SU leads to a reduction in the energy harvested by the SU. Consequently, the probability of the SU possessing sufficient energy for update packet transmission diminishes, resulting in an elevated AoI for the observed process.
Figure 6 illustrates the correlation between the PU's transmission power and the AoI under specific conditions: a battery capacity of 0.2 mJoules, a status update packet size of 15 Mbits, a PU-to-SU distance of 5 m, and an SU-to-CBS distance of 25 m. The data reveal an inverse relationship between the PU's transmit power and the average AoI: specifically, higher transmit power correlates with a lower average AoI. Furthermore, a reduction in energy expenditure for the sensing operation is also associated with a decrease in the average AoI. The rationale is that a higher transmit power from the PU increases the amount of energy that can be harvested and stored in the battery, while lower sensing energy expenditure preserves a greater energy reserve following the sensing phase. As a result, the probability of the SU possessing adequate energy to transmit the status update is elevated, which in turn decreases the AoI of the monitored process. Moreover, the data reveal that under conditions of elevated sensing energy expenditure, the optimal policy exhibits a considerably more pronounced performance advantage over the myopic policy than in scenarios with lower sensing energy consumption. This disparity arises because the myopic policy, when confronted with high sensing energy demands and PU spectrum occupancy, tends to expend more energy on superfluous sensing operations, whereas the optimal policy strategically manages the spectrum sensing decision, thereby mitigating energy wastage on redundant sensing activities.

6.2. Multiple-SU Infinite-Horizon AoI Evaluation

In this section, we evaluate the performance of the multiple-SU scenario through numerical results. We compare the proposed DQN approach with the myopic policy. For the myopic policy, we first sort the SUs in descending order of their weights. We then assign the first SU to sense the spectrum if it has enough energy; otherwise, we assign the second-highest-weighted SU, and so on. If no SU has sufficient energy to perform sensing, all SUs harvest energy from the PU transmissions. If the sensing result indicates that the spectrum is occupied, all SUs begin harvesting energy once the sensing action concludes. Upon determining that the spectrum is unoccupied, the highest-weighted SU with sufficient energy reserves is selected to transmit the status update data packet; if its energy is insufficient, the second-highest-weighted SU is assigned, and so on. In the simulations, unless otherwise specified, the parameter settings remain consistent with the single-SU scenario. Specifically, there are three SUs, each with a weight of 1/3. There is no fixed rule dictating the exact number of SUs; the ideal quantity often depends on network scale, the available spectrum, and interference management. In many research studies and simulations, using three SUs is common: it strikes a good balance, simple enough to model and analyze yet complex enough to demonstrate interactions and potential interference scenarios among different users. The value of $\beta_n$ signifies that each SU's observed physical process holds equal importance for the CBS. The discrete power gain levels for the channels connecting the PU to each SU, and likewise from each SU to the CBS, are uniformly set to 5. The upper limit for the AoI of each SU's monitored process is set to 10, and each SU's discrete battery energy level is initialized to 10 units. Furthermore, the battery capacity of each SU, the distance between the PU and each SU, and the distance between each SU and the CBS are all uniform across the system.
Figure 7 depicts the temporal evolution of the average weighted sum-AoI and the individual AoI of each SU across successive time slots. Under the conditions examined, the PU's transmission power is 35 dBm, the status update packet size is 15 Mbits, each SU's battery capacity is 0.5 mJoules, the distance from each SU to the CBS is 10 m, and the distance from the PU to each SU is 5 m. Each spectrum sensing operation incurs an energy expenditure of one quantum unit. From Figure 7, it is observed that the average weighted sum-AoI gradually stabilizes, which demonstrates the effectiveness of the proposed DQN approach. In order to clearly show the trend of each SU's AoI over the time slots, we plot only the first 30 time slots in Figure 7. We observe that although each SU's weight is set to the same value, the peak AoI of the first SU is lower than those of the other two SUs, and the AoIs of the second and third SUs fluctuate more than that of the first SU.
Figure 8 illustrates the relationship between the battery capacity of each SU and the resulting average weighted sum-AoI, given a PU transmit power of 35 dBm, a status update data packet size of 15 Mbits, and a distance of 5 m between the PU and each SU. The energy cost associated with each sensing operation is one quantum unit, determined with reference to a battery capacity of 0.5 mJoules per SU. It can be observed that the proposed DQN approach significantly outperforms the myopic policy. Consistent with the findings in the single-SU scenario, a reduction in the distance between each SU and the CBS, as well as an increase in the battery capacity of each SU, leads to a decrease in the average weighted sum-AoI.
Figure 9 depicts the relationship between the status update data packet size and the average weighted sum-AoI, under a PU transmit power of 15 dBm, a battery capacity of 0.5 mJ per SU, a sensing energy cost of one quantum unit, and a 20 m distance from each SU to the CBS. The results show that the average weighted sum-AoI increases with both the packet size and the distance between the PU and each SU.
Figure 10 illustrates the relationship between the PU's transmit power and the average weighted sum-AoI, given a battery capacity of 0.5 mJ per SU, a status update data packet size of 14 Mbits, a PU-to-SU distance of 5 m, and an SU-to-CBS distance of 25 m. The average weighted sum-AoI decreases as the PU's transmit power increases, since each SU can harvest more RF energy per slot; likewise, a lower sensing energy cost yields a lower average weighted sum-AoI.
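A back-of-envelope calculation illustrates this trend. Assuming a standard RF-harvesting model E = η P g d^(−κ) τ, which is our illustrative assumption rather than the exact model used in the paper, with η = 0.5, κ = 2, and τ_s = 0.2 s taken from Table 1 and the channel gain normalized to one, the per-slot harvested energy grows linearly with the PU's transmit power, so fewer harvesting slots are needed per update:

    # Illustrative only: the harvesting model below is an assumption, with
    # eta, kappa, and tau taken from Table 1 and the channel gain set to 1.
    eta, kappa, tau = 0.5, 2, 0.2
    gain, d = 1.0, 5.0                        # normalized gain, PU-SU distance (m)
    for p_dbm in (15, 25, 35):
        p_watt = 10 ** ((p_dbm - 30) / 10)    # dBm -> watts
        e_mj = eta * p_watt * gain * d ** (-kappa) * tau * 1e3
        print(f"{p_dbm} dBm -> {e_mj:.2f} mJ harvested per slot")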

7. Conclusions

In this paper, we first investigate a single RF energy-harvesting SU with the goal of AoI minimization. We formulate the average AoI minimization problem as a POMDP, subject to energy causality and spectrum constraints. Then, dynamic programming is used to find the optimal decisions regarding energy harvesting, spectrum sensing, and information updating for AoI minimization. Furthermore, we extend our study to multiple RF energy-harvesting SUs, aiming to minimize the long-term average weighted sum-AoI. This problem is also formulated as a POMDP, and we propose an improved DQN to solve it. The numerical results underscore the influence of various system parameters on overall performance and clearly demonstrate the substantial performance gains achieved by the proposed policies over myopic approaches. For future work, we will investigate alternative DRL approaches for CRNs, with special attention to imperfect sensing.
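For readers who wish to experiment, the sketch below gives a minimal, generic DQN training step (replay buffer, target network, temporal-difference update) in PyTorch. It illustrates the family of methods used here, not the enhanced DQN proposed in this paper; the network sizes, hyperparameters, and the convention that the reward is the negative weighted sum-AoI are illustrative assumptions, and terminal-state handling is omitted for brevity.

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Small MLP mapping an observed state to one Q-value per action."""
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, x):
            return self.net(x)

    def train_step(q, q_target, buffer, optimizer, batch_size=32, gamma=0.99):
        """One TD update on a minibatch of (state, action, reward, next_state)."""
        if len(buffer) < batch_size:
            return
        batch = random.sample(buffer, batch_size)
        s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
        with torch.no_grad():                 # bootstrap from the frozen target net
            target = r + gamma * q_target(s2).max(dim=1).values
        pred = q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Typical wiring: buffer = deque(maxlen=10_000) holds transitions whose
    # reward is, e.g., the negative weighted sum-AoI; q_target is refreshed
    # periodically via q_target.load_state_dict(q.state_dict()).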

Author Contributions

Conceptualization, J.S. and S.Z.; Methodology, X.Y.; Software, J.S.; Validation, S.Z. and X.Y.; Formal analysis, J.S. and S.Z.; Resources, S.Z.; Data curation, S.Z.; Writing—original draft, J.S.; Writing—review & editing, J.S.; Supervision, X.Y.; Project administration, S.Z. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Federal Communications Commission. Spectrum Policy Task Force Report; FCC: Washington, DC, USA, 2002.
  2. Zheng, K.; Jia, X.; Chi, K.; Liu, X. DDPG-based joint time and energy management in ambient backscatter-assisted hybrid underlay CRNs. IEEE Trans. Commun. 2023, 71, 441–456.
  3. Ghosh, S.; Maity, S.P.; Chakraborty, C. On EE maximization in D2D-CRN with eavesdropping using LSTM-based channel estimation. IEEE Trans. Consum. Electron. 2024, 70, 3906–3913.
  4. Wu, Y.; Zhou, F.; Wu, W.; Wu, Q.; Hu, R.Q.; Wong, K.-K. Multi-objective optimization for spectrum and energy efficiency tradeoff in IRS-assisted CRNs with NOMA. IEEE Trans. Wirel. Commun. 2022, 21, 6627–6642.
  5. Zheng, K.; Liu, X.; Liu, X.; Zhu, Y. Hybrid overlay-underlay cognitive radio networks with energy harvesting. IEEE Trans. Wirel. Commun. 2019, 67, 4669–4682.
  6. Thakur, S.; Singh, A.; Majhi, S. Secrecy analysis of underlay CRN in the presence of correlated and imperfect channel. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 754–764.
  7. Bi, S.; Zeng, Y.; Zhang, R. Wireless powered communication networks: An overview. IEEE Wirel. Commun. 2016, 23, 10–18.
  8. Zhang, G.; Xu, J.; Wu, Q.; Cui, M.; Li, X.; Lin, F. Wireless powered cooperative jamming for secure OFDM system. IEEE Trans. Veh. Technol. 2018, 67, 1331–1346.
  9. Sudevalayam, S.; Kulkarni, P. Energy harvesting sensor nodes: Survey and implications. IEEE Commun. Surveys Tuts. 2011, 13, 443–461.
  10. Ju, H.; Zhang, R. Throughput maximization in wireless powered communication networks. IEEE Trans. Wirel. Commun. 2014, 13, 418–428.
  11. Chen, Y.; Zhao, Q.; Swami, A. Distributed spectrum sensing and access in cognitive radio networks with energy constraint. IEEE Trans. Signal Process. 2009, 57, 783–797.
  12. Zhang, S.; Bao, S.; Chi, K.; Yu, K.; Mumtaz, S. DRL-based computation rate maximization for wireless powered multi-AP edge computing. IEEE Trans. Commun. 2024, 72, 1105–1118.
  13. Chi, K.; Sun, J.; Zhang, S.; Huang, L. Secrecy rate maximization for multicarrier-based cognitive radio networks with an energy harvesting jammer. IEEE Sens. J. 2023, 23, 3220–3232.
  14. Zhang, S.; Gu, H.; Chi, K.; Huang, L.; Yu, K.; Mumtaz, S. DRL-based partial offloading for maximizing sum computation rate of wireless powered mobile edge computing network. IEEE Trans. Wirel. Commun. 2022, 21, 10934–10948.
  15. Zhang, Y.; Han, W.; Li, D.; Zhang, P.; Cui, S. Power versus spectrum 2-D sensing in energy harvesting cognitive radio networks. IEEE Trans. Signal Process. 2015, 63, 6200–6212.
  16. Bae, Y.H.; Baek, J.W. Performance analysis of delay-constrained traffic in a cognitive radio network with RF energy harvesting. IEEE Commun. Lett. 2019, 23, 2177–2181.
  17. Pratibha, K.; Li, H.; Teh, K.C. Optimal spectrum access and energy supply for cognitive radio systems with opportunistic RF energy harvesting. IEEE Trans. Veh. Technol. 2017, 66, 7114–7122.
  18. Xu, M.; Jin, M.; Guo, Q.; Li, Y. Multichannel selection for cognitive radio networks with RF energy harvesting. IEEE Wirel. Commun. Lett. 2018, 7, 178–181.
  19. Celik, A.; Alsharoa, A.; Kamal, A.E. Hybrid energy harvesting-based cooperative spectrum sensing and access in heterogeneous cognitive radio networks. IEEE Trans. Cogn. Commun. Netw. 2017, 3, 37–48.
  20. Xu, C.; Zheng, M.; Liang, W.; Yu, H.; Liang, Y. End-to-end throughput maximization for underlay multi-hop cognitive radio networks with RF energy harvesting. IEEE Trans. Wirel. Commun. 2017, 16, 3561–3572.
  21. Perera, C.; Liu, C.H.; Jayawardena, S. The emerging Internet of Things marketplace from an industrial perspective: A survey. IEEE Trans. Emerg. Top. Comput. 2015, 3, 585–598.
  22. Khan, A.A.; Rehmani, M.H.; Rachedi, A. Cognitive-radio-based Internet of Things: Applications, architectures, spectrum related functionalities, and future research directions. IEEE Wirel. Commun. 2017, 24, 17–25.
  23. Khan, A.A.; Rehmani, M.H.; Rachedi, A. When cognitive radio meets the Internet of Things? In Proceedings of the 2016 International Wireless Communications and Mobile Computing Conference (IWCMC), Paphos, Cyprus, 5–9 September 2016; pp. 469–474.
  24. Zhu, B.; Bedeer, E.; Nguyen, H.H.; Barton, R.; Gao, Z. UAV trajectory planning for AoI-minimal data collection in UAV-aided IoT networks by transformer. IEEE Trans. Wirel. Commun. 2023, 22, 1343–1358.
  25. Zhang, G.; Lu, Y.; Lin, Y.; Zhong, Z.; Ding, Z.; Niyato, D. AoI minimization in RIS-aided SWIPT systems. IEEE Trans. Veh. Technol. 2024, 73, 2895–2900.
  26. Gao, X.; Zhu, X.; Zhai, L. AoI-sensitive data collection in multi-UAV-assisted wireless sensor networks. IEEE Trans. Wirel. Commun. 2023, 22, 5185–5197.
  27. Zhang, G.; Shen, C.; Shi, Q.; Ai, B.; Zhong, Z. AoI minimization for WSN data collection with periodic updating scheme. IEEE Trans. Wirel. Commun. 2023, 22, 32–46.
  28. Valehi, A.; Razi, A. Maximizing energy efficiency of cognitive wireless sensor networks with constrained age of information. IEEE Trans. Cogn. Commun. Netw. 2017, 3, 643–654.
  29. Gu, Y.; Chen, H.; Zhai, C.; Li, Y.; Vucetic, B. Minimizing age of information in cognitive radio-based IoT systems: Underlay or overlay? IEEE Internet Things J. 2019, 6, 10273–10288.
  30. Zhao, Y.; Zhou, B.; Saad, W.; Luo, X. Age of information analysis for dynamic spectrum sharing. In Proceedings of the 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Ottawa, ON, Canada, 11–14 November 2019; pp. 1–5.
  31. Leng, S.; Yener, A. Age of information minimization for an energy harvesting cognitive radio. IEEE Trans. Cogn. Commun. Netw. 2019, 5, 427–439.
  32. Leng, S.; Ni, X.; Yener, A. Age of information for wireless energy harvesting secondary users in cognitive radio networks. In Proceedings of the 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), Monterey, CA, USA, 4–7 November 2019; pp. 353–361.
  33. Wang, Q.; Chen, H.; Gu, Y.; Li, Y.; Vucetic, B. Minimizing the age of information of cognitive radio-based IoT systems under a collision constraint. IEEE Trans. Wirel. Commun. 2020, 19, 8054–8067.
  34. Agarwal, P.; Ojha, S.; Srivastava, V.; Prasad, B. STAR-RIS assisted overlay cognitive radio network using DRL. In Proceedings of the 2024 IEEE International Conference on Intelligent Signal Processing and Effective Communication Technologies (INSPECT), Gwalior, India, 7–8 December 2024; pp. 1–6.
  35. Jia, X.; Zheng, K.; Chi, K.; Liu, X. DDPG-based throughput optimization with AoI constraint in ambient backscatter-assisted overlay CRN. Sensors 2022, 22, 3262.
  36. López-Benítez, M.; Casadevall, F. Time-dimension models of spectrum usage for the analysis, design, and simulation of cognitive radio networks. IEEE Trans. Veh. Technol. 2013, 62, 2091–2104.
  37. López-Benítez, M.; Casadevall, F. Empirical time-dimension model of spectrum use based on a discrete-time Markov chain with deterministic and stochastic duty cycle models. IEEE Trans. Veh. Technol. 2011, 60, 2519–2533.
  38. Abd-Elmagid, M.A.; Dhillon, H.S.; Pappas, N. A reinforcement learning framework for optimizing age of information in RF-powered communication systems. IEEE Trans. Commun. 2020, 68, 4747–4760.
  39. Ho, C.K.; Zhang, R. Optimal energy allocation for wireless communications with energy harvesting constraints. IEEE Trans. Signal Process. 2012, 60, 4808–4818.
  40. Bertsekas, D.P. Dynamic Programming and Optimal Control; Athena Scientific: Belmont, MA, USA, 2005; Volumes 1–2.
  41. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  42. Miuccio, L.; Riolo, S.; Samarakoon, S.; Bennis, M.; Panno, D. On learning generalized wireless MAC communication protocols via a feasible multi-agent reinforcement learning framework. IEEE Trans. Mach. Learn. Commun. Netw. 2024, 2, 298–317.
  43. Miuccio, L.; Riolo, S.; Bennis, M.; Panno, D. Design of a feasible wireless MAC communication protocol via multi-agent reinforcement learning. In Proceedings of the 2024 IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), Stockholm, Sweden, 5–8 May 2024; pp. 94–100.
  44. Jin, W.; Sun, J.; Chi, K.; Zhang, B. Deep reinforcement learning based scheduling for minimizing age of information in wireless powered sensor networks. Comput. Commun. 2022, 191, 1–10.
Figure 1. System model for one SU. In each time slot, the SU can harvest energy from PU transmissions and deliver the status update data pack to the CBS when the channel is in an idle state.
Figure 2. System model for multiple SUs.
Figure 3. One sample path of AoI by the optimal policy.
Figure 4. The battery capacity versus AoI for T = 10.
Figure 5. The size of status update data packet versus AoI for T = 10 .
Figure 6. The transmit power of the PU versus AoI for T = 10 .
Figure 7. The average weighted sum-AoI and each SU's AoI versus the time slots.
Figure 8. The battery capacity versus average weighted sum-AoI for T = 10,000.
Figure 9. The status update data packet size versus average weighted sum-AoI for T = 10,000.
Figure 10. The transmit power of the PU versus average weighted sum-AoI for T = 10,000.
Table 1. Simulation parameter values.

Simulation Parameter      Value
W                         1 MHz
σ^2                       −95 dBm
η                         0.5
κ                         2
Υ                         0.2
τ_s                       0.2 s
A_max                     13
h_max                     10
g_max                     10
ϱ_0                       p_ii
b_max                     5
Table 2. State–action pairs per time slot.

t    a    b    g    h    ϱ      Action
0    3    2    3    0    0.8    (0,0)
1    4    4    2    7    0.5    (1,0)
2    5    4    6    7    0.5    (1,0)
3    6    4    7    4    0.5    (1,0)
4    7    4    6    5    0.5    (1,1)
5    1    2    3    2    0.8    (0,0)
6    2    2    6    7    0.8    (1,1)
7    1    0    7    7    0.8    (0,0)
8    2    4    3    7    0.5    (1,1)
9    1    2    3    7    0.8    (1,1)