Electronics
  • Article
  • Open Access

8 December 2025

Throughput Maximization in EH Symbiotic Radio System Based on LSTM-Attention-Driven DDPG

1. School of Information and Communication Engineering, Shanxi University of Electronic Science and Technology, Linfen 041000, China
2. School of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China
3. Shanxi Key Laboratory of Wireless Communication and Detection, Shanxi University, Taiyuan 030006, China
4. China Unicom Software Research Institute, Beijing 100176, China

Abstract

Massive Internet of Things (IoT) deployments face critical spectrum crowding and energy scarcity challenges. Energy harvesting (EH) symbiotic radio (SR), in which secondary devices share spectrum with and harvest energy from non-orthogonal multiple access (NOMA)-based primary systems, offers a sustainable solution. We consider long-term throughput maximization in an EH symbiotic radio (EHSR) network with a nonlinear EH model. To solve this non-convex problem, we design a two-layered optimization algorithm that combines convex optimization with a deep reinforcement learning (DRL) framework. The derived optimal power, time allocation factor, and time-varying environment state are fed into the proposed long short-term memory (LSTM)- and attention-enhanced Deep Deterministic Policy Gradient algorithm, named LAMDDPG, to achieve the optimal long-term throughput. Simulation results demonstrate that, by equipping the Actor with an LSTM to capture temporal state dependencies and enhancing the Critic with a channel-wise attention mechanism (a Squeeze-and-Excitation block) for precise Q-value evaluation, the LAMDDPG algorithm achieves a faster convergence rate and higher long-term throughput than the baseline algorithms. Moreover, we find the optimal number of primary devices (PDs) that maintains efficient network performance under the nonlinear power model (NLPM), which is highly significant for guiding practical EHSR applications.

1. Introduction

As wireless communication and sensing technologies have advanced rapidly, the Internet of Things (IoT) is increasingly being integrated into various vertical domains, such as smart transportation and smart cities [1]. A surge in the number of IoT devices connected to networks has led to severe spectrum congestion. Given the limited availability of spectrum resources, providing low-latency and high-reliability services to this massive number of IoT devices poses a significant and growing challenge [2].
Non-orthogonal multiple access (NOMA) has emerged as a core technology for accommodating massive connections in IoT networks [3]. By adopting superposition coding at the transmitter and successive interference cancellation (SIC) at the receiver, NOMA allows multiple devices to share the same spectrum simultaneously [4]. The SIC process decodes signals sequentially, starting with the device with the lower channel gain: once a signal is decoded, it is subtracted from the received signal to iteratively eliminate co-channel interference and improve the quality of service (QoS) for all connected devices [5]. To further enhance spectrum utilization, cognitive radio (CR) technology has been integrated with NOMA, enabling secondary users (SUs), or unlicensed devices, to access the spectrum allocated to primary users (PUs), or licensed devices [6]. Various CR-NOMA frameworks, such as those based on secondary NOMA relays [7], cooperative frameworks under imperfect SIC [8], and fair power allocation [9], have been proposed to enhance spectrum efficiency. The CR-NOMA framework can also improve the throughput of the primary devices [10] and of the total system [11]. Furthermore, the CR-NOMA architecture can enhance energy efficiency in energy-constrained IoT networks; for instance, transmit power allocation schemes based on an improved artificial fish optimization algorithm [12] and on non-cooperative game theory [13] improve energy efficiency significantly. Despite these advances in energy efficiency, the large-scale deployment of IoT applications is still limited by energy constraints: battery-powered devices create notable energy bottlenecks, presenting a key challenge for practical implementation [14].
Energy harvesting (EH) has emerged as a transformative technology for addressing the energy constraints faced by IoT devices. Beyond conventional sources such as wind and solar power, EH enables devices to draw energy from ambient sources such as radio frequency (RF) signals, thereby offering a sustainable approach to powering IoT networks. Establishing an EH-enabled CR-NOMA IoT network, in which symbiotic secondary devices (SDs) share the spectrum of primary devices (PDs) and harvest energy from the signals emitted by PDs, can greatly enhance the spectrum efficiency and sustainability of IoT applications.
The randomness of harvesting energy from the wireless environment complicates the acquisition of precise channel state information (CSI). Deep reinforcement learning (DRL) algorithms offer solutions by providing optimal spectrum and power allocation in such networks. However, the inherent nonlinearity of practical energy harvesting circuits, particularly those characterized by piecewise functions, makes DRL-based spectrum access and power allocation for long-term throughput enhancement challenging.
In this study, we explore an EH symbiotic radio (EHSR) IoT network in which a symbiotic SD is permitted to access the spectrum assigned to the PDs through CR and to harvest energy from the signals emitted by the PDs under a nonlinear power model (NLPM). We first determine the optimal power and time allocation factor for the symbiotic SD using convex optimization. We then design a long short-term memory (LSTM)- and attention-enhanced Deep Deterministic Policy Gradient (DDPG) algorithm, named LAMDDPG, to solve the long-term throughput maximization problem for the SD. Moreover, we find the optimal number of PDs that maintains efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications.

3. System Model

This study investigates an uplink communication scenario in an energy-constrained CR-NOMA IoT network. As illustrated in Figure 1, the network comprises a base station (BS), $M$ primary devices (PDs), indexed by $m$ ($1 \le m \le M$), and an EHSR. The PDs transmit data in distinct time slots following a time-division multiple access (TDMA) scheduling protocol; specifically, the $m$th PD is assigned the $m$th time slot to guarantee its data transmission priority. As depicted in Figure 2, the EHSR, denoted as $S_{e0}$, achieves non-orthogonal multiplexing by sharing one of the PDs' time slots to send information to the BS while simultaneously harvesting energy, based on the underlay mode [40]. The BS utilizes SIC, aided by known CSI that includes both large-scale and multipath fading: it first decodes the signal of $S_{e0}$ and subtracts it from the received mixed signal before decoding the PD's signal. During a given time slot $t_k$, $PD_k$ commences data transmission, where the PD index cycles through the $M$ devices as $((k-1) \bmod M) + 1$. $h_k$ denotes the channel gain between the BS and $PD_k$, $h_{k,0}$ the channel gain between $PD_k$ and $S_{e0}$, and $h_{0,k}$ the channel gain between $S_{e0}$ and the BS. These channel gains encompass both fading effects, and the channel state information is assumed to be known. $S_{e0}$ uses $\beta_k T$ seconds for data transmission, where $\beta_k$ ($0 \le \beta_k \le 1$) is the time allocation factor and $T$ is the duration of each time slot; the remaining $(1-\beta_k)T$ seconds are used for energy harvesting. We summarize the main notations of this paper in Table 1.
Figure 1. EH symbiotic radio system.
Figure 2. Data transmission mode.
Table 1. Summary of main notations.
In this paper, we employ a piece-wise linear function-based NLPM [41] to calculate the harvested energy during $(1-\beta_k)T$:
$$E_{(1-\beta_k)T} = \begin{cases} (1-\beta_k)T\,\eta P_k |h_{k,0}|^2, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ (1-\beta_k)T\,P_{th}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases} \tag{1}$$
where $P_k$ is the transmit power of $PD_k$, $P_{th}$ denotes the energy harvesting threshold of the EH circuit, and $\eta$ is the energy harvesting efficiency.
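As a concrete illustration of Formula (1), the following minimal Python sketch evaluates the harvested energy under the piece-wise linear NLPM; the function name and the numerical values in the example are illustrative assumptions, not values from this paper.

```python
def harvested_energy(beta_k, T, eta, P_k, h_k0_sq, P_th):
    """Energy harvested during the (1 - beta_k)*T sub-slot under the
    piece-wise linear NLPM of Formula (1); h_k0_sq stands for |h_{k,0}|^2."""
    input_power = eta * P_k * h_k0_sq                     # linear-region harvested power
    return (1.0 - beta_k) * T * min(input_power, P_th)    # saturates at the threshold P_th

# Illustrative values: eta = 0.7, P_k = 1 W, |h_{k,0}|^2 = 0.5, P_th = 0.2 W, T = 1 s
print(harvested_energy(beta_k=0.4, T=1.0, eta=0.7, P_k=1.0, h_k0_sq=0.5, P_th=0.2))
```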
Let $E_k$ denote the residual energy in the battery of $S_{e0}$ at the beginning of time slot $t_k$. After energy harvesting, the energy of $S_{e0}$ at the subsequent time slot $t_{k+1}$ can be formulated as:
$$E_{k+1} = \min\left\{E_{\max},\ E_{(1-\beta_k)T} - \beta_k T P_{0,k} + E_k\right\} \tag{2}$$
Here, $E_{\max}$ is the maximum battery capacity of $S_{e0}$, $\beta_k T P_{0,k}$ denotes the energy consumed by $S_{e0}$ for data transmission in time slot $t_k$, and $P_{0,k}$ is the transmit power of $S_{e0}$; the energy consumed for transmission cannot exceed $E_k$. Based on Formula (2), the transmit power of $S_{e0}$ is dynamically constrained by its harvested energy and $E_{\max}$. This energy constraint inherently limits the interference from $S_{e0}$ to the co-channel PD, which guarantees the quality of service (QoS) of the PD.
In time slot $t_k$, the achievable data rate of $S_{e0}$ is:
$$R_k = \beta_k \ln\left(1 + \frac{P_{0,k}\,|h_{0,k}|^2}{1 + P_k |h_k|^2}\right) \tag{3}$$
The BS first decodes the signal of $S_{e0}$ and then removes it from the received signal to decode the signal of $PD_k$ using SIC. Our primary objective is to maximize the long-term throughput of $S_{e0}$. Essentially, this involves a decision-making process in which $S_{e0}$ determines which PD's slot to access at each time slot $t_k$ and optimizes its own power and time allocation coefficients to achieve the highest possible throughput over an extended period. The problem can be mathematically formulated as:
$$\begin{aligned}
\mathbf{P1}:\ & \max_{P_{0,k},\,\beta_k}\ \mathbb{E}\left[\sum_{k=1}^{\infty}\gamma^{k-1}R_k\right] \\
\text{s.t.}\ & E_{k+1} = \min\left\{E_{\max},\ E_{(1-\beta_k)T} - \beta_k T P_{0,k} + E_k\right\} && \text{(P1a)} \\
& \beta_k T P_{0,k} \le E_k && \text{(P1b)} \\
& 0 \le \beta_k \le 1 && \text{(P1c)} \\
& 0 \le P_{0,k} \le P_{\max} && \text{(P1d)}
\end{aligned}$$
where $P_{\max}$ denotes the maximum transmit power of $S_{e0}$, and $\gamma$ ($0 \le \gamma \le 1$) is the discount rate, which trades off short-term against long-term rewards. Constraint (P1b) ensures that the energy used for data transmission by $S_{e0}$ does not exceed its remaining energy at time slot $t_k$. The parameters $\beta_k$ and $P_{0,k}$ are coupled and have distinct value ranges; this coupling makes it challenging to use them directly as DRL actions, as doing so can result in unstable training.
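To make the coupling between $\beta_k$ and $P_{0,k}$ concrete, the short sketch below evaluates the per-slot rate of Formula (3) and the battery update of Formula (2) for a candidate action; all variable names and default values are illustrative assumptions.

```python
import math

def step_rate_and_battery(beta_k, P0_k, E_k, E_harv, P_k, h_k_sq, h_0k_sq,
                          T=1.0, E_max=10.0):
    """Per-slot rate (Formula (3)) and battery evolution (Formula (2)) of S_e0.
    E_harv is the energy harvested in the (1 - beta_k)*T sub-slot."""
    assert beta_k * T * P0_k <= E_k + 1e-12, "violates constraint (P1b)"
    # PD_k's signal acts as interference because the BS decodes S_e0's signal first.
    R_k = beta_k * math.log(1.0 + P0_k * h_0k_sq / (1.0 + P_k * h_k_sq))
    # Residual battery energy at the next slot, clipped to the capacity E_max.
    E_next = min(E_max, E_harv - beta_k * T * P0_k + E_k)
    return R_k, E_next

# Placeholder values for a single slot.
print(step_rate_and_battery(beta_k=0.4, P0_k=0.3, E_k=0.5, E_harv=0.12,
                            P_k=1.0, h_k_sq=1e-3, h_0k_sq=0.3))
```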

4. Problem Formulation and Decomposition

4.1. Problem Transformation

Optimization problem P1 is solved via a layered optimization approach. In this section, we design a logic flow diagram to analyze this process (Figure 3).
Figure 3. Logic flow diagram for the two-layered optimization process.
We introduce a variable $E_k^{\mathrm{sur}}$ to represent the energy surplus. If $E_k^{\mathrm{sur}} \ge 0$, the harvested energy can cover the energy consumed during data transmission; this condition helps extend the operational lifespan of the network, which is a key objective of the system.
$$E_k^{\mathrm{sur}} = E_{(1-\beta_k)T} - \beta_k T P_{0,k} \tag{4}$$
$\beta_k$ and $P_{0,k}$ can then be expressed in terms of $E_k^{\mathrm{sur}}$, and problem P1 can be transformed into problem P2.
$$\mathbf{P2}:\ \max_{\pi(E_k^{\mathrm{sur}})}\ \mathbb{E}\left[\sum_{k=1}^{\infty}\gamma^{k-1}R_k\right] \quad \text{s.t.}\quad E_{k+1} = \min\left\{E_{\max},\ E_k^{\mathrm{sur}} + E_k\right\}$$
In essence, maximizing the long-term throughput of $S_{e0}$ involves optimizing its $E_k^{\mathrm{sur}}$ and $\beta_k$. For a given $E_k^{\mathrm{sur}}$, we can obtain:
$$R_k = \sup\left\{R_k \,\middle|\, E_{(1-\beta_k)T} - \beta_k T P_{0,k} = E_k^{\mathrm{sur}},\ \text{(P1b)},\ \text{(P1c)},\ \text{(P1d)}\right\} \tag{5}$$
To maximize the long-term throughput, we can focus on maximizing $R_k$ at each time slot. Specifically, the throughput at each time slot can be optimized by adjusting $\beta_k$ and $P_{0,k}$, as follows:
$$\begin{aligned}
\mathbf{P3}:\ & \max_{P_{0,k},\,\beta_k}\ R_k \\
\text{s.t.}\ & E_{(1-\beta_k)T} - \beta_k T P_{0,k} = E_k^{\mathrm{sur}} && \text{(P3a)} \\
& \text{(P1b)},\ \text{(P1c)},\ \text{(P1d)}
\end{aligned}$$
When $E_k^{\mathrm{sur}}$ is given, we can determine the optimal values of $\beta_k$ and $P_{0,k}$, denoted $\beta_k^{*}(E_k^{\mathrm{sur}})$ and $P_{0,k}^{*}(E_k^{\mathrm{sur}})$, respectively. By substituting these optimal values into P2, we can transform P2 into P4:
$$\mathbf{P4}:\ \max_{\pi(E_k^{\mathrm{sur}})}\ \mathbb{E}\left[\sum_{k=1}^{\infty}\gamma^{k-1} R_k\!\left(\beta_k^{*}(E_k^{\mathrm{sur}}),\ P_{0,k}^{*}(E_k^{\mathrm{sur}})\right)\right] \quad \text{s.t.}\quad E_{k+1} = \min\left\{E_{\max},\ E_k^{\mathrm{sur}} + E_k\right\}$$
Thus, problem P1 is decomposed into sub-problems P3 and P4, both of which are closely related to $E_k^{\mathrm{sur}}$. For a given $E_k^{\mathrm{sur}}$, we first solve P3 using convex optimization techniques [42]. Subsequently, P4 is solved by the LAMDDPG method.

4.2. Solving Problem P3

Because $E_{(1-\beta_k)T} - \beta_k T P_{0,k} = E_k^{\mathrm{sur}}$ is not affine and constraint (P1b) involves the optimization variable $\beta_k$, P3 is a non-convex problem. As indicated in [41], P3 can be reformulated as follows:
$$\mathbf{P5}:\ \max_{\beta_k}\ f_0(\beta_k) \quad \text{s.t.}\quad 0 \le \beta_k \le 1$$
where $f_0(\beta_k) = \sup\left\{R_k \mid E_{(1-\beta_k)T} - \beta_k T P_{0,k} - E_k^{\mathrm{sur}} = 0,\ \text{(P1b)},\ \text{(P1d)}\right\}$.
From $E_{(1-\beta_k)T} - \beta_k T P_{0,k} - E_k^{\mathrm{sur}} = 0$, we can derive an expression for $P_{0,k}$ as follows:
$$P_{0,k} = \begin{cases} \dfrac{(1-\beta_k)T\,\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}}}{\beta_k T}, & \eta P_k |h_{k,0}|^2 \le P_{th} \\[2ex] \dfrac{(1-\beta_k)T\,P_{th} - E_k^{\mathrm{sur}}}{\beta_k T}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases} \tag{6}$$
According to constraint (P1b), we can obtain:
$$P_{0,k} \le \frac{E_k}{\beta_k T} \tag{7}$$
Thus, finding $f_0(\beta_k)$ is equivalent to solving the optimization problem:
$$\begin{aligned}
\mathbf{P6}:\ & \max_{P_{0,k}}\ R_k \\
\text{s.t.}\ & P_{0,k} = \frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T} && \text{(P6a)} \\
& P_{0,k} \le \frac{E_k}{\beta_k T} && \text{(P6b)} \\
& 0 \le P_{0,k} \le P_{\max} && \text{(P6c)}
\end{aligned}$$
Problem P6 is a function of $P_{0,k}$ with $\beta_k$ fixed. Given constraint (P6a), $f_0(\beta_k)$ can be expressed as:
$$f_0(\beta_k) = \beta_k \ln\left(1 + \frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2}\right) \tag{8}$$
Constraints (P6b) and (P6c) can be satisfied by specifying the domain of $f_0(\beta_k)$ as:
$$\mathcal{D} = \left\{\beta_k \,\middle|\, 0 \le \frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T} \le \min\left\{P_{\max},\ \frac{E_k}{\beta_k T}\right\}\right\} \tag{9}$$
By Formulas (8) and (9), when $\eta P_k |h_{k,0}|^2 \le P_{th}$, P6 can be written as:
$$\begin{aligned}
\mathbf{P7}:\ & \max_{\beta_k}\ \beta_k \ln\left(1 + \frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2}\right) \\
\text{s.t.}\ & \beta_k \le 1 - \frac{E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2} && \text{(P7a)} \\
& \beta_k \ge 1 - \frac{E_k + E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2} && \text{(P7b)} \\
& \beta_k \ge \frac{T\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2 + T P_{\max}} && \text{(P7c)} \\
& 0 \le \beta_k \le 1 && \text{(P7d)}
\end{aligned}$$
When $\eta P_k |h_{k,0}|^2 > P_{th}$, P6 can be written as:
$$\begin{aligned}
\mathbf{P8}:\ & \max_{\beta_k}\ \beta_k \ln\left(1 + \frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2}\right) \\
\text{s.t.}\ & \beta_k \le 1 - \frac{E_k^{\mathrm{sur}}}{P_{th} T} && \text{(P8a)} \\
& \beta_k \ge 1 - \frac{E_k + E_k^{\mathrm{sur}}}{P_{th} T} && \text{(P8b)} \\
& \beta_k \ge \frac{P_{th} T - E_k^{\mathrm{sur}}}{P_{th} T + P_{\max} T} && \text{(P8c)} \\
& 0 \le \beta_k \le 1 && \text{(P8d)}
\end{aligned}$$
Both P7 and P8 involve three lower bounds and two upper bounds on $\beta_k$. Moreover, the objective functions in P7 and P8 impose additional restrictions on the selection of $\beta_k$. It is therefore crucial to demonstrate the feasibility of the optimization problems P7 and P8; the proof is provided in Appendix A. In addition, we prove in Appendix B that the objectives of P7 and P8 are concave functions of $\beta_k$ ($\beta_k > 0$).
The substantial number of inequality constraints in problems P7 and P8 complicates the process of obtaining an optimal solution $\beta_k^{*}(E_k^{\mathrm{sur}})$. Moreover, the first-order derivative of the objective function takes the form of the Lambert W function with two branches [27]. Following the steps outlined in [27], if $E_k^{\mathrm{sur}}$ satisfies
$$E_k^{\mathrm{sur}} < \begin{cases} T\eta P_k |h_{k,0}|^2, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ P_{th}T, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
we can obtain a closed form for $\beta_k^{*}(E_k^{\mathrm{sur}})$ as follows (otherwise $\beta_k^{*}(E_k^{\mathrm{sur}}) = 0$):
$$\beta_k^{*}(E_k^{\mathrm{sur}}) = \begin{cases} \min\left\{1,\ \max\left\{\dfrac{\lambda_1-\lambda_2}{e^{W_0\left(e^{-1}(\lambda_1-1)\right)+1}-1+\lambda_1},\ \zeta_0\right\}\right\}, & \eta P_k |h_{k,0}|^2 \le P_{th} \\[3ex] \min\left\{1,\ \max\left\{\dfrac{\lambda_3-\lambda_2}{e^{W_0\left(e^{-1}(\lambda_3-1)\right)+1}-1+\lambda_3},\ \zeta_1\right\}\right\}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
where $W_0(\cdot)$ denotes the principal branch of the Lambert W function [43], $\lambda_1 = \frac{\eta P_k |h_{k,0}|^2\,|h_{0,k}|^2}{1 + P_k |h_k|^2}$, $\lambda_2 = \frac{E_k^{\mathrm{sur}}\,|h_{0,k}|^2}{T(1 + P_k |h_k|^2)}$, $\lambda_3 = \frac{P_{th}\,|h_{0,k}|^2}{1 + P_k |h_k|^2}$, $\zeta_0 = \max\left\{1 - \frac{E_k + E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2},\ \frac{T\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2 + T P_{\max}}\right\}$, and $\zeta_1 = \max\left\{1 - \frac{E_k + E_k^{\mathrm{sur}}}{P_{th}T},\ \frac{P_{th}T - E_k^{\mathrm{sur}}}{P_{th}T + P_{\max}T}\right\}$.
Based on $\beta_k^{*}(E_k^{\mathrm{sur}})$, the optimal power allocation is obtained as:
$$P_{0,k}^{*}(E_k^{\mathrm{sur}}) = \begin{cases} \dfrac{(1-\beta_k^{*})T\,\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}}}{\beta_k^{*} T}, & \eta P_k |h_{k,0}|^2 \le P_{th} \\[2ex] \dfrac{(1-\beta_k^{*})T\,P_{th} - E_k^{\mathrm{sur}}}{\beta_k^{*} T}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
If $E_k^{\mathrm{sur}}$ satisfies
$$E_k^{\mathrm{sur}} = \begin{cases} T\eta P_k |h_{k,0}|^2, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ P_{th}T, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
then $S_{e0}$ harvests energy for the whole time slot, there is no data transmission, and there is no need to specify a value for $P_{0,k}^{*}(E_k^{\mathrm{sur}})$.
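As a sanity check on the closed-form solution above, one can also maximize the concave objective $f_0(\beta_k)$ of P7/P8 numerically for a given surplus $E_k^{\mathrm{sur}}$. The sketch below uses a simple grid search rather than the Lambert-W expression; the grid resolution and all default parameter values are assumptions made purely for illustration.

```python
import numpy as np

def solve_p3_numeric(E_sur, E_k, P_k, h_k_sq, h_0k_sq, h_k0_sq,
                     T=1.0, eta=0.7, P_th=0.2, P_max=1.0, grid=10000):
    """Grid search over beta_k for problem P3 with a given surplus E_sur.
    The NLPM harvested power saturates at P_th; all defaults are illustrative."""
    harvest_power = min(eta * P_k * h_k0_sq, P_th)
    betas = np.linspace(1e-4, 1.0, grid)
    # Transmit power implied by (P3a): harvested energy minus surplus, spread over beta*T.
    P0 = ((1.0 - betas) * T * harvest_power - E_sur) / (betas * T)
    feasible = (P0 >= 0.0) & (P0 <= np.minimum(P_max, E_k / (betas * T)))   # (P1b), (P1d)
    if not feasible.any():
        return 0.0, 0.0                         # fall back to pure harvesting
    rate = np.full_like(betas, -np.inf)
    rate[feasible] = betas[feasible] * np.log(
        1.0 + P0[feasible] * h_0k_sq / (1.0 + P_k * h_k_sq))
    i = int(np.argmax(rate))
    return float(betas[i]), float(P0[i])

# Example call with placeholder channel gains and a small positive surplus.
print(solve_p3_numeric(E_sur=0.01, E_k=0.5, P_k=1.0, h_k_sq=1e-3, h_0k_sq=0.3, h_k0_sq=0.5))
```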

4.3. Solving Problem P4

With $\beta_k^{*}(E_k^{\mathrm{sur}})$ and $P_{0,k}^{*}(E_k^{\mathrm{sur}})$ obtained, the long-term throughput maximization problem is reformulated as:
$$\mathbf{P9}:\ \max_{\pi(E_k^{\mathrm{sur}})}\ \mathbb{E}\left[\sum_{k=1}^{\infty}\gamma^{k-1}\,\beta_k^{*}(E_k^{\mathrm{sur}})\ln\left(1 + \frac{P_{0,k}^{*}(E_k^{\mathrm{sur}})\,|h_{0,k}|^2}{1 + P_k |h_k|^2}\right)\right] \quad \text{s.t.}\quad E_{k+1} = \min\left\{E_{\max},\ E_k^{\mathrm{sur}} + E_k\right\} \quad \text{(P9a)}$$
This problem is a function of $E_k^{\mathrm{sur}}$ and can be optimized with the LAMDDPG method.

5. Proposed LAMDDPG Algorithm

5.1. Framework of LAMDDPG Algorithm

The long-term throughput decision problem P9 is modeled as a Markov Decision Process (MDP) and solved within the LAMDDPG framework. Here, $S_{e0}$ acts as the agent, and the LAMDDPG framework is employed to identify a sequence of decisions that maximizes the long-term expected cumulative discounted reward.
As shown in Figure 4, the LAMDDPG framework contains an Actor network $\omega^{\mu}$, an Actor target network $\omega^{\mu'}$, a Critic network $\omega^{Q}$, a Critic target network $\omega^{Q'}$, and an experience replay memory. LSTM layers are integrated after the input layer of the Actor network and the Actor target network, and attention layers are adopted after the hidden layers of the Critic network and the Critic target network.
Figure 4. The framework of LAMDDPG. Black solid lines denote the agent, environment, and network components; black dashed lines illustrate the network structure of the proposed LAMDDPG algorithm; blue solid lines depict the training phase; and blue dashed lines illustrate the execution phase of the algorithm.
The current state $s$ from the environment is input to the Actor and Actor target networks. Considering the temporal correlation of the residual energy and channel state information in this scenario, LSTM layers are introduced to capture dynamic dependencies: they learn from past observations, adjusting their weights and biases to predict the real-time state. Their outputs are fed into the hidden layers of the Actor network $\omega^{\mu}$ and the Actor target network $\omega^{\mu'}$, which output the action $a(s) = \mu(s \mid \omega^{\mu})$ and the target action $\mu(s' \mid \omega^{\mu'})$, respectively. To enhance the exploration capability of the Actor network, noise is added to its output, so the action actually chosen is $a(s) = \mu(s \mid \omega^{\mu}) + n$, where $n$ is Gaussian noise.
Moreover, attention mechanisms are integrated into the Critic network and Critic target network. The feature map $F$ from the hidden layers is input into the attention module. Through squeezing and excitation, the features of different channels are assigned distinct channel weights; multiplying $F$ by these weights adaptively enhances $F$, so the Critic network can dynamically focus on the features most relevant to the agent's current decision, thereby improving decision accuracy. The Critic network $\omega^{Q}$ outputs the state–action value function $Q(s, a \mid \omega^{Q})$ for the current state and the action produced by the Actor network, and the Critic target network $\omega^{Q'}$ outputs $Q(s', a' \mid \omega^{Q'})$. The chosen actions are applied to the environment to obtain a reward $r$, after which the environment transitions to a new state $s'$; the experience $(s, a, r, s')$ is then stored in the replay memory.
The state space, action space, and reward function are defined as follows.
State space: In this CR-NOMA-enabled EHSR network, the state space comprises the channel state information associated with the PDs and $S_{e0}$, together with the residual energy of $S_{e0}$. The system state at time slot $t_k$ is denoted as:
$$s_k = \left(h_{0,k},\ h_k,\ h_{k,0},\ E_k\right)$$
Action space: The agent selects between transmitting data and harvesting energy according to the current system state:
$$a_k = E_k^{\mathrm{sur}}$$
Considering the extreme situation in which $S_{e0}$ only transmits data for the whole time slot $t_k$, the lower bound is:
$$E_k^{\mathrm{sur}} \ge -\min\left\{E_k,\ T P_{\max}\right\}$$
On the other hand, if $S_{e0}$ only harvests energy for the whole time slot $t_k$, the upper bound is:
$$E_k^{\mathrm{sur}} \le \begin{cases} \min\left\{E_{\max}-E_k,\ T\eta P_k |h_{k,0}|^2\right\}, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ \min\left\{E_{\max}-E_k,\ P_{th}T\right\}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
Thus, the range of the action value is very large, which destabilizes training. The value of $E_k^{\mathrm{sur}}$ is therefore normalized as:
$$E_k^{\mathrm{sur}} = -(1-\varepsilon_k)\min\left\{E_k,\ T P_{\max}\right\} + \varepsilon_k \begin{cases} \min\left\{E_{\max}-E_k,\ T\eta P_k |h_{k,0}|^2\right\}, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ \min\left\{E_{\max}-E_k,\ P_{th}T\right\}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
where $\varepsilon_k \in [0, 1]$; hence, the action parameter is constrained to a suitable range, improving the stability of the networks.
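A minimal sketch of this normalization: the agent outputs $\varepsilon_k \in [0,1]$, which is mapped back to an energy surplus between the transmit-only and harvest-only extremes. The function and parameter names, as well as the default values, are illustrative assumptions.

```python
def action_to_surplus(eps_k, E_k, E_max, P_k, h_k0_sq,
                      T=1.0, eta=0.7, P_th=0.2, P_max=1.0):
    """Map a normalized action eps_k in [0, 1] to the energy surplus E_k^sur."""
    lower = -min(E_k, T * P_max)                                    # transmit-only extreme
    upper = min(E_max - E_k, T * min(eta * P_k * h_k0_sq, P_th))    # harvest-only extreme (NLPM)
    return (1.0 - eps_k) * lower + eps_k * upper

# eps_k = 0 favors data transmission, eps_k = 1 favors energy harvesting.
print(action_to_surplus(eps_k=0.5, E_k=0.5, E_max=10.0, P_k=1.0, h_k0_sq=0.5))
```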
Reward function: When the agent selects an action at time slot $t_k$, it receives a corresponding reward, set as
$$r_k = R_k$$
where $R_k$ is the achievable data rate at $t_k$.
The LAMDDPG algorithm employs centralized training with distributed execution. During the training phase, a batch of experiences $\mathcal{B} = \{(s_k, a_k, r_k, s_{k+1})\}_{k=1}^{N}$ is sampled from the replay memory, where $N$ is the batch size.
The predicted action $a_{k+1} = \mu(s_{k+1} \mid \omega^{\mu'})$ produced by the Actor target network is fed into the Critic target network, which computes the target value $y_k$:
$$y_k = r_k + \gamma\, Q\!\left(s_{k+1}, a_{k+1} \mid \omega^{Q'}\right)$$
$s_k$ and $a_k = \mu(s_k)$ are fed into the Critic network, which computes the corresponding Q-value $Q(s_k, a_k \mid \omega^{Q})$. The parameters of the Critic network are updated by gradient descent on the loss function, defined as the squared difference between the target value $y_k$ and the predicted $Q(s_k, a_k \mid \omega^{Q})$, which is essentially the Bellman-equation error:
$$Loss(\omega^{Q}) = \mathbb{E}\!\left[\left(Q(s_k, a_k \mid \omega^{Q}) - y_k\right)^2\right]$$
When updating the parameters of the Actor network, gradient ascent is employed according to:
$$\nabla_{\omega^{\mu}} J \approx \mathbb{E}\!\left[\nabla_{\omega^{\mu}} Q(s, a \mid \omega^{Q})\big|_{s=s_k,\ a=\mu(s_k \mid \omega^{\mu})}\right] = \mathbb{E}\!\left[\nabla_{a} Q(s, a \mid \omega^{Q})\big|_{s=s_k,\ a=\mu(s_k)}\ \nabla_{\omega^{\mu}}\mu(s \mid \omega^{\mu})\big|_{s=s_k}\right]$$
The parameters of both target networks are updated using a soft update method, described as follows:
$$\omega^{\mu'} \leftarrow \tau\,\omega^{\mu} + (1-\tau)\,\omega^{\mu'}$$
$$\omega^{Q'} \leftarrow \tau\,\omega^{Q} + (1-\tau)\,\omega^{Q'}$$
where τ is the update coefficient.
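The training equations above can be written compactly in code. The sketch below is a generic DDPG-style update (Bellman-error critic loss, deterministic policy gradient for the Actor, and Polyak soft updates of the targets); the use of PyTorch, the tensor shapes, and the value tau = 0.005 are assumptions made for illustration, not settings reported by the paper.

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak soft update of the target parameters (the two update rules above)."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau)
        p_t.data.add_(p.data, alpha=tau)

def lamddpg_update(actor, critic, actor_t, critic_t, batch, gamma,
                   actor_opt, critic_opt, tau=0.005):
    """One update over a sampled batch (s, a, r, s'); shapes must match the networks used."""
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))   # target value y_k
    critic_loss = ((critic(s, a) - y) ** 2).mean()          # Bellman-error loss
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()                # ascend Q(s, mu(s))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    soft_update(actor_t, actor, tau)                        # target Actor
    soft_update(critic_t, critic, tau)                      # target Critic
```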

5.2. Neural Network Architectures and Training Parameters

In this subsection, we present the neural network architecture of the proposed LAMDDPG algorithm. In the LAMDDPG algorithm, both the Actor network and Actor target network contain an input layer, an LSTM layer, two hidden layers, and an output layer. Both the Critic network and Critic target network consist of an input layer, two hidden layers, an attention module (based on the Squeeze-and-Excitation module), and an output layer. Both the Actor and Critic networks adopt the Adam optimizer.
The input layer of the Actor (target) network takes the state s as input, performing input reshaping to adapt to the input format required by the LSTM layer. The LSTM layer contains 64 units and outputs a vector with a dimension of (32, 64), which is fed into the first hidden layer. After ReLU activation, the first hidden layer outputs a vector of (32, 64), which is then passed to the second hidden layer. Following tanh activation, the second hidden layer outputs a (32, 64) vector that is fed into the output layer. After tanh activation, the output action is scaled to the range between the maximum and minimum magnitudes of the action (target action).
The Critic (target) network takes the state s ( s ) and action a ( a ) as inputs. These inputs undergo feature fusion in the first hidden layer, and after ReLU activation, a vector with a dimension of (32, 64) is output. This vector is fed into the second hidden layer, which outputs a feature vector of dimension (32, 64) following ReLU activation. Subsequently, the vector enters the attention module: first, average pooling is performed on the feature dimension to complete the squeezing operation; then, it passes through two fully connected layers, activated by ReLU and Sigmoid, respectively, to generate channel-wise attention weights. The vectors are then multiplied by the attention weights to obtain an output vector of (32, 64). Finally, the action value function Q ( Q ) is output after passing through the output layer.
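One possible PyTorch realization of the described networks is sketched below: an Actor whose input passes through an LSTM before two hidden layers, and a Critic whose hidden features are re-weighted by a Squeeze-and-Excitation style gate. The 64-unit layer widths follow the text; the class names, reduction ratio, and sequence handling are illustrative assumptions, and for a flat 64-dimensional feature vector the "squeeze" pooling degenerates to the identity.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Actor with an LSTM layer after the input, followed by two hidden layers."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, action_dim)

    def forward(self, s):                       # s: (batch, seq_len, state_dim)
        h, _ = self.lstm(s)
        h = torch.relu(self.fc1(h[:, -1, :]))   # last time step summarizes the history
        h = torch.tanh(self.fc2(h))
        return torch.tanh(self.out(h))          # action scaled to [-1, 1], rescaled outside

class SECritic(nn.Module):
    """Critic whose hidden features pass through a Squeeze-and-Excitation style gate."""
    def __init__(self, state_dim, action_dim, hidden=64, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.se = nn.Sequential(nn.Linear(hidden, hidden // reduction), nn.ReLU(),
                                nn.Linear(hidden // reduction, hidden), nn.Sigmoid())
        self.out = nn.Linear(hidden, 1)

    def forward(self, s, a):
        f = torch.relu(self.fc1(torch.cat([s, a], dim=-1)))
        f = torch.relu(self.fc2(f))
        w = self.se(f)                           # channel-wise attention weights
        return self.out(f * w)                   # re-weighted features -> Q(s, a)
```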
The complete procedure is summarized in Algorithm 1.
Algorithm 1 LAMDDPG algorithm
Input: Environment, settings of $S_{e0}$ and the PDs
Output: Parameters $\omega^{\mu}$, $\omega^{\mu'}$, $\omega^{Q}$, $\omega^{Q'}$
Initialize the system parameters, the Actor network, and the Critic network
Initialize the target network weight parameters
Initialize the experience replay memory
1: For episode = 1 to $n_{ep}$ do
2:   Initialize the noise $n$, the large-scale fading, and the small-scale random fading
3:   Obtain the initial state $s_1$
4:   For $k$ = 1 to $T$ do
5:     Select $a_k = \mu(s_k \mid \omega^{\mu}) + n$
6:     Execute action $a_k$, receive the reward $r_k$ and the next state $s_{k+1}$, and store the tuple $(s_k, a_k, r_k, s_{k+1})$ in the experience replay memory
7:     Randomly sample a batch of experiences $\mathcal{B} = \{(s_k, a_k, r_k, s_{k+1})\}_{k=1}^{N}$ from the replay memory
8:     Set $y_k = r_k + \gamma Q(s_{k+1}, a_{k+1} \mid \omega^{Q'})$
9:     Update the Critic network by minimizing the loss $Loss(\omega^{Q}) = \mathbb{E}[(Q(s_k, a_k \mid \omega^{Q}) - y_k)^2]$
10:    Update the Actor network with the sampled policy gradient $\nabla_{\omega^{\mu}} J = \mathbb{E}[\nabla_a Q(s, a \mid \omega^{Q})|_{s=s_k,\ a=\mu(s_k)}\ \nabla_{\omega^{\mu}}\mu(s \mid \omega^{\mu})|_{s=s_k}]$
11:    Update the target networks: $\omega^{\mu'} \leftarrow \tau\omega^{\mu} + (1-\tau)\omega^{\mu'}$, $\omega^{Q'} \leftarrow \tau\omega^{Q} + (1-\tau)\omega^{Q'}$
12:  End for
13: End for

6. Simulation Results

In this section, we demonstrate the effectiveness of the proposed algorithm. The path loss model from [13] is adopted, with the parameter settings listed in Table 2 [27].
Table 2. Parameter settings.
The BS is located at the origin of the x–y plane, and $S_{e0}$ is deployed at (1 m, 1 m). To evaluate the performance of the LAMDDPG algorithm, the following baseline algorithms are introduced:
(1) DDPG algorithm: the baseline algorithm proposed in [27].
(2) Greedy algorithm: $S_{e0}$ consumes all battery energy during data transmission and then begins energy harvesting [27].
(3) Random algorithm: $S_{e0}$ transmits data using $P_{\max}$ as the transmit power, and $\beta_k$ is randomly generated between 0 and $\min\{1,\ E_k/(T P_{\max})\}$.
(4) LMDDPG algorithm: to isolate the effects of the LSTM and attention mechanisms, we design the LMDDPG algorithm by incorporating LSTM layers into the Actor networks of DDPG.
The channels remain unchanged within each experiment, which consists of multiple episodes. Independent and identically distributed (i.i.d.) complex Gaussian random variables with zero mean and unit variance are used to model small-scale fading.
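For reproducibility, i.i.d. CN(0, 1) small-scale fading samples can be generated as in the brief sketch below; the random seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def small_scale_fading(n):
    """i.i.d. CN(0, 1) samples: real and imaginary parts each N(0, 1/2)."""
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2.0)

g = small_scale_fading(100000)
print(np.mean(np.abs(g) ** 2))   # close to 1 for large n (unit variance)
```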

6.1. Rewards Analysis Under Different Algorithms

We first examine the data rate performance of the different algorithms when M = 2, corresponding to two PDs positioned at (0 m, 1 m) and (0 m, 1000 m). As illustrated in Figure 5, after independent and repeated experiments, the data rates achieved by the DDPG, LMDDPG, and LAMDDPG algorithms significantly surpass those of the Greedy and Random algorithms. This superiority stems from the fact that the DRL algorithms effectively guide $S_{e0}$ toward optimal actions. The incorporation of LSTM layers in the Actor networks enables LMDDPG to converge faster than DDPG. Furthermore, the integration of attention mechanisms into the Critic networks allows the LAMDDPG algorithm to converge by the 50th episode while increasing the data rate by 19% over LMDDPG; the combination of LSTM and attention mechanisms improves the data rate by 25% over DDPG.
Figure 5. Rewards under different algorithms when M = 2.
Next, we investigate the data rate performance of the algorithms under different numbers of PDs. Figure 6a,b shows scenarios with 5 and 10 PDs in the network, respectively, positioned between (0 m, 1 m) and (0 m, 1000 m). In these more complex and dynamic environments, both the Random and Greedy algorithms achieve very low data rates. As depicted in Figure 6a,b, after independent and repeated experiments, the data rates achieved by the DDPG, LMDDPG, and LAMDDPG algorithms clearly surpass those of the Greedy and Random algorithms. When the number of PDs is 5, the maximum data rate achieved by LAMDDPG is nearly 8 bps; as the number of PDs increases to 10, it rises to nearly 10 bps. The LAMDDPG algorithm converges faster than LMDDPG and DDPG in both Figure 6a,b. However, compared with Figure 6a, LAMDDPG converges more slowly in Figure 6b, because the larger number of PDs participating in NOMA transmission enlarges the state space of the DRL algorithms, which results in a slower convergence rate. On the other hand, the data rates in Figure 6a,b are higher than those in Figure 5, because with more PDs involved there is relatively more time available for $S_{e0}$ to transmit data.
Figure 6. (a) Rewards under different algorithms when M = 5. (b) Rewards under different algorithms when M = 10.

6.2. Rewards Analysis Under Different EH Model

We then investigate the performance of the algorithms under different EH models when M = 2. In Figure 7, "real" indicates the EH model based on the NLPM, which is more practical in real environments, while "ideal" indicates the EH model based on the linear model commonly used in theoretical analyses. Two PDs are positioned at (0 m, 1 m) and (0 m, 1000 m), respectively. As depicted in Figure 7, after independent and repeated experiments, the upper bound of the data rate, 5 bps, is achieved by the DDPG algorithm with the linear EH model. Under the NLPM, the data rate of DDPG reaches nearly 4.6 bps, which is lower than that of LAMDDPG with the NLPM, which achieves nearly 5 bps. The LAMDDPG algorithm converges at approximately 50 episodes to the maximal data rate. This superiority stems from the ability of the LAMDDPG algorithm, through the combination of LSTM and attention mechanisms, to help $S_{e0}$ choose optimal actions.
Figure 7. Rewards under different EH models when M = 2.
We further investigate the performance of the algorithms under different EH models with different numbers of PDs. Again, "real" indicates the EH model based on the NLPM, while "ideal" denotes the linear model. Figure 8a and Figure 8b show scenarios with five and ten PDs in the network, respectively, positioned between (0 m, 1 m) and (0 m, 1000 m). As depicted in Figure 8a,b, after independent and repeated experiments, the upper bounds of the data rates are 7 bps and 8 bps, respectively, achieved by the DDPG algorithm with the linear EH model. The maximum data rate increases with the number of PDs participating in NOMA transmission, since more PDs provide more opportunities for $S_{e0}$ to transmit data. Under the NLPM, the data rates of DDPG in Figure 8a,b reach nearly 5.8 bps and 6 bps, respectively, which are lower than those of LAMDDPG with the NLPM; LAMDDPG with the NLPM achieves the maximum data rates in both Figure 8a,b. Compared with Figure 8a, when there are ten PDs the rapid growth of the state space caused by the larger number of PDs participating in NOMA transmission results in slower convergence and increased fluctuation under the NLPM in Figure 8b.
Figure 8. (a) Rewards under different EH models when M = 5. (b) Rewards under the different EH models when M = 10.

6.3. Mechanism Analysis of Performance Improvement

Through ablation experiments (Figure 5, Figure 6, Figure 7 and Figure 8), we compare the LMDDPG algorithm with an added LSTM layer, the LAMDDPG algorithm (integrating LSTM and attention mechanisms), and DDPG, Greedy, and Random algorithms. The simulation results demonstrate that LAMDDPG achieves faster convergence and higher cumulative rewards across scenarios with varying numbers of PDs and different energy-harvesting models.
In our scenario, inherent dependencies exist between CSI and the remaining energy of EHSR. The introduced LSTM layer leverages its hidden state to encode historical data, enabling the agent to capture implicit states that are critical for decision-making. Thus, LMDDPG outperforms DDPG, greedy, and random algorithms in reward accumulation.
By integrating the attention module, the extracted features from hidden layers of the Critic (target) network are squeezed and activated by the Sigmoid function, which mitigates the overestimation bias of Q-values. Meanwhile, by squeeze-excitation-scaling steps, the attention module adaptively assigns weights to the extracted features, which emphasizes those that contribute more to Q-value estimation while suppressing redundant ones. This enhances the accuracy of Q-value predictions, reducing the agent’s ineffective exploration and thus accelerating algorithm convergence while boosting cumulative rewards.

6.4. Rewards Analysis Under Different Numbers of PDs

Furthermore, we deploy more PDs to investigate the maximum data rate as a function of the number of PDs under the different EH models. As depicted in Figure 9, under all algorithms the average data rate first increases and then decreases as the number of PDs grows. When the number of PDs exceeds two, the system allocates more time slots, thereby enhancing the data rate; however, when the number of PDs surpasses 10, the data rate declines sharply due to the strong inter-device interference caused by the increased number of PDs. As seen from Figure 10, when the number of PDs is 10, $S_{e0}$ is more inclined to select actions for data transmission than to conserve energy compared with the scenario with 15 PDs; in other words, the action corresponds to the surplus energy, and the probability that the surplus energy is small is relatively high. As seen from Figure 11, when the number of PDs is 15, the probability that the surplus energy is large is relatively high, indicating that $S_{e0}$ tends to prioritize energy harvesting over data transmission to manage the increased interference, which further contributes to the decline in data rate. Regarding the algorithms, LAMDDPG achieves a higher data rate than DDPG under the NLPM, indicating that the LSTM and attention mechanisms are more effective in managing complex environments. Moreover, the system can achieve the ideal values under the linear EH model, because the linear EH model provides a more favorable environment. The optimal data rate is achieved when the number of PDs is ten, highlighting the importance of balancing the number of PDs to maintain efficient network performance.
Figure 9. Rewards under different numbers of PDs.
Figure 10. Action selection when PD = 10.
Figure 11. Action selection when PD = 15.

7. Conclusions

In this paper, we considered the long-term throughput maximization problem for an EHSR IoT device in a CR-NOMA-enabled IoT network comprising multiple primary IoT devices, a base station, and an EHSR IoT device. To stay close to practical applications, we adopted a piece-wise linear function-based NLPM. We addressed this optimization problem by integrating convex optimization with the LAMDDPG algorithm. Experimental results demonstrate that the LSTM layer in the Actor network can predict channel state information from historical data, effectively mitigating the agent's partial observability problem, while the channel-attention SE block in the Critic network mitigates Q-value overestimation through squeeze-excitation-scale operations. The synergy of these two mechanisms accelerates exploration, improves reward acquisition, and speeds up convergence. Moreover, we found the optimal number of PDs that maintains efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications. However, this work assumes ideal SIC in the EH-CR-NOMA symbiotic system. In future work, we will extend our research to non-ideal SIC scenarios and further explore other types of DRL algorithms (e.g., PPO, TD3) for the throughput maximization problem in EH-CR-NOMA symbiotic networks with a nonlinear energy harvesting model.

Author Contributions

Conceptualization, Y.Z. and L.K.; methodology, L.K.; investigation, Y.Z.; writing—original draft preparation, L.K.; writing—review and editing, Y.Z. and L.K.; visualization, J.S.; supervision, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (U23A20627); Key R&D Program Project of Shanxi Province (202302150401004); Shanxi Key Laboratory of Wireless Communication and Detection (2025002); Doctoral Research Start up Fund of Taiyuan University of Science and Technology (20222118), and Scientific Research Startup Fund of Shanxi University of Electronic Science and Technology (2025KJ027).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IoT: Internet of Things
EH: Energy harvesting
SR: Symbiotic radio
NOMA: Non-orthogonal multiple access
CR: Cognitive radio
LSTM: Long short-term memory
DDPG: Deep Deterministic Policy Gradient
SIC: Successive interference cancellation
QoS: Quality of service
SUs: Secondary users
PUs: Primary users
SWIPT: Simultaneous wireless information and power transfer
TS: Time-switching
PS: Power-splitting
CSI: Channel state information
DRL: Deep reinforcement learning
PPO: Proximal Policy Optimization
CER-DDPG: Combined experience replay with DDPG
NLPMs: Nonlinear power models
PER: Prioritized experience replay
TDMA: Time-division multiple access
PDs: Primary devices

Appendix A. Proof That Problems P7 and P8 Are Feasible

Since $E_k^{\mathrm{sur}}$ denotes the energy surplus at time slot $t_k$, when $S_{e0}$ transmits data for the whole time slot (i.e., $\beta_k = 1$) using the maximum power $P_{\max}$, $E_k^{\mathrm{sur}}$ reaches its lowest value $-T P_{\max}$. Thus, $E_k^{\mathrm{sur}}$ satisfies:
$$E_k^{\mathrm{sur}} \ge -T P_{\max} \tag{A1}$$
The energy supply for data transmission comes from the remaining energy $E_k$ at $t_k$. The maximum energy consumed during this process, $T P_{\max} = -E_k^{\mathrm{sur}}$, cannot exceed $E_k$, which means $-E_k^{\mathrm{sur}} \le E_k$, i.e., $E_k^{\mathrm{sur}} \ge -E_k$. Thus, the lower bound of $E_k^{\mathrm{sur}}$ can be written as:
$$E_k^{\mathrm{sur}} \ge -\min\left\{E_k,\ T P_{\max}\right\} \tag{A2}$$
On the other hand, when $S_{e0}$ harvests energy for the whole time slot (i.e., $\beta_k = 0$), $E_k^{\mathrm{sur}}$ cannot exceed:
$$E_k^{\mathrm{sur}} \le \begin{cases} T\eta P_k |h_{k,0}|^2, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ P_{th}T, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases} \tag{A3}$$
Due to the battery capacity constraint $E_{\max}$, and considering that the remaining energy at $t_k$ is $E_k$, the upper bound for $E_k^{\mathrm{sur}}$ is:
$$E_k^{\mathrm{sur}} \le \begin{cases} \min\left\{E_{\max}-E_k,\ T\eta P_k |h_{k,0}|^2\right\}, & \eta P_k |h_{k,0}|^2 \le P_{th} \\ \min\left\{E_{\max}-E_k,\ P_{th}T\right\}, & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases} \tag{A4}$$
Considering the relation between $\eta P_k |h_{k,0}|^2$ and $P_{th}$, the bounds for $E_k^{\mathrm{sur}}$ are obtained by combining Formula (A2) with (A4).
(1) When $\eta P_k |h_{k,0}|^2 \le P_{th}$, Formula (A4) gives $E_k^{\mathrm{sur}} \le \min\{E_{\max}-E_k,\ T\eta P_k |h_{k,0}|^2\} \le T\eta P_k |h_{k,0}|^2$, so $1 - \frac{E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2} \ge 0$ and the lower bound in constraint (P7d) does not conflict with the upper bound in (P7a). According to Formula (A2), $E_k^{\mathrm{sur}} \ge -E_k$, so $E_k + E_k^{\mathrm{sur}} \ge 0$ and $1 - \frac{E_k + E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2} \le 1$; hence the lower bound in constraint (P7b) does not conflict with the upper bound in (P7d). Furthermore, Formula (A2) gives $E_k^{\mathrm{sur}} \ge -T P_{\max}$, i.e., $-E_k^{\mathrm{sur}} \le T P_{\max}$, so $T\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}} \le T\eta P_k |h_{k,0}|^2 + T P_{\max}$ and $\frac{T\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2 + T P_{\max}} \le 1$; hence the lower bound in constraint (P7c) does not conflict with the upper bound in (P7d). According to Formula (A4), $E_k^{\mathrm{sur}} \le T\eta P_k |h_{k,0}|^2$, so $T\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}} \ge 0$ and, apparently, $\frac{T\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2 + T P_{\max}} \le 1 - \frac{E_k^{\mathrm{sur}}}{T\eta P_k |h_{k,0}|^2}$; hence the lower bound in constraint (P7c) does not conflict with the upper bound in (P7a). We therefore know that the constraints of P7 do not conflict with each other and that the set defined by the constraints of P7 is not empty. The domain of the objective function of problem P7 satisfies $\frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2} + 1 > 0$. The domain of constraint (P7a) satisfies
$$(1-\beta_k)T\,\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}} \ge 0 \tag{A5}$$
Combining this with Formula (6), we obtain $P_{0,k}\,\beta_k T = (1-\beta_k)T\,\eta P_k |h_{k,0}|^2 - E_k^{\mathrm{sur}} \ge 0$, so $\frac{\beta_k T P_{0,k}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2} \ge 0$, which can be written as
$$\frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2} \ge 0 \tag{A6}$$
Thus, the intersection of the domain for the constraint and for the objective function is not empty.
(2) When $\eta P_k |h_{k,0}|^2 > P_{th}$, Formula (A4) gives $E_k^{\mathrm{sur}} \le \min\{E_{\max}-E_k,\ P_{th}T\} \le P_{th}T$, so $1 - \frac{E_k^{\mathrm{sur}}}{P_{th}T} \ge 0$ and the lower bound in constraint (P8d) does not conflict with the upper bound in (P8a). According to Formula (A2), $E_k^{\mathrm{sur}} \ge -E_k$, so $E_k + E_k^{\mathrm{sur}} \ge 0$ and $1 - \frac{E_k + E_k^{\mathrm{sur}}}{P_{th}T} \le 1$; hence the lower bound in constraint (P8b) does not conflict with the upper bound in (P8d). Furthermore, Formula (A2) gives $E_k^{\mathrm{sur}} \ge -T P_{\max}$, i.e., $-E_k^{\mathrm{sur}} \le T P_{\max}$, so $P_{th}T - E_k^{\mathrm{sur}} \le P_{th}T + T P_{\max}$ and $\frac{P_{th}T - E_k^{\mathrm{sur}}}{P_{th}T + P_{\max}T} \le 1$; hence the lower bound in constraint (P8c) does not conflict with the upper bound in (P8d). According to Formula (A4), $E_k^{\mathrm{sur}} \le P_{th}T$, so $P_{th}T - E_k^{\mathrm{sur}} \ge 0$ and, apparently, $\frac{P_{th}T - E_k^{\mathrm{sur}}}{P_{th}T + P_{\max}T} \le 1 - \frac{E_k^{\mathrm{sur}}}{P_{th}T}$; hence the lower bound in constraint (P8c) does not conflict with the upper bound in (P8a). We therefore know that the constraints of P8 do not conflict with each other and that the set defined by the constraints of P8 is not empty. The domain of the objective function of problem P8 satisfies $\frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2} + 1 > 0$. The domain of constraint (P8a) satisfies
$$(1-\beta_k)P_{th}T - E_k^{\mathrm{sur}} \ge 0 \tag{A7}$$
Combining this with Formula (6), we obtain $P_{0,k}\,\beta_k T = (1-\beta_k)T\,P_{th} - E_k^{\mathrm{sur}} \ge 0$, so $\frac{\beta_k T P_{0,k}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2} \ge 0$, which can be written as $\frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2} \ge 0$; thus, the intersection of the domain of the constraint and that of the objective function is not empty.
As a result, optimization problems P7 and P8 are feasible.

Appendix B. Proof That Problems P7 and P8 Are Concave Functions

Since the objective function of P7 and P8 can be denoted as:
$$f_0(\beta_k) = \beta_k \ln\left(1 + \frac{E_{(1-\beta_k)T} - E_k^{\mathrm{sur}}}{\beta_k T}\cdot\frac{|h_{0,k}|^2}{1 + P_k |h_k|^2}\right)$$
By combining Formula (A1) with (A5) and Formula (A3) with (A7), respectively, we know that $\beta_k = 0$ is in the domain of $f_0(\beta_k)$.
To simplify the expressions, we define $\lambda_1 = \frac{\eta P_k |h_{k,0}|^2\,|h_{0,k}|^2}{1 + P_k |h_k|^2}$, $\lambda_2 = \frac{E_k^{\mathrm{sur}}\,|h_{0,k}|^2}{T(1 + P_k |h_k|^2)}$, $\lambda_3 = \frac{P_{th}\,|h_{0,k}|^2}{1 + P_k |h_k|^2}$, and $x = \beta_k$; the objective function can then be written as
$$f_0(x) = \begin{cases} x\ln\!\left(1 - \lambda_1 + \dfrac{\lambda_1 - \lambda_2}{x}\right), & \eta P_k |h_{k,0}|^2 \le P_{th} \\[2ex] x\ln\!\left(1 - \lambda_3 + \dfrac{\lambda_3 - \lambda_2}{x}\right), & \eta P_k |h_{k,0}|^2 > P_{th} \end{cases}$$
Similar to the steps in [27], we can obtain the first-order derivative and the second-order derivative of f 0 ( x ) to prove that the objective functions of P7 and P8 are concave functions.
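As a complement to the analytical argument, the concavity of $f_0(x)$ can also be checked numerically on a grid, as sketched below; the values of $\lambda_1$ and $\lambda_2$ are illustrative and chosen so that the argument of the logarithm stays positive on the sampled interval.

```python
import numpy as np

def f0(x, lam1, lam2):
    """Objective of P7 in the compact form f0(x) = x * ln(1 - lam1 + (lam1 - lam2)/x)."""
    return x * np.log(1.0 - lam1 + (lam1 - lam2) / x)

# Numerical second-difference check of concavity on a grid (illustrative parameters).
lam1, lam2 = 2.0, 0.3
x = np.linspace(0.05, 1.0, 400)
second_diff = np.diff(f0(x, lam1, lam2), n=2)
print(np.all(second_diff <= 1e-9))   # True -> f0 is concave on the sampled interval
```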

References

  1. Donta, P.K.; Srirama, S.N.; Amgoth, T.; Annavarapu, C.S.R. Survey on recent advances in IoT application layer protocols and machine learning scope for research directions. Digit. Commun. Netw. 2022, 8, 727–744. [Google Scholar] [CrossRef]
  2. Andrews, J.G.; Buzzi, S.; Choi, W.; Hanly, S.V.; Lozano, A.; Soong, A.K.; Zhang, J.C. What will 5G be? IEEE J. Sel. Areas Commun. 2014, 32, 1065–1082. [Google Scholar] [CrossRef]
  3. Makki, B.; Chitti, K.; Behravan, A.; Alouini, M.-S. A survey of NOMA: Current status and open research challenges. IEEE Open J. Commun. Soc. 2020, 1, 179–189. [Google Scholar] [CrossRef]
  4. Lei, H.; She, X.; Park, K.-H.; Ansari, I.S.; Shi, Z.; Jiang, J.; Alouini, M.-S. On secure CDRT with NOMA and physical-layer network coding. IEEE Trans. Commun. 2023, 71, 381–396. [Google Scholar] [CrossRef]
  5. Kilzi, A.; Farah, J.; Nour, C.A.; Douillard, C. Mutual successive interference cancellation strategies in NOMA for enhancing the spectral efficiency of CoMP systems. IEEE Trans. Commun. 2020, 68, 1213–1226. [Google Scholar] [CrossRef]
  6. Li, X.; Zheng, Y.; Khan, W.U.; Zeng, M.; Li, D.; Ragesh, G.K.; Li, L. Physical layer security of cognitive ambient backscatter communications for green Internet-of-Things. IEEE Trans. Green Commun. Netw. 2021, 5, 1066–1076. [Google Scholar] [CrossRef]
  7. Chen, B.; Chen, Y.; Chen, Y.; Cao, Y.; Zhao, N.; Ding, Z. A novel spectrum sharing scheme assisted by secondary NOMA relay. IEEE Wirel. Commun. Lett. 2018, 7, 732–735. [Google Scholar] [CrossRef]
  8. Do, D.-T.; Le, A.-T.; Lee, B.M. NOMA in Cooperative Underlay Cognitive Radio Networks Under Imperfect SIC. IEEE Access 2020, 8, 86180–86195. [Google Scholar] [CrossRef]
  9. Ali, Z.; Khan, W.U.; Sidhu, G.A.S.; K, N.; Li, X.; Kwak, K.S.; Bilal, M. Fair power allocation in cooperative cognitive systems under NOMA transmission for future IoT networks. Alex. Eng. J. 2022, 61, 575–583. [Google Scholar] [CrossRef]
  10. Jiang, Q.; Zhang, C.; Zheng, W.; Wen, X. Research on Delay DRL in Energy-Constrained CR-NOMA Networks based on Multi-Threads Markov Reward Process. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Nanjing, China, 29 March–1 April 2021. [Google Scholar] [CrossRef]
  11. Elmadina, N.N.; Saeid, E.; Mokhtar, R.A.; Saeed, R.A.; Ali, E.S.; Khalifa, O.O. Performance of Power Allocation Under Priority User in CR-NOMA. In Proceedings of the 2023 IEEE 3rd International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering (MI-STA), Benghazi, Libya, 21–23 May 2023. [Google Scholar] [CrossRef]
  12. Alhamad, R.; Boujemâa, H. Optimal power allocation for CRN-NOMA systems with adaptive transmit power. Signal Image Video Process. 2020, 14, 1327–1334. [Google Scholar] [CrossRef]
  13. Abidrabbu, S.S.; Arslan, H. Energy-Efficient Resource Allocation for 5G Cognitive Radio NOMA Using Game Theory. In Proceedings of the 2021 IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 29 March–1 April 2021. [Google Scholar] [CrossRef]
  14. Xie, N.; Tan, H.; Huang, L.; Liu, A.X. Physical-layer authentication in wirelessly powered communication networks. IEEE/ACM Trans. Netw. 2021, 29, 1827–1840. [Google Scholar] [CrossRef]
  15. Huang, J.; Xing, C.; Guizani, M. Power allocation for D2D communications with SWIPT. IEEE Trans. Wirel. Commun. 2020, 19, 2308–2320. [Google Scholar] [CrossRef]
  16. Liu, Y.; Ding, Z.; Elkashlan, M.; Poor, H.V. Cooperative non-orthogonal multiple access with simultaneous wireless information and power transfer. IEEE J. Sel. Areas Commun. 2016, 34, 938–953. [Google Scholar] [CrossRef]
  17. Mazhar, N.; Ullah, S.A.; Jung, H.; Nadeem, Q.-U.-A.; Hassan, S.A. Enhancing spectral efficiency in IoT networks using deep deterministic policy gradient and opportunistic NOMA. In Proceedings of the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall), Washington, DC, USA, 7–10 October 2024. [Google Scholar] [CrossRef]
  18. Yang, J.; Cheng, Y.; Peppas, K.P.; Mathiopoulos, P.T.; Ding, J. Outage performance of cognitive DF relaying networks employing SWIPT. China Commun. 2018, 15, 28–40. [Google Scholar] [CrossRef]
  19. Song, Z.; Wang, X.; Liu, Y.; Zhang, Z. Joint Spectrum Resource Allocation in NOMA-based Cognitive Radio Network with SWIPT. IEEE Access 2019, 7, 89594–89603. [Google Scholar] [CrossRef]
  20. Yang, C.; Lu, W.; Huang, G.; Qian, L.; Li, B.; Gong, Y. Power Optimization in Two-way AF Relaying SWIPT based Cognitive Sensor Networks. In Proceedings of the 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall), Victoria, BC, Canada, 18 November–16 December 2020. [Google Scholar] [CrossRef]
  21. Liu, X.; Zheng, K.; Chi, K.; Zhu, Y.-H. Cooperative Spectrum Sensing Optimization in Energy-Harvesting Cognitive Radio Networks. IEEE Trans. Wirel. Commun. 2020, 19, 7663–7676. [Google Scholar] [CrossRef]
  22. Wang, Y.; Chen, S.; Wu, Y.; Zhao, C. Maximizing Average Throughput of Cooperative Cognitive Radio Networks Based on Energy Harvesting. Sensors 2022, 22, 8921. [Google Scholar] [CrossRef]
  23. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.-C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar] [CrossRef]
  24. Umeonwuka, O.O.; Adejumobi, B.S.; Shongwe, T. Deep Learning Algorithms for RF Energy Harvesting Cognitive IoT Devices: Applications, Challenges and Opportunities. In Proceedings of the 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET), Prague, Czech Republic, 20–22 July 2022. [Google Scholar] [CrossRef]
  25. Du, K.; Xie, X.; Shi, Z.; Li, M. Joint Time and Power Control of Energy Harvesting CRN Based on PPO. In Proceedings of the 2022 Wireless Telecommunications Symposium (WTS), Pomona, CA, USA, 6–8 April 2022. [Google Scholar] [CrossRef]
  26. Al Rabee, F.T.; Masadeh, A.; Abdel-Razeq, S.; Salameh, H.B. Actor–Critic Reinforcement Learning for Throughput-Optimized Power Allocation in Energy Harvesting NOMA Relay-Assisted Networks. IEEE Open J. Commun. Soc. 2024, 5, 7941–7953. [Google Scholar] [CrossRef]
  27. Ding, Z.; Schober, R.; Poor, H.V. No-Pain No-Gain: DRL Assisted Optimization in Energy-Constrained CR-NOMA Networks. IEEE Trans. Commun. 2021, 69, 5917–5932. [Google Scholar] [CrossRef]
  28. Shi, Z.; Xie, X.; Lu, H.; Yang, H.; Cai, J.; Ding, Z. Deep Reinforcement Learning-Based Multidimensional Resource Management for Energy Harvesting Cognitive NOMA Communications. IEEE Trans. Commun. 2022, 70, 3110–3125. [Google Scholar] [CrossRef]
  29. Ullah, A.; Zeb, S.; Mahmood, A.; Hassan, S.A.; Gidlund, M. Opportunistic CR-NOMA Transmissions for Zero-Energy Devices: A DRL-Driven Optimization Strategy. IEEE Wirel. Commun. Lett. 2023, 12, 893–897. [Google Scholar] [CrossRef]
  30. Du, K.; Xie, X.; Shi, Z.; Li, M. Throughput maximization of EH-CRN-NOMA based on PPO. In Proceedings of the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26–28 April 2023. [Google Scholar] [CrossRef]
  31. Zhou, F.; Chu, Z.; Wu, Y.; Al-Dhahir, N.; Xiao, P. Enhancing PHY security of MISO NOMA SWIPT systems with a practical non-linear EH model. In Proceedings of the 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 20–24 May 2018. [Google Scholar] [CrossRef]
  32. Kumar, D.; Singya, P.K.; Choi, K.; Bhatia, V. SWIPT enabled cooperative cognitive radio sensor network with non-linear power amplifier. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 884–896. [Google Scholar] [CrossRef]
  33. Mohammed, A.A.; Baig, M.W.; Sohail, M.A.; Ullah, S.A.; Jung, H.; Hassan, S.A. Navigating boundaries in quantifying robustness: A DRL expedition for non-linear energy harvesting IoT networks. IEEE Commun. Lett. 2024, 28, 2447–2451. [Google Scholar] [CrossRef]
  34. Ullah, S.A.; Mahmood, A.; Nasir, A.A.; Gidlund, M.; Hassan, S.A. DRL-driven optimization of a wireless powered symbiotic radio with non-linear EH model. IEEE Open J. Commun. Soc. 2024, 5, 5232–5247. [Google Scholar] [CrossRef]
  35. Li, K.; Ni, W.; Dressler, F. LSTM-Characterized Deep Reinforcement Learning for Continuous Flight Control and Resource Allocation in UAV-Assisted Sensor Network. IEEE Internet Things J. 2022, 9, 4179–4189. [Google Scholar] [CrossRef]
  36. He, X.; Mao, Y.; Liu, Y.; Ping, P.; Hong, Y.; Hu, H. Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks. Digit. Commun. Netw. 2024, 10, 109–116. [Google Scholar] [CrossRef]
  37. Ullah, I.; Singh, S.K.; Adhikari, D.; Khan, H.; Jiang, W.; Bai, X. Multi-Agent Reinforcement Learning for task allocation in the Internet of Vehicles: Exploring benefits and paving the future. Swarm Evol. Comput. 2025, 94, 101878. [Google Scholar] [CrossRef]
  38. Alhartomi, M.; Salh, A.; Audah, L.; Alzahrani, S.; Alzahmi, A. Enhancing Sustainable Edge Computing Offloading via Renewable Prediction for Energy Harvesting. IEEE Access 2024, 12, 74011–74023. [Google Scholar] [CrossRef]
  39. Choi, J.; Lee, B.-J.; Zhang, B.-T. Multi-focus Attention Network for Efficient Deep Reinforcement Learning. In Proceedings of the AAAI 2017 Workshop on What’s Next for AI in Games, AAAI 2017, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar] [CrossRef]
  40. Zhou, X.; Zhang, R.; Ho, C.K. Wireless Information and Power Transfer: Architecture Design and Rate-Energy Tradeoff. IEEE Trans. Commun. 2013, 61, 4754–4767. [Google Scholar] [CrossRef]
  41. Yuan, T.; Liu, M.; Feng, Y. Performance Analysis for SWIPT Cooperative DF Communication Systems with Hybrid Receiver and Non-Linear Energy Harvesting Model. Sensors 2020, 20, 2472. [Google Scholar] [CrossRef]
  42. Boyd, S.; Vandenberghe, L. Convex Optimization, 1st ed.; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  43. Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series and Products, 6th ed.; Academic Press: New York, NY, USA, 2000. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
