Abstract
Massive Internet of Things (IoT) deployments face critical spectrum crowding and energy scarcity challenges. Energy harvesting (EH) symbiotic radio (SR), where secondary devices share spectrum with and harvest energy from non-orthogonal multiple access (NOMA)-based primary systems, offers a sustainable solution. We consider long-term throughput maximization in an EHSR network with a nonlinear power model (NLPM) for energy harvesting. To solve this non-convex problem, we design a two-layered optimization algorithm that combines convex optimization with a deep reinforcement learning (DRL) framework. The derived optimal power, time allocation factor, and the time-varying environment state are fed into the proposed Deep Deterministic Policy Gradient algorithm combined with long short-term memory (LSTM) and an attention mechanism, named LAMDDPG, to achieve the optimal long-term throughput. Simulation results demonstrate that, by equipping the Actor with LSTM layers to capture the temporal state and enhancing the Critic with a channel-wise attention mechanism (a Squeeze-and-Excitation block) for precise Q-value evaluation, the LAMDDPG algorithm achieves faster convergence and higher long-term throughput than the baseline algorithms. Moreover, we identify the optimal number of primary devices (PDs) for maintaining efficient network performance under the NLPM, which provides practical guidance for EHSR applications.
1. Introduction
As wireless communication and sensing technologies have advanced rapidly, the Internet of Things (IoT) is increasingly being integrated into various vertical domains, such as smart transportation and smart cities [1]. A surge in the number of IoT devices connected to networks has led to severe spectrum congestion. Given the limited availability of spectrum resources, providing low-latency and high-reliability services to this massive number of IoT devices poses a significant and growing challenge [2].
Non-orthogonal multiple access (NOMA) has emerged as a core technology for accommodating massive connections in IoT networks [3]. By adopting superposition coding at the transmitter and successive interference cancellation (SIC) at the receiver, NOMA permits multiple devices to share the same spectrum simultaneously [4]. The SIC process decodes signals sequentially, starting with the device with the lower channel gain. Once a signal is decoded, it is subtracted from the received signal to iteratively eliminate co-channel interference and improve the quality of service (QoS) for all connected devices [5]. To further enhance spectrum utilization, cognitive radio (CR) technology has been integrated with NOMA, enabling secondary users (SUs), or unlicensed devices, to access the spectrum allocated to primary users (PUs), or licensed devices [6]. Various CR-NOMA frameworks, such as frameworks based on secondary NOMA relays [7], cooperative frameworks under imperfect SIC [8], and fair power allocation-based frameworks [9], have been proposed to enhance spectrum efficiency. In addition, the CR-NOMA framework can improve the throughput of primary devices [10] and of the total system [11]. Furthermore, the CR-NOMA architecture can enhance energy efficiency in energy-constrained IoT networks; for instance, transmit power allocation schemes based on an improved artificial fish optimization algorithm [12] and on non-cooperative game theory [13] significantly enhance energy efficiency. Despite these advances in energy efficiency, the large-scale deployment of IoT applications is still limited by energy constraints. Battery-powered devices create notable energy bottlenecks, presenting a key challenge for practical implementation [14].
Energy harvesting (EH) has emerged as a transformative technology for addressing the energy constraints faced by IoT devices. Beyond conventional sources such as wind and solar power, EH enables devices to draw energy from ambient sources such as radio frequency (RF) signals, thereby offering a sustainable approach to powering IoT networks. Establishing an EH-enabled CR-NOMA IoT network, in which symbiotic secondary devices (SDs) share the spectrum of primary devices (PDs) and harvest energy from the PDs' emitted signals, can greatly enhance the spectrum efficiency and sustainability of IoT applications.
The randomness of energy harvesting from the wireless environment complicates the acquisition of precise channel state information (CSI). Deep reinforcement learning (DRL) algorithms offer a solution by providing optimal spectrum and power allocation in such networks. However, the inherent nonlinearity of practical energy harvesting circuits, particularly those characterized by piecewise functions, challenges DRL-based spectrum access and power allocation aimed at long-term throughput enhancement.
In this study, we explore an EH symbiotic radio (EHSR) IoT network in which the symbiotic SD is permitted, via CR, to access the spectrum assigned to the PDs and harvests energy from the PDs' emitted signals according to a nonlinear power model (NLPM). We first determine the optimal power and time allocation factor for the symbiotic SD using convex optimization. We then design a Deep Deterministic Policy Gradient (DDPG) algorithm combined with long short-term memory (LSTM) and an attention mechanism, named LAMDDPG, to solve the long-term throughput maximization problem for the SD. Moreover, we identify the optimal number of PDs for maintaining efficient network performance under the NLPM, which provides practical guidance for EHSR applications.
2. Related Works
2.1. EH in CR-NOMA Networks
Simultaneous wireless information and power transfer (SWIPT) has become a key enabler for sustainable IoT. By integrating time-switching (TS) or power-splitting (PS) protocols, SWIPT allows devices to decode information and harvest energy simultaneously [15]. The integration of SWIPT with CR and NOMA technologies can further enhance system performance. For example, Liu et al. proposed a cooperative SWIPT-NOMA protocol and improved system throughput [16]. Through convex optimization and a DDPG-based opportunistic NOMA access method, the spectral efficiency of EH-enabled secondary IoT devices has been enhanced [17]. By applying the Lagrangian dual decomposition method, the outage performance under Nakagami-m fading channels was improved in a cognitive network with SWIPT relays [18]. A bisection-based framework was proposed to maximize throughput for SUs by jointly optimizing power allocation and sensing time in SWIPT-enabled CR-NOMA networks [19]. Furthermore, by combining Lagrangian dual decomposition with the bisection method, system throughput was enhanced in SWIPT-enabled cognitive sensor networks [20]. In addition, in other CR networks, SDs obtain supplementary energy by harvesting RF signals from PDs. A k-out-of-M fusion spectrum access strategy was developed to maximize the achievable throughput in an EH SR network with a mobile SD [21]. Wang et al. improved the average throughput in EH-enabled CR networks via sub-gradient descent and single-linear optimization methods; they considered both energy and transmission power constraints, making their approach more applicable to real-world environments [22]. To better harness EHSR for IoT networks, it is important to design efficient resource allocation algorithms that provide optimal spectrum access and power allocation.
2.2. Deep Reinforcement Learning for Resource Allocation
Although the aforementioned algorithms can effectively enhance system throughput in EHSR IoT networks, they require accurate CSI. However, the randomness of energy harvesting from the wireless environment complicates the acquisition of precise CSI. Moreover, throughput maximization faces significant challenges owing to the tight coupling between the data transmission of symbiotic secondary devices and their instantaneous energy availability, compounded by the stochastic nature of energy conversion, which profoundly affects transmission rates. DRL algorithms can provide optimal resource allocation decisions for devices in such networks by leveraging the interactions between agents and the environment, thereby enhancing overall system throughput [23]. Umeonwuka et al. proposed a DDPG framework to maximize long-term throughput by dynamically adjusting continuous power allocation in EH-enabled cognitive IoT [24]. In addition, a Proximal Policy Optimization (PPO)-based algorithm was designed to enhance the throughput of SUs by jointly tuning the EH time and power control in EH-enabled CR networks; the PPO framework, with its clipping mechanism, offers higher training stability than DDPG [25]. To effectively explore the continuous action space in DRL algorithms, an actor-critic architecture was proposed to enhance throughput in EH-enabled NOMA networks [26]. Furthermore, different DRL architectures can enhance the throughput of EH-CR-NOMA networks. For example, Ding et al. designed a DDPG-based method for EH-SUs to maximize long-term throughput; optimal pairing of SUs and PUs into NOMA pairs was also achieved using a DDPG-based algorithm [27]. A simplified practical action adjuster was further introduced into the DDPG framework to reduce packet loss for SUs in such IoT networks [28]. Ullah et al. designed an energy-efficient transmission strategy by integrating combined experience replay with DDPG (CER-DDPG) [29]. A PPO-based algorithm with a simplified action space was proposed to maximize throughput in EH-CR-NOMA networks [30].
However, owing to the nonlinear features of EH circuit components, some researchers have explored NLPMs to enhance the throughput in EH-enabled CR-NOMA networks [31]. For instance, a SWIPT-CR sensor network framework with NLPM was designed, and its throughput performance was analyzed [32]. In EH-CR-NOMA networks with NLPMs, the throughput of secondary devices is maximized using a prioritized experience replay (PER)-DDPG framework. Compared with DDPG, CER-DDPG, and PPO, PER-DDPG converges faster by adopting prioritized sampling from the replay buffer [33]. Additionally, the uplink transmission of EH devices with NLPMs was optimized via a two-layer optimization method. The authors utilized a convex optimization method and modified the DDPG algorithm for this purpose [34]. Furthermore, Li et al. designed an LSTM-driven actor-critic framework to solve the service offloading problem with a faster convergence rate than DDPG in an IoT network [35]. In another heterogeneous edge network, a distributed multi-agent PPO resource allocation algorithm was proposed to maximize the sum rate [36]. When offloading tasks occurred in fast-varying vehicular networks, various multi-agent reinforcement learning frameworks were leveraged for efficient resource allocation [37].
2.3. Motivations
However, the EHSR faces a trade-off: it can transmit data to boost throughput, which may cause interference to the primary devices, or it can harvest energy to ensure a sufficient energy supply, which may reduce the long-term throughput. Therefore, it is essential to develop an effective policy that allows the EHSR to enhance its long-term throughput. This policy can be cast as a sequential decision-making process and solved with a single-agent DRL algorithm. Several challenges arise when aiming for long-term throughput. First, the dynamic nature of the channel, combined with the randomness of RF energy harvesting and the use of nonlinear models, can rapidly increase the dimension of the state space, which increases the complexity of the DRL algorithm. In addition, in an actor-critic structure, the critic network evaluates the actions taken by the agent to aid in determining the most advantageous strategies; in such a dynamic environment, high-dimensional inputs, such as complex states and actions, reduce the accuracy and efficiency of this evaluation, which degrades algorithm performance. Therefore, accurately predicting the time-varying channel conditions and energy collection state is crucial for reducing the complexity of the state space, while focusing on key information to provide precise Q-values is equally important. An LSTM layer or network can be used to forecast the channel state [38]. In this context, we introduce LSTM layers to forecast the channel and energy collection states of nodes with NLPMs, thereby reducing the state dimension. Meanwhile, an attention mechanism enables the network to concentrate on specific features, improving the accuracy and efficiency of the evaluation [39]; we therefore introduce an attention mechanism into the critic network.
Considering the different value ranges of the jointly optimized variables and the piecewise nature of the NLPM, an intermediate variable named the energy surplus is introduced to simplify the optimization problem, making it easier to solve via the DRL method. First, we determine the optimal power allocation and time allocation factor using convex optimization. Then, we design the LSTM and attention mechanism combined DDPG algorithm (LAMDDPG) to solve the long-term throughput maximization problem.
The main contributions are listed as follows:
We establish a long-term throughput maximization problem for the EHSR in a CR-NOMA-enabled IoT network with an NLPM. The optimization problem is transformed into a two-layered optimization problem. Closed-form expressions for the power and time allocation factor of the secondary device are derived, and these optimal parameters are then fed into the LAMDDPG framework to optimize the long-term throughput of the secondary device.
We incorporate LSTM layers into the Actor networks of DDPG to predict the channel state and the harvested energy. Additionally, an attention mechanism is employed in the Critic networks to enhance evaluation performance.
We also construct an LSTM-combined DDPG (LMDDPG) algorithm by incorporating only LSTM layers into the Actor networks of DDPG. The simulation results demonstrate the superiority of the proposed framework over the DDPG, LMDDPG, Random, and Greedy algorithms. Moreover, we identify the optimal number of PDs for maintaining efficient network performance under the NLPM.
3. System Model
This study investigates an uplink communication scenario in an energy-constrained CR-NOMA IoT network. As illustrated in Figure 1, the network comprises a base station (BS), M primary devices (PDs), and an EHSR. PDs transmit data in distinct time slots following a time-division multiple access (TDMA) scheduling protocol. Specifically, the PD is assigned the time slot to guarantee its data transmission priority. As depicted in Figure 2, the EHSR, denoted as , achieves non-orthogonal multiplexing by sharing one of the PDs' time slots to send information to the BS while simultaneously harvesting energy, based on the underlay mode [40]. The BS utilizes SIC technology, aided by known CSI that includes both large-scale and multipath fading. The BS first decodes the PD's signal and subtracts it from the received mixed signal to decode the signal for . During any given time slot , commences data transmission, where . represents the channel gain between the BS and . signifies the channel gain between and . represents the channel gain between and the BS. These channel gains encompass both fading effects, and the channel state information is assumed to be known. uses seconds for data transmission, where is the time allocation factor and is the duration of every time slot. The remaining seconds are utilized for energy harvesting. We summarize the main notations of this paper in Table 1.
Figure 1.
EH symbiotic radio system.
Figure 2.
Data transmission mode.
Table 1.
Summary of main notations.
In this paper, we employ a piece-wise linear function-based NLPM [41] to calculate the harvested energy during .
where is the transmit power of , denotes the energy harvesting threshold of the EH circuit, and is the energy harvesting efficiency.
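The exact piecewise expression and its parameters follow [41] and are given in the equation above. Purely as an illustration, the sketch below assumes a common piecewise-linear form with a conversion efficiency `eta` and a saturation threshold `p_th` on the received RF power; both names and default values are our placeholders, not the paper's constants.

```python
def harvested_energy(p_received, duration, eta=0.7, p_th=0.2):
    """Piecewise-linear nonlinear EH model (illustrative sketch).

    Below the circuit saturation threshold p_th the harvested power grows
    linearly with the received RF power; above p_th it saturates.
    Powers are in watts, duration in seconds, eta is the conversion efficiency.
    """
    harvested_power = eta * min(p_received, p_th)  # linear region, then saturation
    return harvested_power * duration
```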
Let denote the residual energy in the battery of at the beginning of the time slot , after energy harvesting, the energy of in the subsequent time slot can be formulated as:
Here, is the maximum battery capacity of , denotes the energy consumed by for data transmission in time slot . is the transmit power of , with a maximum value of . Based on Formula (2), ’s transmit power is dynamically constrained by its harvested energy and . This energy constraint inherently limits ’s interference to the co-channel PD, which will guarantee the quality of service (QoS) for PD.
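To make the battery dynamics of Formula (2) concrete, the following sketch (our own illustration with placeholder values) spends the transmission energy, adds the harvested energy, and clips the result at the battery capacity, mirroring the constraint that transmission cannot consume more energy than is stored.

```python
def next_battery_energy(e_now, e_consumed, e_harvested, e_max=0.2):
    """One-slot battery update (sketch): spend e_consumed during transmission,
    add the energy harvested in the remaining part of the slot, and clip the
    result at the battery capacity e_max (illustrative value, in joules)."""
    assert 0.0 <= e_consumed <= e_now, "transmission cannot use more energy than is stored"
    return min(e_now - e_consumed + e_harvested, e_max)
```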
In time slot , the achievable data rate of is represented as:
The BS first decodes the signal of , then removes the signal of to decode the signal of using SIC. Our primary objective is to maximize the long-term throughput of . Essentially, this involves a decision-making process for to determine which PD to access at each time slot as well as to optimize its own power and time allocation coefficient to achieve the highest possible throughput over an extended period. The problem can be mathematically formulated as:
where denotes the maximum transmit power of . is the discount rate, which serves as a tradeoff between short-term and long-term rewards. Constraint (P1b) ensures that the energy used for data transmission by does not exceed its remaining energy at time slot . The parameter and are coupled and have distinct value ranges. This coupling makes it challenging to directly use them as input actions for DRL, as it can result in unstable training.
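As a small numerical illustration of the per-slot rate that enters the objective of P1, the sketch below assumes, as described above, that the BS has already removed the PD's signal via SIC, so the secondary link sees only noise during its fraction of the slot; all symbol names and the default noise power are our placeholders.

```python
import math

def secondary_rate(alpha, p_tx, g_sd_bs, noise_power=1e-3):
    """Per-slot achievable rate of the secondary device (sketch).

    Assumes the BS has decoded and subtracted the PD's signal via SIC, so the
    secondary transmission over the alpha-fraction of the slot is noise-limited.
    """
    return alpha * math.log2(1.0 + p_tx * g_sd_bs / noise_power)
```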
4. Problem Formulation and Decomposition
4.1. Problem Transformation
Optimization problem P1 is solved via a layered optimization approach. Figure 3 presents a logic flow diagram of this process.
Figure 3.
Logic flow diagram for the two-layered optimization process.
We introduce a variable to represent the energy surplus. If it indicates that the harvested energy can cover the energy consumed during data transmission. This condition helps extend the operational lifespan of the network, which is a key objective of the system.
and can be denoted by . Then problem P1 can be transformed into problem P2.
In essence, maximizing the long-term throughput of involves optimizing its and . When , we can obtain:
To maximize the long-term throughput, we can focus on maximizing at each time slot. Specifically, the throughput at each time slot can be optimized by adjusting and , as follows:
When is given, we can determine the optimal values of and as and , respectively. By substituting these optimal values into P2, we can transform P2 into P4:
Thus, problem P1 is decomposed into sub-problems P3 and P4, both of which are closely related to . For a given , we first solve P3 using convex optimization techniques [42]. Subsequently, P4 is transformed into a problem that can be solved by the LAMDDPG method.
4.2. Solving Problem P3
Because is not affine, and constraint (P1b) involves the optimization variable , P3 is a non-convex problem. As indicated in [41], P3 can be reformulated as follows:
where .
Following the formula , we can derive an expression for as follows:
According to constraint (P1b), we can obtain:
Thus, finding is equivalent to solving the optimization problem:
Problem P6 is a function of , where is fixed. Given the constraints (P6a), can be expressed as:
The constraints (P6b) and (P6c) can be satisfied by specifying the domain of as:
By Formulas (8) and (9), when , P6 can be denoted as:
When , P6 can be denoted as:
Both P7 and P8 involve three lower bounds and two upper bounds related to . Moreover, the objective functions in P7 and P8 impose constraints on the selection of . Therefore, it is crucial to demonstrate the feasibility of the optimization problems P7 and P8. The proof of the feasibility of problems P7 and P8 is provided in Appendix A. In addition, we prove that P7 and P8 are concave functions of in Appendix B.
The substantial number of inequality constraints in problems P7 and P8 complicates the process of obtaining an optimal solution for . Moreover, the first-order derivative of the objective function takes the form of the Lambert W function with two branches [27]. Following the steps outlined in [27], if satisfies:
otherwise , we can obtain a closed-form expression for , as follows:
where denotes the principal branch of the Lambert W function [43]. , , , , .
Based on , the optimal power allocation coefficient is achieved by:
If satisfies:
harvests energy for the whole time slot, there is no data transmission, and there is no need to specify the value of .
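The closed form above is expressed through the principal branch W0 of the Lambert W function. Purely to illustrate how such an expression is evaluated numerically (the paper's actual constants are omitted here), W0 can be computed with `scipy.special.lambertw`:

```python
import numpy as np
from scipy.special import lambertw

# Illustrative only: solve x * exp(x) = c for x using the principal branch W0,
# the same branch that appears in the closed-form expression above.
c = 1.5
x = np.real(lambertw(c, k=0))   # W0(c); the imaginary part is zero for c >= -1/e
print(x * np.exp(x))            # ~1.5, verifying the definition of W0
```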
4.3. Solving Problem P4
With and obtained, the long-term throughput maximization problem is reformulated as:
This problem is a function of , which can be optimized through the LAMDDPG method.
5. Proposed LAMDDPG Algorithm
5.1. Framework of LAMDDPG Algorithm
The optimal long-term throughput decision for P9 is modeled as a Markov Decision Process (MDP) and solved by the LAMDDPG framework. Here, acts as the agent. The LAMDDPG framework is employed to identify a sequence of optimal decisions that maximizes the long-term expected cumulative discounted reward.
As shown in Figure 4, the LAMDDPG framework contains an Actor network and an Actor target network , a Critic network and a Critic target network , and an experience replay memory. LSTM layers are integrated after the input layer of the Actor network and the Actor target network. Attention layers are adopted after the hidden layers of the Critic network and the Critic target network.
Figure 4.
The framework of LAMDDPG. Black solid lines denote the agent, environment, and network components; black dashed lines illustrate the network structure of the proposed LAMDDPG algorithm; blue solid lines depict the training phase; blue dashed lines illustrate the execution phase of the algorithm.
The current state from the environment is input to the Actor network and the Actor target network. Considering the temporal correlation of the residual energy and channel state information in this scenario, LSTM layers are introduced to capture dynamic dependencies. The LSTM layers learn from past observations, adjusting their weights and biases to predict the real-time state. The output is fed into the hidden layers of the Actor network and the Actor target network , respectively, which output the action and the target action . To enhance the exploration capability of the Actor network, noise is added to its output. Thus, the action to be chosen is , where represents noise following a Gaussian distribution.
Moreover, attention mechanisms are integrated into the Critic network and the Critic target network. The features from the hidden layers are input into the attention module. Through squeezing and excitation, the features of different channels are assigned distinct channel weights. By multiplying the features by these channel weights, they are adaptively enhanced, and the Critic network can dynamically focus on the features most relevant to the agent's current decision, thereby improving decision accuracy. The Critic network outputs the state-action value function for the current state and the action produced by the Actor network, and the Critic target network outputs . The optimal action generated by the Actor network is applied to the environment to obtain a reward , after which the environment transitions to a new state . The experience is then stored in the replay memory.
The state space, action space, and reward function are defined as follows:
State space: In this CR-NOMA-enabled EHSR network, the state space comprises the channel state information associated with the PD and , together with the residual energy of . The system state at time slot can be denoted as:
Action space: The agent selects data transmission or energy harvesting according to the current system state.
Considering the extreme case in which only transmits data for the whole time slot , the lower bound is:
On the other hand, if only harvests energy for the whole time slot , the upper bound is:
Thus, the range of the action value is very large, which brings instability to the network. The value of can be normalized as
where ; hence, the action parameter can be constrained to a suitable range to improve the stability of the networks.
Reward function: When the agent selects an action at any time slot , it will receive a corresponding reward, set as
where is the achievable data rate at .
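The action normalization described above can be sketched as follows; the bounds `a_min` and `a_max` are placeholder names for the lower and upper bounds of the energy surplus derived above, and the affine map to [-1, 1] is one common choice that matches a tanh-bounded Actor output, not necessarily the exact mapping used in the paper.

```python
def normalize_action(a, a_min, a_max):
    """Map the raw energy-surplus action into [-1, 1] to stabilize training
    (one common normalization; the paper only requires the action to be bounded)."""
    return 2.0 * (a - a_min) / (a_max - a_min) - 1.0

def denormalize_action(a_norm, a_min, a_max):
    """Inverse mapping, applied before the action is executed in the environment."""
    return a_min + 0.5 * (a_norm + 1.0) * (a_max - a_min)
```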
The LAMDDPG algorithm employs centralized training with distributed execution. During the training phase, a batch of experiences is sampled from the replay memory for training, where is the batch size.
The predicted action is fed into the Critic target network. Based on these inputs, the Critic target network computes the target value . The target value for the state-action value function is calculated by:
and are fed into the Critic network. Based on these inputs, the Critic network calculates the corresponding Q value, denoted as . The parameters of the Critic network are updated using gradient descent. The loss function for the Critic network is defined as the difference between the target value and the predicted , which is essentially the error term of the Bellman equation. Therefore, the parameters of the Critic network can be updated according to the following formula:
When updating the parameters of the Actor network, gradient ascent is employed. The parameters can be updated according to the following formula:
The parameters of both target networks are updated using a soft update method, described as follows:
where is the update coefficient.
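The target value, Critic loss, Actor update, and soft target update described above follow the standard DDPG recipe. A compact PyTorch sketch is given below; the network objects, optimizers, and batch tensors are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def lamddpg_update(actor, actor_t, critic, critic_t, batch,
                   actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One training step: Bellman target, Critic MSE loss, Actor policy
    gradient (maximize Q by minimizing -Q), and soft target updates."""
    s, a, r, s_next = batch  # tensors of shape (batch_size, ...)

    # Target value y_k = r_k + gamma * Q'(s_{k+1}, mu'(s_{k+1}))
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))

    # Critic update: minimize the Bellman error
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient ascent on Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of both target networks with coefficient tau
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```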
5.2. Neural Network Architectures and Training Parameters
In this subsection, we present the neural network architecture of the proposed LAMDDPG algorithm. In the LAMDDPG algorithm, both the Actor network and Actor target network contain an input layer, an LSTM layer, two hidden layers, and an output layer. Both the Critic network and Critic target network consist of an input layer, two hidden layers, an attention module (based on the Squeeze-and-Excitation module), and an output layer. Both the Actor and Critic networks adopt the Adam optimizer.
The input layer of the Actor (target) network takes the state as input, performing input reshaping to adapt to the input format required by the LSTM layer. The LSTM layer contains 64 units and outputs a vector with a dimension of (32, 64), which is fed into the first hidden layer. After ReLU activation, the first hidden layer outputs a vector of (32, 64), which is then passed to the second hidden layer. Following tanh activation, the second hidden layer outputs a (32, 64) vector that is fed into the output layer. After tanh activation, the output action is scaled to the range between the maximum and minimum magnitudes of the action (target action).
The Critic (target) network takes the state () and action () as inputs. These inputs undergo feature fusion in the first hidden layer, and after ReLU activation, a vector with a dimension of (32, 64) is output. This vector is fed into the second hidden layer, which outputs a feature vector of dimension (32, 64) following ReLU activation. Subsequently, the vector enters the attention module: first, average pooling is performed on the feature dimension to complete the squeezing operation; then, it passes through two fully connected layers, activated by ReLU and Sigmoid, respectively, to generate channel-wise attention weights. The vectors are then multiplied by the attention weights to obtain an output vector of (32, 64). Finally, the action value function () is output after passing through the output layer.
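Based on the layer sizes reported above (an LSTM with 64 units, hidden layers of width 64, a batch size of 32, and an SE-style attention block in the Critic), one possible PyTorch realization is sketched below; details not stated in the paper, such as the SE reduction ratio and the action scaling constant, are our assumptions, and the squeeze (average pooling) step reduces to an identity for vector features and is therefore omitted.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """State -> LSTM(64) -> FC(64, ReLU) -> FC(64, tanh) -> scaled action (sketch)."""
    def __init__(self, state_dim, action_dim, a_max=1.0):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, 64, batch_first=True)
        self.fc1, self.fc2 = nn.Linear(64, 64), nn.Linear(64, 64)
        self.out = nn.Linear(64, action_dim)
        self.a_max = a_max  # assumed action magnitude

    def forward(self, s):                    # s: (batch, state_dim)
        h, _ = self.lstm(s.unsqueeze(1))     # reshape to a length-1 sequence for the LSTM
        h = torch.relu(self.fc1(h[:, -1]))
        h = torch.tanh(self.fc2(h))
        return self.a_max * torch.tanh(self.out(h))

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over the 64 feature channels (reduction ratio assumed)."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                    # x: (batch, channels)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))  # channel-wise weights
        return x * w                         # re-weight (scale) the features

class Critic(nn.Module):
    """(state, action) -> FC(64, ReLU) -> FC(64, ReLU) -> SE attention -> Q value (sketch)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 64)  # feature fusion of state and action
        self.fc2 = nn.Linear(64, 64)
        self.se = SEBlock(64)
        self.out = nn.Linear(64, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(torch.cat([s, a], dim=-1)))
        h = torch.relu(self.fc2(h))
        return self.out(self.se(h))
```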
The complete procedure of the LAMDDPG algorithm is summarized in Algorithm 1.
| Algorithm 1 LAMDDPG algorithm | |
| Input: Environment, settings of and PDs. Output: parameters . Initialize the system parameters, the Actor network, and the Critic network | |
| Initialize the target network weight parameters | |
| Initialize the experience replay memory | |
| 1 | For episode = 1 to nep, do: |
| 2 | Initialize the noise n, the large-scale fading, and the small-scale random fading |
| 3 | Obtain the initial state s1 |
| 4 | For k = 1 to T, do: |
| 5 | Select the action ak |
| 6 | Execute action ak, receive the reward rk and the next state sk+1, and store the tuple in the experience replay memory |
| 7 | Randomly sample a batch of experiences from the replay memory |
| 8 | Compute the target value |
| 9 | Minimize the loss function to update the Critic network |
| 10 | Use the sampled policy gradient to update the Actor network |
| 11 | Softly update the target networks |
| 12 | End for |
| 13 | End for |
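Algorithm 1 can be read as the following high-level loop; `env`, `agent`, and their methods are hypothetical wrappers around the components sketched in Section 5, not the authors' code.

```python
import numpy as np

def train(env, agent, n_episodes=200, slots_per_episode=100, batch_size=32, noise_std=0.1):
    """High-level training loop mirroring Algorithm 1 (illustrative wrapper names)."""
    for ep in range(n_episodes):
        s = env.reset()                                   # initialize fading and the state s1
        for k in range(slots_per_episode):
            a = agent.act(s) + np.random.normal(0.0, noise_std)  # Actor output plus exploration noise
            s_next, r = env.step(a)                       # execute the action, observe reward and next state
            agent.replay.store(s, a, r, s_next)
            if len(agent.replay) >= batch_size:
                agent.update(agent.replay.sample(batch_size))     # Critic/Actor/soft target updates
            s = s_next
```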
6. Simulation Results
In this section, we prove the effectiveness of the proposed algorithm. The path loss model from [13] is adopted, with the following parameter settings in Table 2 [27].
Table 2.
Parameter settings.
The BS is located at the origin of the x-y plane. is deployed at (1 m, 1 m). To compare the performance of the LAMDDPG algorithm, the following baseline algorithms are introduced:
- (1) DDPG algorithm: the baseline algorithm proposed in [27].
- (2) Greedy algorithm: consumes all battery energy during data transmission and then begins energy harvesting [27].
- (3) Random algorithm: transmits data using as the transmit power, where is randomly generated between 0 and .
- (4) LMDDPG algorithm: to demonstrate the separate effects of the LSTM and attention mechanisms, we design the LMDDPG algorithm by incorporating only LSTM layers into the Actor networks of DDPG.
The channels remain unchanged within each experiment, which consists of multiple episodes. Independent and identically distributed (i.i.d.) complex Gaussian random variables with zero mean and unit variance are employed to simulate small-scale fading.
6.1. Rewards Analysis Under Different Algorithms
We first examine the data rate performance of the different algorithms when M = 2, corresponding to two PDs positioned at (0 m, 1 m) and (0 m, 1000 m). As illustrated in Figure 5, after independent and repeated experiments, the data rates achieved by the DDPG, LMDDPG, and LAMDDPG algorithms significantly surpass those of the Greedy and Random algorithms. This superiority stems from the fact that can effectively choose optimal actions under the guidance of the DRL algorithms. The incorporation of LSTM layers in the Actor networks enables LMDDPG to converge faster than DDPG. Furthermore, the integration of the attention mechanism into the Critic networks allows the LAMDDPG algorithm to converge by the 50th episode, while its data rate is 19% higher than that of LMDDPG. The combination of LSTM and attention mechanisms improves the data rate by 25% over DDPG.
Figure 5.
Rewards under different algorithms when M = 2.
Next, we investigate the data rate performance of the various algorithms under different numbers of PDs. Figure 6a,b shows scenarios with 5 PDs and 10 PDs in the network, respectively. These PDs are positioned between (0 m, 1 m) and (0 m, 1000 m). In these more complex and dynamic environments, both the Random and Greedy algorithms achieve very low data rates. As depicted in Figure 6a,b, after independent and repeated experiments, the data rates achieved by the DDPG, LMDDPG, and LAMDDPG algorithms clearly surpass those of the Greedy and Random algorithms. When the number of PDs is 5, the maximum data rate achieved by LAMDDPG is nearly 8 bps. As the number of PDs increases to 10, the maximum data rate achieved by LAMDDPG rises to nearly 10 bps. The LAMDDPG algorithm converges faster than LMDDPG and DDPG in both Figure 6a and Figure 6b. However, compared with Figure 6a, LAMDDPG converges more slowly in Figure 6b, because the larger number of PDs participating in NOMA transmission enlarges the state space of the DRL algorithms, which consequently slows convergence. On the other hand, the data rates in Figure 6a,b are higher than those in Figure 5, because with more PDs involved, relatively more time is available for to transmit data.
Figure 6.
(a) Rewards under different algorithms when M = 5. (b) Rewards under different algorithms when M = 10.
6.2. Rewards Analysis Under Different EH Model
We then investigate the performance of the different algorithms under various EH models when M = 2. In Figure 7, real indicates the EH model based on the NLPM, which is more practical in real environments, while ideal indicates the linear EH model commonly used in theoretical analyses. Two PDs are positioned at (0 m, 1 m) and (0 m, 1000 m), respectively. As depicted in Figure 7, after independent and repeated experiments, the upper bound of the data rate, 5 bps, is achieved by the DDPG algorithm with the linear EH model. Under the NLPM, the data rate of DDPG reaches nearly 4.6 bps, which is lower than that of LAMDDPG with the NLPM, which achieves nearly 5 bps. The LAMDDPG algorithm converges at approximately 50 episodes to the maximal data rate. This superiority stems from the ability of the LAMDDPG algorithm, through the combination of LSTM and attention mechanisms, to help select optimal actions.
Figure 7.
Rewards under different EH models when M = 2.
We further investigate the performance of the different algorithms under various EH models with different numbers of PDs. Here, real indicates the EH model based on the NLPM, while ideal denotes the linear EH model. Figure 8a and Figure 8b show scenarios with five and ten PDs in the network, respectively. These PDs are positioned between (0 m, 1 m) and (0 m, 1000 m). As depicted in Figure 8a,b, after independent and repeated experiments, the upper bounds of the data rates are 7 bps and 8 bps, respectively, achieved by the DDPG algorithm with the linear EH model. The maximum data rate increases with the number of PDs participating in NOMA transmission, because more PDs provide more opportunities for transmitting data. Under the NLPM, the data rates of DDPG in Figure 8a,b reach nearly 5.8 bps and 6 bps, respectively, which are lower than those of LAMDDPG with the NLPM. LAMDDPG with the NLPM achieves the maximum data rate in both Figure 8a and Figure 8b. Compared with Figure 8a, when there are ten PDs, the rapid growth of the state space caused by the larger number of PDs participating in NOMA transmission results in slower convergence and increased fluctuation under the NLPM in Figure 8b.
Figure 8.
(a) Rewards under different EH models when M = 5. (b) Rewards under the different EH models when M = 10.
6.3. Mechanism Analysis of Performance Improvement
Through ablation experiments (Figure 5, Figure 6, Figure 7 and Figure 8), we compare the LMDDPG algorithm with an added LSTM layer, the LAMDDPG algorithm (integrating LSTM and attention mechanisms), and DDPG, Greedy, and Random algorithms. The simulation results demonstrate that LAMDDPG achieves faster convergence and higher cumulative rewards across scenarios with varying numbers of PDs and different energy-harvesting models.
In our scenario, inherent dependencies exist between the CSI and the remaining energy of the EHSR. The introduced LSTM layer leverages its hidden state to encode historical data, enabling the agent to capture implicit states that are critical for decision-making. Thus, LMDDPG outperforms the DDPG, Greedy, and Random algorithms in reward accumulation.
By integrating the attention module, the features extracted by the hidden layers of the Critic (target) network are squeezed and activated by the Sigmoid function, which mitigates the overestimation bias of the Q-values. Meanwhile, through its squeeze, excitation, and scaling steps, the attention module adaptively assigns weights to the extracted features, emphasizing those that contribute more to Q-value estimation while suppressing redundant ones. This enhances the accuracy of the Q-value predictions, reduces the agent's ineffective exploration, and thus accelerates convergence while boosting cumulative rewards.
6.4. Rewards Analysis Under Different Numbers of PDs
Furthermore, we deploy more PDs to investigate the maximum data rate under different numbers of PDs and different EH models. As depicted in Figure 9, under all algorithms the average data rate first increases and then decreases as the number of PDs grows. This is because when the number of PDs exceeds two, the system allocates more time slots, thereby enhancing the data rate. However, when the number of PDs surpasses 10, the data rate declines sharply owing to the strong inter-device interference caused by the increased number of PDs. As seen from Figure 10, when PD = 10, is more inclined to select actions for data transmission than to conserve energy compared with the scenario with 15 PDs; in other words, the action corresponds to the energy surplus, and the probability of the energy surplus being small is relatively high. As seen from Figure 11, when the number of PDs is 15, the probability of the energy surplus being large is relatively high, indicating that tends to prioritize energy harvesting over data transmission to manage the increased interference, which further contributes to the decline in data rate. Regarding the algorithms, LAMDDPG achieves a higher data rate than DDPG under the NLPM, indicating that the LSTM and attention mechanisms are more effective in managing complex environments. Moreover, the system achieves the ideal values under the linear EH model because the linear model provides a more favorable environment. The optimal data rate is achieved when the number of PDs is ten, highlighting the importance of balancing the number of PDs to maintain efficient network performance.
Figure 9.
Rewards under different numbers of PDs.
Figure 10.
Action selection when PD = 10.
Figure 11.
Action selection when PD = 15.
7. Conclusions
In this paper, we considered the long-term throughput maximization problem for an EHSR IoT device in a CR-NOMA-enabled IoT network comprising multiple primary IoT devices, a base station, and the EHSR device. To better reflect practical applications, we adopted a piece-wise linear function-based NLPM. We addressed this optimization problem by integrating convex optimization with the LAMDDPG algorithm. Experimental results demonstrate that the LSTM layer in the Actor network can predict the channel state information from historical data, effectively mitigating the agent's partial observability problem, while the channel-attention SE block in the Critic network mitigates Q-value overestimation through its squeeze, excitation, and scaling operations. The synergy of these two mechanisms accelerates exploration, improves reward acquisition, and speeds up convergence. Moreover, we identified the optimal number of PDs for maintaining efficient network performance under the NLPM, which provides practical guidance for EHSR applications. However, this work assumes ideal SIC in the EH-CR-NOMA symbiotic system. In future work, we will extend our research to non-ideal SIC scenarios and further explore improved variants of other DRL algorithms (e.g., PPO, TD3) to address the throughput maximization problem in EH-CR-NOMA symbiotic networks with a nonlinear energy harvesting model.
Author Contributions
Conceptualization, Y.Z. and L.K.; methodology, L.K.; investigation, Y.Z.; writing—original draft preparation, L.K.; writing—review and editing, Y.Z. and L.K.; visualization, J.S.; supervision, D.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by National Natural Science Foundation of China (U23A20627); Key R&D Program Project of Shanxi Province (202302150401004); Shanxi Key Laboratory of Wireless Communication and Detection (2025002); Doctoral Research Start up Fund of Taiyuan University of Science and Technology (20222118), and Scientific Research Startup Fund of Shanxi University of Electronic Science and Technology (2025KJ027).
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| IoT | Internet of Things |
| EH | Energy harvesting |
| SR | Symbiotic radio |
| NOMA | Non-orthogonal multiple access |
| CR | Cognitive radio |
| LSTM | Long short-term memory |
| DDPG | Deep Deterministic Policy Gradient |
| SIC | Successive interference cancellation |
| QoS | Quality of service |
| SUs | Secondary users |
| PUs | Primary users |
| SWIPT | Simultaneous wireless information and power transfer |
| TS | Time-switching |
| PS | Power-splitting |
| CSI | Channel state information |
| DRL | Deep reinforcement learning |
| PPO | Proximal Policy Optimization |
| CER-DDPG | Combined experience replay with DDPG |
| NLPMs | Nonlinear power models |
| PER | Prioritized experience replay |
| TDMA | Time-Division Multiple Access |
| PDs | Primary devices |
Appendix A. Proof That Problems P7 and P8 Are Feasible
Since denotes the energy surplus at time slot , when transmits data for the whole time slot, i.e., , using the maximum power , achieves its lowest value . Thus satisfies the following constraint:
The energy supply for data transmission comes from the remaining energy at . The maximum energy consumed during this process is denoted by , which cannot exceed ; that is, . Thus, the domain of can be denoted as:
On the other hand, when harvests energy for the whole time slot, i.e., , cannot exceed the constraint denoted as:
Due to the battery capacity constraint of , and considering that the remaining energy at is , the upper bound for is:
Considering the condition of and , the bounds for can be obtained by combining Formula (A2) with (A4).
(1) When , we rewrite Formula (A4) as , thus , the lower bound of constraint (P7d) does not conflict with the upper bound in (P7a). According to Formula (A2), , so , then , the lower bound of constraint (P7b) does not conflict with the upper bound in (P7d). Furthermore, according to Formula (A2) , then , we can obtain , so , the lower bound of constraint (P7c) does not conflict with the upper bound in (P7d). According to Formula (A4), , then , apparently,, the lower bound of constraint (P7c) does not conflict with the upper bound in (P7a). Then we know that the constraints of P7 do not conflict with each other, and the set defined by the constraints of P7 is not empty. The domain of the objective function of problem P7 satisfies . The domain for constraint (P7a) satisfies
Combining with Formula (6), we can obtain , so , the form is changed into
Thus, the intersection of the domain for the constraint and for the objective function is not empty.
(2) When we rewrite Formula (A4) as , thus , the lower bound of constraint (P8d) does not conflict with the upper bound in (P8a). According to Formula (A2), , , thus , the lower bound of constraint (P8b) does not conflict with the upper bound in (P8d). Furthermore, according to Formula (A2) , then , we can obtain , so , the lower bound of constraint (P8c) does not conflict with the upper bound in (P8d). According to Formula (A4), , then , apparently, , the lower bound of constraint (P8c) does not conflict with the upper bound in (P8a). Then we know that the constraints of P8 do not conflict with each other, and the set defined by the constraints of P8 is not empty. The domain of the objective function of problem P8 satisfies . The domain for constraint (P8a) satisfies
Combining with Formula (6), we can obtain , so , the form is changed into ; thus, the intersection of the domain for the constraint and for the objective function is not empty.
As a result, optimization problems P7 and P8 are feasible.
Appendix B. Proof That Problems P7 and P8 Are Concave Functions
Since the objective function of P7 and P8 can be denoted as:
We know is in the domain of , by combining Formulas (A1) with (A5), and (A3) with (A7), respectively.
To simplify expressions, we define , and , then the objective function can be denoted as
Similar to the steps in [27], we can obtain the first-order derivative and the second-order derivative of to prove that the objective functions of P7 and P8 are concave functions.
References
- Donta, P.K.; Srirama, S.N.; Amgoth, T.; Annavarapu, C.S.R. Survey on recent advances in IoT application layer protocols and machine learning scope for research directions. Digit. Commun. Netw. 2022, 8, 727–744. [Google Scholar] [CrossRef]
- Andrews, J.G.; Buzzi, S.; Choi, W.; Hanly, S.V.; Lozano, A.; Soong, A.K.; Zhang, J.C. What will 5G be? IEEE J. Sel. Areas Commun. 2014, 32, 1065–1082. [Google Scholar] [CrossRef]
- Makki, B.; Chitti, K.; Behravan, A.; Alouini, M.-S. A survey of NOMA: Current status and open research challenges. IEEE Open J. Commun. Soc. 2020, 1, 179–189. [Google Scholar] [CrossRef]
- Lei, H.; She, X.; Park, K.-H.; Ansari, I.S.; Shi, Z.; Jiang, J.; Alouini, M.-S. On secure CDRT with NOMA and physical-layer network coding. IEEE Trans. Commun. 2023, 71, 381–396. [Google Scholar] [CrossRef]
- Kilzi, A.; Farah, J.; Nour, C.A.; Douillard, C. Mutual successive interference cancellation strategies in NOMA for enhancing the spectral efficiency of CoMP systems. IEEE Trans. Commun. 2020, 68, 1213–1226. [Google Scholar] [CrossRef]
- Li, X.; Zheng, Y.; Khan, W.U.; Zeng, M.; Li, D.; Ragesh, G.K.; Li, L. Physical layer security of cognitive ambient backscatter communications for green Internet-of-Things. IEEE Trans. Green Commun. Netw. 2021, 5, 1066–1076. [Google Scholar] [CrossRef]
- Chen, B.; Chen, Y.; Chen, Y.; Cao, Y.; Zhao, N.; Ding, Z. A novel spectrum sharing scheme assisted by secondary NOMA relay. IEEE Wirel. Commun. Lett. 2018, 7, 732–735. [Google Scholar] [CrossRef]
- Do, D.-T.; Le, A.-T.; Lee, B.M. NOMA in Cooperative Underlay Cognitive Radio Networks Under Imperfect SIC. IEEE Access 2020, 8, 86180–86195. [Google Scholar] [CrossRef]
- Ali, Z.; Khan, W.U.; Sidhu, G.A.S.; K, N.; Li, X.; Kwak, K.S.; Bilal, M. Fair power allocation in cooperative cognitive systems under NOMA transmission for future IoT networks. Alex. Eng. J. 2022, 61, 575–583. [Google Scholar] [CrossRef]
- Jiang, Q.; Zhang, C.; Zheng, W.; Wen, X. Research on Delay DRL in Energy-Constrained CR-NOMA Networks based on Multi-Threads Markov Reward Process. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Nanjing, China, 29 March–1 April 2021. [Google Scholar] [CrossRef]
- Elmadina, N.N.; Saeid, E.; Mokhtar, R.A.; Saeed, R.A.; Ali, E.S.; Khalifa, O.O. Performance of Power Allocation Under Priority User in CR-NOMA. In Proceedings of the 2023 IEEE 3rd International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering (MI-STA), Benghazi, Libya, 21–23 May 2023. [Google Scholar] [CrossRef]
- Alhamad, R.; Boujemâa, H. Optimal power allocation for CRN-NOMA systems with adaptive transmit power. Signal Image Video Process. 2020, 14, 1327–1334. [Google Scholar] [CrossRef]
- Abidrabbu, S.S.; Arslan, H. Energy-Efficient Resource Allocation for 5G Cognitive Radio NOMA Using Game Theory. In Proceedings of the 2021 IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 29 March–1 April 2021. [Google Scholar] [CrossRef]
- Xie, N.; Tan, H.; Huang, L.; Liu, A.X. Physical-layer authentication in wirelessly powered communication networks. IEEE/ACM Trans. Netw. 2021, 29, 1827–1840. [Google Scholar] [CrossRef]
- Huang, J.; Xing, C.; Guizani, M. Power allocation for D2D communications with SWIPT. IEEE Trans. Wirel. Commun. 2020, 19, 2308–2320. [Google Scholar] [CrossRef]
- Liu, Y.; Ding, Z.; Elkashlan, M.; Poor, H.V. Cooperative non-orthogonal multiple access with simultaneous wireless information and power transfer. IEEE J. Sel. Areas Commun. 2016, 34, 938–953. [Google Scholar] [CrossRef]
- Mazhar, N.; Ullah, S.A.; Jung, H.; Nadeem, Q.-U.-A.; Hassan, S.A. Enhancing spectral efficiency in IoT networks using deep deterministic policy gradient and opportunistic NOMA. In Proceedings of the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall), Washington, DC, USA, 7–10 October 2024. [Google Scholar] [CrossRef]
- Yang, J.; Cheng, Y.; Peppas, K.P.; Mathiopoulos, P.T.; Ding, J. Outage performance of cognitive DF relaying networks employing SWIPT. China Commun. 2018, 15, 28–40. [Google Scholar] [CrossRef]
- Song, Z.; Wang, X.; Liu, Y.; Zhang, Z. Joint Spectrum Resource Allocation in NOMA-based Cognitive Radio Network with SWIPT. IEEE Access 2019, 7, 89594–89603. [Google Scholar] [CrossRef]
- Yang, C.; Lu, W.; Huang, G.; Qian, L.; Li, B.; Gong, Y. Power Optimization in Two-way AF Relaying SWIPT based Cognitive Sensor Networks. In Proceedings of the 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall), Victoria, BC, Canada, 18 November–16 December 2020. [Google Scholar] [CrossRef]
- Liu, X.; Zheng, K.; Chi, K.; Zhu, Y.-H. Cooperative Spectrum Sensing Optimization in Energy-Harvesting Cognitive Radio Networks. IEEE Trans. Wirel. Commun. 2020, 19, 7663–7676. [Google Scholar] [CrossRef]
- Wang, Y.; Chen, S.; Wu, Y.; Zhao, C. Maximizing Average Throughput of Cooperative Cognitive Radio Networks Based on Energy Harvesting. Sensors 2022, 22, 8921. [Google Scholar] [CrossRef]
- Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.-C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar] [CrossRef]
- Umeonwuka, O.O.; Adejumobi, B.S.; Shongwe, T. Deep Learning Algorithms for RF Energy Harvesting Cognitive IoT Devices: Applications, Challenges and Opportunities. In Proceedings of the 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET), Prague, Czech Republic, 20–22 July 2022. [Google Scholar] [CrossRef]
- Du, K.; Xie, X.; Shi, Z.; Li, M. Joint Time and Power Control of Energy Harvesting CRN Based on PPO. In Proceedings of the 2022 Wireless Telecommunications Symposium (WTS), Pomona, CA, USA, 6–8 April 2022. [Google Scholar] [CrossRef]
- Al Rabee, F.T.; Masadeh, A.; Abdel-Razeq, S.; Salameh, H.B. Actor–Critic Reinforcement Learning for Throughput-Optimized Power Allocation in Energy Harvesting NOMA Relay-Assisted Networks. IEEE Open J. Commun. Soc. 2024, 5, 7941–7953. [Google Scholar] [CrossRef]
- Ding, Z.; Schober, R.; Poor, H.V. No-Pain No-Gain: DRL Assisted Optimization in Energy-Constrained CR-NOMA Networks. IEEE Trans. Commun. 2021, 69, 5917–5932. [Google Scholar] [CrossRef]
- Shi, Z.; Xie, X.; Lu, H.; Yang, H.; Cai, J.; Ding, Z. Deep Reinforcement Learning-Based Multidimensional Resource Management for Energy Harvesting Cognitive NOMA Communications. IEEE Trans. Commun. 2022, 70, 3110–3125. [Google Scholar] [CrossRef]
- Ullah, A.; Zeb, S.; Mahmood, A.; Hassan, S.A.; Gidlund, M. Opportunistic CR-NOMA Transmissions for Zero-Energy Devices: A DRL-Driven Optimization Strategy. IEEE Wirel. Commun. Lett. 2023, 12, 893–897. [Google Scholar] [CrossRef]
- Du, K.; Xie, X.; Shi, Z.; Li, M. Throughput maximization of EH-CRN-NOMA based on PPO. In Proceedings of the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26–28 April 2023. [Google Scholar] [CrossRef]
- Zhou, F.; Chu, Z.; Wu, Y.; Al-Dhahir, N.; Xiao, P. Enhancing PHY security of MISO NOMA SWIPT systems with a practical non-linear EH model. In Proceedings of the 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 20–24 May 2018. [Google Scholar] [CrossRef]
- Kumar, D.; Singya, P.K.; Choi, K.; Bhatia, V. SWIPT enabled cooperative cognitive radio sensor network with non-linear power amplifier. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 884–896. [Google Scholar] [CrossRef]
- Mohammed, A.A.; Baig, M.W.; Sohail, M.A.; Ullah, S.A.; Jung, H.; Hassan, S.A. Navigating boundaries in quantifying robustness: A DRL expedition for non-linear energy harvesting IoT networks. IEEE Commun. Lett. 2024, 28, 2447–2451. [Google Scholar] [CrossRef]
- Ullah, S.A.; Mahmood, A.; Nasir, A.A.; Gidlund, M.; Hassan, S.A. DRL-driven optimization of a wireless powered symbiotic radio with non-linear EH model. IEEE Open J. Commun. Soc. 2024, 5, 5232–5247. [Google Scholar] [CrossRef]
- Li, K.; Ni, W.; Dressler, F. LSTM-Characterized Deep Reinforcement Learning for Continuous Flight Control and Resource Allocation in UAV-Assisted Sensor Network. IEEE Internet Things J. 2022, 9, 4179–4189. [Google Scholar] [CrossRef]
- He, X.; Mao, Y.; Liu, Y.; Ping, P.; Hong, Y.; Hu, H. Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks. Digit. Commun. Netw. 2024, 10, 109–116. [Google Scholar] [CrossRef]
- Ullah, I.; Singh, S.K.; Adhikari, D.; Khan, H.; Jiang, W.; Bai, X. Multi-Agent Reinforcement Learning for task allocation in the Internet of Vehicles: Exploring benefits and paving the future. Swarm Evol. Comput. 2025, 94, 101878. [Google Scholar] [CrossRef]
- Alhartomi, M.; Salh, A.; Audah, L.; Alzahrani, S.; Alzahmi, A. Enhancing Sustainable Edge Computing Offloading via Renewable Prediction for Energy Harvesting. IEEE Access 2024, 12, 74011–74023. [Google Scholar] [CrossRef]
- Choi, J.; Lee, B.-J.; Zhang, B.-T. Multi-focus Attention Network for Efficient Deep Reinforcement Learning. In Proceedings of the AAAI 2017 Workshop on What’s Next for AI in Games, AAAI 2017, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar] [CrossRef]
- Zhou, X.; Zhang, R.; Ho, C.K. Wireless Information and Power Transfer: Architecture Design and Rate-Energy Tradeoff. IEEE Trans. Commun. 2013, 61, 4754–4767. [Google Scholar] [CrossRef]
- Yuan, T.; Liu, M.; Feng, Y. Performance Analysis for SWIPT Cooperative DF Communication Systems with Hybrid Receiver and Non-Linear Energy Harvesting Model. Sensors 2020, 20, 2472. [Google Scholar] [CrossRef]
- Boyd, S.; Vandenberghe, L. Convex Optimization, 1st ed.; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series and Products, 6th ed.; Academic Press: New York, NY, USA, 2000. [Google Scholar]