Review Reports
- Yanjun Zhu1,2,
- Lin Kang2,* and
- Jinrong Su3
- et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes an LSTM-Attention-driven DDPG approach for throughput maximization in an energy harvesting symbiotic radio system. This is a real-world problem and some comments are as follows:
1. Some abbreviations are defined more than once, e.g., DDPG in line 78 and line 171.
2. The authors should add a separate and organized related work section, instead of literature review in the introduction section without using subsections.
3. Energy Harvesting is introduced in line 68. However, why it is difficult to maximize throughput because of the introduction of energy harvesting is not mentioned.
4. Based on the literature review in the introduction, DDPG has been widely used in many similar studies and LSTM is also a mature technique. The authors should justify their novelty to support their claimed contributions.
5. DRL has been widely used in relevant problems and more discussion should be given with references including "Multi-Agent Reinforcement Learning for Task Allocation in the Internet of Vehicles: Exploring Benefits and Paving the Future" and "Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks".
6. Figure 1. System model should be further improved with a higher resolution. And explain what is the relationship between the upper and lower subfigures.
7. "Figure1" in line 190 should be "Figure 1".
8. Add a table summarizing all mathematical symbols and their definitions in the beginning of the system model section.
9. "Where" in line 208 should be "where". Remove the blank line above line 208 and the space in the beginning of line 208. Fix similar problems afterwards.
10. Use the same font in "Figure 2. Data transmission mode" as in the main text.
11. Add an equation number in line 228 for P1. Check and fix similar problems.
12. Improve the section name of "3. Problem Solving", e.g., Proposed Methodology.
13. There are too many equations in Page 10 without numbering.
14. Give more discussion for the feasibility of the claim "This problem is a function of sur Ek , which can be optimized through LAMDDPG method." in line 310.
15. "4.1. Preliminaries of LSTM" and "Figure 3. The structure of LSTM layer" can be removed, because it can be easily found in a textbook.
16. Improve the resolution and font of "Figure 4. The structure of Attention mechanism" and "Figure 5. The framework of LAMDDPG"
17. Use a standard algorithm formulation for LAMDDPG algorithm, with input and output. And put the algorithm in the same page, instead of two pages.
18. Check the line numbers in Page 17. They are mixed with the table.
19. Add the table numbering and caption in Section 5. SIMULATION RESULTS.
20. Besides the standard DDPG model, add more baselines from the literature in the experiments.
21. Improve the result figures. Use vector images if possible.
Author Response
Point-by-point response
We sincerely thank the editor and all reviewers for their constructive remarks and useful suggestions, which have significantly improved the quality of the manuscript. Each comment and suggested revision brought forward by the reviewers has been carefully considered and incorporated. Below, we respond to the reviewers' comments point by point and indicate the corresponding revisions.
Reviewer 1
This paper proposes an LSTM-Attention-driven DDPG approach for throughput maximization in an energy harvesting symbiotic radio system. This is a real-world problem and some comments are as follows:
1.Comments:
Some abbreviations are defined more than once, e.g., DDPG in line 78 and line 171.
Reply:
Thank you for carefully pointing this out. We agree and have corrected this oversight.
We have removed the redundant definition of DDPG at line 171.
As we have revised the introduction section, the abbreviation DDPG is now used at line 78.
The revised abbreviation is highlighted in red font at line 78 and line 171 as follows:
Line 78 “ DDPG-based opportunistic NOMA access method, the spectral efficiency of EH-enabled secondary IoT devices has been enhanced [17].”
Line 171 "bined DDPG, named LAMDDPG algorithm, to optimize long-term throughput for secondary device."
2.Comments:
The authors should add a separate and organized related work section, instead of literature review in the introduction section without using subsections.
Reply:
Thank you for your valuable suggestion. We have streamlined the introduction and added a separate related work section.
(1)The introduction has been streamlined to concisely present the research background, problem statement, key challenges, and our main contributions, highlighted in red font in the revised manuscript. The details are as follows:
After the sentence “thereby offering a sustainable approach to powering IoT networks. ” we have added
Establishing an EH-enabled CR-NOMA IoT network where symbiotic secondary devices (SD) share primary device’s (PD) spectrum and harvest energy from emitted signals of PDs can greatly enhance the spectrum efficiency and sustainability of IoT applications.
The randomness of energy harvesting from a wireless environment complicates the acquisition of precise channel state information (CSI). Deep reinforcement learning (DRL) algorithms offer solutions by providing optimal spectrum and power allocation in such networks. However, the inherent nonlinearity of practical energy harvesting circuits, particularly those characterized by piecewise functions, presents challenges for DRL-based spectrum access and power allocation in achieving long-term throughput enhancement.
In this study, we explore an EH symbiotic radio (EHSR) IoT network where the symbiotic SD is permitted to access the spectrum assigned to the PD by CR and to harvest energy from the emitted signals of PDs via a nonlinear power model (NLPM). We determine the optimal power and time allocation factor for the symbiotic SD using a convex optimization tool. Then, we design a long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient (DDPG) algorithm (LAMDDPG) to solve the long-term throughput maximization problem for the SD. Moreover, we find the optimal number of PDs to maintain efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications.
(2) We have added a new Related Works section (Section 2) with clear thematic subsections. We have also revised certain sentences to improve clarity and flow. These revisions are highlighted in red font in the revised manuscript. The details are as follows:
2.1. EH in CR-NOMA networks
Simultaneous wireless information and power transfer (SWIPT) has become a key enabler for sustainable IoT. By integrating time-switching (TS) or power-splitting (PS) protocols, SWIPT allows devices to decode information and harvest energy simultaneously [15].
Through convex optimization and a DDPG-based opportunistic NOMA access method, the spectral efficiency of EH-enabled secondary IoT devices has been enhanced [17].
Furthermore, by combining Lagrangian dual decomposition with the bisection method, system throughput was enhanced in SWIPT-enabled cognitive sensor networks [20]. Besides, in other CR networks, SDs obtain supplementary energy by harvesting RF signals from PDs.
A k-out-of-M fusion spectrum access strategy was developed to maximize the achievable throughput in such EH SR network with a mobile SD [21].
To better harness EHSR for IoT networks, it is important to design efficient resource allocation algorithms to provide optimal spectrum access and power.
accurate CSI is essential. However, the randomness of energy harvesting from a wireless environment complicates the acquisition of precise CSI. Moreover, maximizing throughput faces significant challenges due to the tight coupling between the data transmission of symbiotic secondary devices and their instantaneous energy availability, compounded by the stochastic nature of energy conversion that profoundly affects transmission rates. DRL algorithms can provide optimal resource allocation decisions for devices in such networks by leveraging the interactions between agents and the environment, thereby enhancing overall system throughput [23].
some researchers have explored NLPMs to enhance the throughput in EH-enabled CR-NOMA networks [31].
Furthermore, Alhartomi et al. designed an LSTM-driven actor critic framework to solve the service offloading problem with faster convergence rate than DDPG in an IoT network [35].
2.3. Motivations
Then, we design an LSTM-attention-mechanism combined DDPG algorithm (LAMDDPG) to solve the long-term throughput maximization problem.
then these optimal parameters are fed into the LAMDDPG framework to optimize long-term throughput for secondary device.
We also propose an LSTM combined DDPG (LMDDPG) algorithm by incorporating LSTM layers into the actor networks of DDPG.
(3) We have also adjusted the order of References 20 and 21 to improve clarity and flow. These revisions are highlighted in red font in the revised manuscript. The details are as follows:
20. Yang, C.; Lu, W.; Huang, G.; Qian, L.; Li, B.; Gong, Y. Power Optimization in Two-way AF Relaying SWIPT based Cognitive Sensor Networks. In Proceedings of the 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall), Victoria, BC, Canada, 18 November–16 December 2020. DOI: 10.1109/VTC2020-Fall49728.2020.9348749.
21. Liu, X.; Zheng, K.; Chi, K.; Zhu, Y.-H. Cooperative Spectrum Sensing Optimization in Energy-Harvesting Cognitive Radio Networks. IEEE Trans. Wireless Commun. 2020, 19, 7663–7676. DOI: 10.1109/TWC.2020.3015260.
(4) We have adjusted the section numbers and marked these changes in red font in the revised manuscript.
3. Comments:
Energy Harvesting is introduced in line 68. However, why it is difficult to maximize throughput because of the introduction of energy harvesting is not mentioned.
Reply:
Thank you for your valuable suggestion.
After the sentence "Although the aforementioned algorithms can effectively enhance system throughput in EHSR IoT networks," we have added the reasons why it is difficult to maximize throughput because of the introduction of energy harvesting. These revisions are highlighted in red font in the revised manuscript. The details are as follows:
accurate CSI is essential. However, the randomness of energy harvesting from a wireless environment complicates the acquisition of precise CSI. Moreover, maximizing throughput faces significant challenges due to the tight coupling between the data transmission of symbiotic secondary devices and their instantaneous energy availability, compounded by the stochastic nature of energy conversion that profoundly affects transmission rates.
4.Comments:
Based on the literature review in the introduction, DDPG has been widely used in many similar studies and LSTM is also a mature technique. The authors should justify their novelty to support their claimed contributions.
Reply:
Thank you for your valuable suggestion. We acknowledge that DDPG and LSTM are indeed mature techniques.
Our novelty is first reflected in the scenario innovation: maximizing the long-term throughput of the secondary device (SD) in a CR-NOMA cognitive radio coexistence scenario where a piecewise-function-based nonlinear energy harvesting (EH) model is adopted.
The second novelty lies in the methodological innovation for solving the optimization problem: a hierarchical optimization approach is adopted, where the derived optimal power and time allocation factors via convex optimization are embedded into the deep reinforcement learning (DRL) framework to achieve the long-term optimal throughput of the secondary device (SD) under dynamically changing environments.
The third novelty is embodied in the proposed long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient, named LAMDDPG algorithm. By equipping the Actor with LSTM to capture temporal state and enhancing the Critic with channel-wise attention mechanism, namely Squeeze-and-Excitation Block, for precise Q-evaluation, the LAMDDPG algorithm achieves a faster convergence rate and optimal long-term throughput compared to the baseline algorithms.
We have rewritten the abstract, and the revised version is as follows:
Massive Internet of Things (IoT) deployments face critical spectrum crowding and energy scarcity challenges. Energy harvesting (EH) symbiotic radio (SR), where secondary devices share spectrum and harvest energy from non-orthogonal multiple access (NOMA)-based primary systems, offers a sustainable solution. We consider long-term throughput maximization in an EHSR network with a non-linear EH model. To solve this non-convex problem, we design a two-layered optimization algorithm combining convex optimization with a deep reinforcement learning (DRL) framework. The derived optimal power, time allocation factor and the time-varying environment state are fed into the proposed long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient algorithm, named LAMDDPG, to achieve the optimal long-term throughput. Simulation results demonstrate that, by equipping the Actor with LSTM to capture temporal states and enhancing the Critic with a channel-wise attention mechanism, namely the Squeeze-and-Excitation block, for precise Q-evaluation, the LAMDDPG algorithm achieves a faster convergence rate and optimal long-term throughput compared to the baseline algorithms. Moreover, we find the optimal number of PDs to maintain efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications.
5.Comments:
DRL has been widely used in relevant problems and more discussion should be given with references including "Multi-Agent Reinforcement Learning for Task Allocation in the Internet of Vehicles: Exploring Benefits and Paving the Future" and "Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks".
Reply:
Thank you for your valuable suggestion.
- We have added references regarding the use of multi-agent deep reinforcement learning algorithms for resource allocation to the Related Works section, after the sentence "Furthermore, Alhartomi et al. designed an LSTM-driven actor critic framework to solve the service offloading problem with faster convergence rate than DDPG in an IoT network [35]." These revisions are highlighted in red font in the revised manuscript. The details are as follows:
In heterogeneous edge networks, a distributed multi-agent PPO resource allocation algorithm was proposed to maximize the sum rate [36]. When tasks are offloaded in fast-varying vehicular networks, various multi-agent reinforcement learning frameworks have been leveraged for efficient resource allocation [37].
- In the motivation section, we explained how the long-term system throughput maximization problem in symbiotic networks is solved via the single-agent deep reinforcement learning algorithm, after the sentence "Therefore, it is essential to develop effective plans for EHSR to enhance long-term throughput." The revision is highlighted in red font in the revised manuscript. The details are as follows:
The plans can serve as a decision-making process to solve it via the single-agent DRL algorithm.
- We have also added References 36 and 37 to the reference list. These revisions are highlighted in red font in the revised manuscript. The details are as follows:
- He, X.; Mao, Y.; Liu, Y.; et al. Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks. Digit. Commun. Netw. 2024, 2, 109–116. DOI: 10.1016/j.dcan.2023.02.018.
- Ullah, I.; Singh, S.K.; Adhikari, D.; et al. Multi-Agent Reinforcement Learning for task allocation in the Internet of Vehicles: Exploring benefits and paving the future. Swarm Evol. Comput. 2025, 94, 101878. DOI: 10.1016/j.swevo.2025.101878.
Additionally, we adjusted the numbering of other references in the text and marked these changes in red font in text and Reference.
6.Comments:
Figure 1. System model should be further improved with a higher resolution. And explain what is the relationship between the upper and lower subfigures.
Reply:
Thank you for your valuable suggestion.
Figure 1 has been replaced with a high-resolution image. The revision is highlighted in yellow in the revised manuscript.
The upper and lower subfigures respectively depict the secondary device sharing the spectrum of an arbitrary primary user (1...M) for data transmission and for energy harvesting. Which primary user's spectrum the secondary device accesses in each time slot for data transmission and energy harvesting depends on the agent's decision.
7.Comments:
"Figure1" in line 190 should be "Figure 1"
Reply:
Thank you for pointing out this typographical error. We have corrected it. The revised part is highlighted in yellow at line 190; the details are as follows:
As illustrated in Figure 1,
8.Comments:
Add a table summarizing all mathematical symbols and their definitions in the beginning of the system model section.
Reply:
Thank you for your valuable suggestion.
We have added a table summarizing all mathematical symbols and their definitions in the beginning of the system model section. The revision is highlighted in red font in the revised manuscript.
Table 1. Summary of Main Notations

| Notation | Description |
| --- | --- |
| M | Number of primary devices |
| | EH symbiotic radio SD |
| | Any given time slot |
| | Channel gain between BS and |
| | Channel gain between and |
| | Channel gain between and BS |
| | Time allocation factor |
| | Duration for every time slot |
| | Transmit power of |
| | Energy harvesting threshold |
| | Energy harvesting efficiency |
| | Residual energy in the battery of |
| | Maximum battery capacity of |
| | Transmit power of |
| | Maximum transmit power of |
| | Discount rate |
| | Energy surplus |
| | Principal branch of the Lambert W function |
| | Current state |
| | Parameters of the Actor and target Actor networks |
| | Noise |
| | Action |
| | Target action |
| | Features from hidden layer of Critic |
| | Parameters of the Critic and Critic target networks |
| | Output of Critic network |
| | Output of Critic target network |
| | System reward |
| | New state |
| | Experience tuple |
| | Weight |
| | Batch size |
| | Target value |
| | Update coefficient |
We also added a sentence in red font after the sentence "The remaining seconds are utilized for energy harvesting." to denote the addition in the text, as follows:
We summarize the main notations of this paper in Table 1.
9.Comments:
"Where" in line 208 should be "where". Remove the blank line above line 208 and the space in the beginning of line 208. Fix similar problems afterwards.
Reply:
Thank you for pointing out these formatting issues. We have carefully corrected them. Specifically, at line 208, we have made the following changes (highlighted in red font):
- Changed "Where" to "where" to correct the capitalization error
- Removed the redundant blank line above line 208
- Deleted the leading space at the beginning of line 208 to ensure proper indentation
The details are as follows
where is the transmit power of , denotes the energy harvesting threshold of the EH circuit, is the energy harvesting efficiency.
We also revised similar problems afterwards: (highlighted in red font):
- where denotes the maximum transmit power of
- where
- where denotes the principal branch of the Lambert W function [43].
- where is the update coefficient
- where
- where is the achievable data rate at
10.Comments:
Use the same font in "Figure 2. Data transmission mode" as in the main text.
Reply:
Thank you for your valuable suggestion. We have corrected it using red font.
Below Figure 2, the details are as follows:
Figure 2. Data transmission mode
11. Comments:
Add an equation number in line 228 for P1. Check and fix similar problems.
Reply:
Thank you for your valuable suggestion.
We have added equation number for every Problem and rearranged the order of the equations, and these changes are highlighted in red font.
12.Comments:
Improve the section name of "3. Problem Solving", e.g., Proposed Methodology.
Reply:
Thank you for your valuable suggestion. We have changed the section name "Problem Solving" to "Problem Formulation and Decomposition" (now Section 4, highlighted in red font).
13.Comments:
There are too many equations in Page 10 without numbering.
Reply:
Thank you for your valuable suggestion. The expressions on Page 10 serve as simplified notations to improve the readability of complex formulas [19], rather than new mathematical statements. Since they are simplified notations, they are not numbered.
14.Comments:
Give more discussion for the feasibility of the claim "This problem is a function of sur Ek , which can be optimized through LAMDDPG method." in line 310.
Reply:
Thank you for your valuable suggestion.
To extend the network lifetime, the long-term throughput maximization in this paper is influenced by the surplus energy of the secondary device (sur Ek). When , in the current time slot, the secondary device can adjust the power and time allocation factor to increase the current transmission rate, thereby enhancing the long-term throughput. In the long term, the larger sur Ek is, the more the overall long-term throughput will increase. We explain the optimization process by adding a logic flow diagram in the revised paper; the details are as follows:
Optimization problem P1 is solved via a layered optimization approach; in this section, we design a logic flow diagram to analyze this process.
Figure 3. Logic flow diagram for the two-layered optimization process
15.Comments:
"4.1. Preliminaries of LSTM" and "Figure 3. The structure of LSTM layer" can be removed, because it can be easily found in a textbook.
Reply:
Thank you for your valuable suggestion.
We have removed "4.1. Preliminaries of LSTM" and "Figure 3. The structure of LSTM layer" and adjusted the numbering of the subsequent subsections and equations.
16.Comments:
Improve the resolution and font of "Figure 4. The structure of Attention mechanism" and "Figure 5. The framework of LAMDDPG"
Reply:
Thank you for your valuable suggestion.
Figure 4 has been removed, and the operational mechanism of the attention mechanism has been integrated into the algorithmic framework.
"Figure 5. are improved the resolution and font. Details are as follows:
Figure 4. The framework of LAMDDPG
17.Comments:
Use a standard algorithm formulation for LAMDDPG algorithm, with input and output. And put the algorithm in the same page, instead of two pages.
Reply:
Thank you for your valuable suggestion.
The algorithm's input and output have been clarified, and we have also placed the algorithm on a single page instead of two (marked in red font). Details are as follows:
LAMDDPG algorithm
Input: Environment, settings of and PDs
Output: parameters
Initialize system parameters, initialize the Actor network and the Critic network
Initialize the target network weight parameters
Initialize the experience replay memory
1: For episode = 1 to nep, do:
2:   Initialize the noise n, initialize the large-scale fading and small-scale random fading
3:   Obtain the initial state s1
4:   For k = 1 to T, do:
5:     Select
6:     Execute the action, receive the reward and environment state, and store the array in the experience replay memory
7:     Randomly sample a batch of experience from the replay memory
8:     Set
9:     Minimize the loss function to update the Critic Q network
10:    Apply the sampled policy gradient to update the Actor policy network
11:    Update the target networks
12:   End for
13: End for
18.Comments:
Check the line numbers in Page 17. They are mixed with the table.
Reply:
Thank you for your valuable suggestion.
The problem has been addressed through table format adjustments.
19.Comments:
Add the table numbering and caption in Section 5. SIMULATION RESULTS.
Reply:
Thank you for your valuable suggestion.
We have added a table number and caption to the parameter settings table in the Simulation Results section and provided explanations in the text (in red font). The details are as follows:
In this section, we demonstrate the effectiveness of the proposed algorithm; the path loss model from [13] is adopted, with the parameter settings listed in Table 2 [27].
Table 2. Parameter settings

| Parameter | Setting |
| --- | --- |
| Learning rate of Actor network | 0.002 |
| Learning rate of Critic network | 0.004 |
| Discount rate | 0.9 |
| Network update parameter | 0.01 |
| Batch size of experience replay pool | 32 |
| Transmission power of | 30 dBm |
| Initial energy | 0.1 J |
| Duration of each time slot T | 1 s |
| Maximum transmission power of | 0.1 W |
| Energy harvesting efficiency | 0.7 |
| Noise power spectral density | −170 dBm/Hz |
| Bandwidth of noise | 1 MHz |
| Noise carrier frequency | 914 MHz |
| Path loss exponent | 3 |
| Number of hidden layer nodes | 64 |
20.Comments:
Besides the standard DDPG model, add more baselines from the literature in the experiments.
Reply:
Thank you for your valuable suggestion.
In the scenario of this paper, we have improved the actor-critic structured DDPG algorithm by introducing the LSTM and attention mechanisms and have compared the performance of the improved algorithm with the standard DDPG model. Other types of DRL algorithms, such as PPO and TD3, have been clearly identified as future work directions in the conclusion section, and we will further expand the relevant research and conduct comprehensive comparisons in subsequent studies. Details are as follows:
In this paper, we consider the long-term throughput maximization problem for an EHSR IoT device in a CR-NOMA-enabled IoT network that comprises multiple primary IoT devices, a base station and an EHSR IoT device. To be closer to practical applications, we adopt a piece-wise linear function-based NLPM. We addressed this optimization problem by integrating convex optimization with the LAMDDPG algorithm. Experimental results demonstrate that the LSTM layer in the Actor network can predict channel state information from historical data, effectively solving the agent's partial observability problem. Meanwhile, the channel attention SE block in the Critic network mitigates Q-value overestimation in DRL algorithms through squeeze-excitation-scale operations. The synergy of these two mechanisms accelerates exploration, improves reward acquisition, and speeds up convergence. Moreover, we find the optimal number of PDs to maintain efficient network performance under NLPM, which is highly significant for guiding practical EHSR applications. However, we consider the ideal SIC condition in such EH-CR-NOMA symbiotic system. In future work, we will extend our research to non-ideal SIC scenarios and further explore improving other types of DRL algorithms (e.g., PPO, TD3) to address the throughput maximization problem in EH-CR-NOMA symbiotic networks with a nonlinear energy harvesting model.
21.Comments:
Improve the result figures. Use vector images if possible.
Reply:
Thank you for your valuable suggestion.
We have converted the result figures into vector graphics in PDF format, which will be uploaded as attachments.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript addresses a complex problem at the intersection of symbiotic radio, energy harvesting, and deep reinforcement learning. The integration of a non-linear power model (NLPM) is a practical and valuable aspect. However, the paper in its current form suffers from fundamental issues in presentation, clarity, and scientific rigor that prevent it from being considered for publication. The core technical idea may have merit, but it is obscured by poor writing, a confusing structure, and a lack of critical details.
1.- The paper is poorly organized. The “Preliminaries” on LSTM and Attention (Sections 4.1, 4.2) are generic textbook explanations that break the flow and belong in an appendix, if anywhere. The description of the LAMDDPG framework (Section 4.3) is convoluted. The “Simulation Results” section lacks a clear narrative and critical analysis. Please, restructure the paper logically: Introduction, System Model, Problem Formulation and Decomposition, Proposed LAMDDPG Algorithm, Simulation Results, and Conclusion sections.
2.- Integrate the LSTM and Attention explanations briefly into the algorithm description, focusing on why and how they are applied in this specific context, not on their general formulas.
3.- Create a clear, standalone subsection for the LAMDDPG algorithm, detailing the state, action, reward, and network architecture with a clear diagram.
4.- What is the exact architecture of the Actor and Critic networks? How many LSTM layers? What are their sizes? How is the attention mechanism specifically integrated into the Critic network? (In fact, the provided Figure 4 is a generic diagram and not specific to this work.)
5.- What is the size of the replay buffer? What is the noise process n? What are the specific hyperparameters for the LSTM and attention layers?
6.- It is stated that the DDPG baseline is from [27], but are the hyperparameters (learning rates, etc.) the same for all DRL algorithms to ensure a fair comparison? How were the Greedy and Random algorithms implemented exactly?
7.- Please, add a comprehensive subsection detailing the neural network architectures and all training parameters. Justify that all DRL baselines were tuned and compared fairly.
8.- The results section merely describes the curves without providing a deep, critical analysis. E.g., why does LAMDDPG really perform better? The explanation "LSTM layers... enable LMDDPG to converge faster" is insufficient. What temporal dependencies is the LSTM capturing? How does the attention mechanism help the Critic, specifically?
9.- The claim that LAMDDPG achieves a "19% over LMDDPG" and "25% over DDPG" is made without stating if these differences are statistically significant. No confidence intervals or multiple run averages are shown.
10.- The finding of an "optimal number of PDs" is interesting but not sufficiently analyzed. The explanation about interference is plausible but should be supported by additional evidence (e.g., showing the average action chosen by the agent as the number of PDs increases). The discussion must be significantly deepened. Explain the mechanisms behind the performance gains. Report results over multiple runs to show statistical significance.
11.- Please, provide more insightful analysis for the key findings, such as the optimal number of PDs.
12.- The notation is sometimes inconsistent. A table of notation would be very helpful.
13.- Abstract and Conclusion sections are overly complex and repeat the methodology. They should be rewritten to clearly state the problem, the proposed solution, the key novel contribution, and the main outcome/finding.
14.- Reference formatting is inconsistent (e.g., some journal names are italicized, others are not). Ensure full consistency with MDPI's referencing style.
15.- The manuscript must undergo a thorough, professional-level language edit before any further consideration. Every section needs to be rewritten for clarity, conciseness, and grammatical correctness.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Point-by-point response
We sincerely thank the editor and all reviewers for their constructive remarks and useful suggestions, which have significantly improved the quality of the manuscript. Each comment and suggested revision brought forward by the reviewers has been carefully considered and incorporated. Below, we respond to the reviewers' comments point by point and indicate the corresponding revisions.
Reviewer2:
The manuscript addresses a complex problem at the intersection of symbiotic radio, energy harvesting, and deep reinforcement learning. The integration of a non-linear power model (NLPM) is a practical and valuable aspect. However, the paper in its current form suffers from fundamental issues in presentation, clarity, and scientific rigor that prevent it from being considered for publication. The core technical idea may have merit, but it is obscured by poor writing, a confusing structure, and a lack of critical details.
1.Comments:
The paper is poorly organized. The “Preliminaries” on LSTM and Attention (Sections 4.1, 4.2) are generic textbook explanations that break the flow and belong in an appendix, if anywhere. The description of the LAMDDPG framework (Section 4.3) is convoluted. The “Simulation Results” section lacks a clear narrative and critical analysis. Please, restructure the paper logically: Introduction, System Model, Problem Formulation and Decomposition, Proposed LAMDDPG Algorithm, Simulation Results, and Conclusion sections.
Reply:
Thank you for your valuable suggestion.
We have reorganized the chapter sequence as follows: Introduction → Related Work → System Model → Problem Formulation and Decomposition → Proposed LAMDDPG Algorithm → Simulation Results → Conclusion. Specifically, the general explanations of the LSTM and attention mechanism originally presented in former Sections 4.1 and 4.2 have been removed. The section on the proposed LAMDDPG algorithm has been rewritten, with the attention mechanism integrated into the algorithm description.
All revised chapter titles have been highlighted in red.
The revised "Proposed LAMDDPG Algorithm" section is as follows:
5. Proposed LAMDDPG Algorithm
The optimal long-term throughput decision for P9 is modeled as a Markov Decision Process (MDP) and solved by the LAMDDPG framework. Here, acts as an agent. The LAMDDPG framework is employed to identify a sequence of optimal decisions that maximize the long-term expected cumulative discounted reward.
As shown in Figure 4, the LAMDDPG framework contains an Actor network and an Actor target network, a Critic network and a Critic target network, and an experience replay memory. The LSTM layers are integrated after the input layer of the Actor network and the Actor target network. The attention layers are adopted after the hidden layers of the Critic network and the Critic target network.
Figure 4. The framework of LAMDDPG
The current state from the environment is input to the Actor and target Actor networks. Considering the temporal correlation of the residual energy and channel state information in the current scenario, LSTM layers are introduced to capture dynamic dependencies. The LSTM layers learn from past observations to adjust the weights and biases so as to predict the real-time state. The output data are fed into the hidden layers of the Actor network and Actor target network, respectively, which output the action and the target action. To enhance the exploration capability of the Actor network, noise is added to its output. Thus, the action to be chosen is , where represents noise following a Gaussian distribution.
Moreover, attention mechanisms are integrated into the Critic network and Critic target network. The features from the hidden layer are input to the attention module. Through squeezing and excitation, the features of different channels are assigned distinct channel weights. By multiplying the features with these channel weights, they can be adaptively enhanced, and the Critic network can dynamically focus on the features that are more important to the current agent's decision-making, thereby enhancing the accuracy of decision-making. The Critic network outputs the state-action value function under the current state and the action chosen by the Actor network. The Critic target network outputs . The optimal actions generated by these networks are applied to the environment to obtain a reward , and the environment then transitions to a new state . The experience is then stored in the replay memory.
The state space, action space, and reward function are defined as follows.
State space: In this CR-NOMA-enabled EHSR network, the state space is defined as the channel state information associated with the PDs and the residual energy of . The system state at time slot can be denoted as
(23)
Action space: The agent selects data transmission or energy harvesting according to the current system state
(24)
Considering the extreme situation in which only transmits data during the whole time slot , the lower bound is
(25)
On the other hand, if only harvests energy during the whole time slot , the upper bound is
(26)
Thus, the range of values for the actions is very large, which brings instability to the network. The value of can be normalized as
(27)
where ; hence, the suitable action parameter can be constrained to improve the stability of the networks.
Reward function: When the agent selects an action in any time slot , it receives a corresponding reward, set as
(28)
where is the achievable data rate at .
The LAMDDPG algorithm employs centralized training with distributed execution. During the training phase, a batch of experience is selected from the replay memory for training, where is the batch size.
The predicted action is fed into the Critic target network. Based on these inputs, the Critic target network computes the target value . The target value for the state function is calculated by
(29)
and are fed into the Critic network. Based on these inputs, the Critic network calculates the corresponding Q value, denoted as . The parameters of the Critic network are updated using gradient descent. The loss function for the Critic network is defined as the difference between the target value and the predicted value, which is essentially the error term of the Bellman equation. Therefore, the parameters of the Critic network can be updated according to the following formula:
(30)
When updating the parameters of the Actor network, gradient ascent is employed. The parameters can be updated according to the following formula:
(31)
The parameters of both target networks are updated using a soft update method, described as follows:
(32)
(33)
where is the update coefficient.
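To make the update rules in Eqs. (29)-(33) concrete, the following is a minimal TensorFlow 2 sketch of one LAMDDPG training step. It is a hedged reconstruction rather than the authors' code: the function names train_step and soft_update, the batch layout, and the assumption that the networks are Keras models are ours.

```python
import tensorflow as tf

GAMMA, TAU = 0.9, 0.01  # discount rate and soft-update coefficient (see Table 2)

def soft_update(target_vars, source_vars, tau=TAU):
    # Eqs. (32)-(33): theta_target <- tau * theta + (1 - tau) * theta_target
    for t, s in zip(target_vars, source_vars):
        t.assign(tau * s + (1.0 - tau) * t)

def train_step(batch, actor, critic, actor_t, critic_t, actor_opt, critic_opt):
    s, a, r, s_next = batch                       # mini-batch of stored transitions
    r = tf.reshape(tf.cast(r, tf.float32), (-1, 1))  # column shape to match Q-values
    # Eq. (29): target value y = r + gamma * Q'(s', mu'(s'))
    y = r + GAMMA * critic_t([s_next, actor_t(s_next)])
    # Eq. (30): minimize the Bellman error to update the Critic
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(y - critic([s, a])))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))
    # Eq. (31): gradient ascent on Q(s, mu(s)) to update the Actor
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))
    # Eqs. (32)-(33): softly update both target networks
    soft_update(critic_t.variables, critic.variables)
    soft_update(actor_t.variables, actor.variables)
    return critic_loss, actor_loss
```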
The complete LAMDDPG algorithm is presented in the algorithm listing reproduced in our response to Reviewer 1, Comment 17.
2.Comments:
Integrate the LSTM and Attention explanations briefly into the algorithm description, focusing on why and how they are applied in this specific context, not on their general formulas.
Reply:
Thank you for your valuable suggestion.
We have removed the general formula derivations of the LSTM and attention mechanism originally included in Sections 4.1 and 4.2 of the original manuscript. Considering that the residual energy and channel quality in the symbiotic radio scenario exhibit temporal correlation, integrating an LSTM layer into the Actor network to capture such dynamic dependencies can enhance decision-making accuracy. Correspondingly, an attention module has been introduced into the Critic network, enabling it to dynamically allocate weights to the temporal features, thereby focusing on the features that significantly impact the current decision-making process.
These reasons have been supplemented in the manuscript, with specific details provided as follows:
The current state from the environment is input to the Actor and target Actor networks. Considering the temporal correlation of the residual energy and channel state information in the current scenario, LSTM layers are introduced to capture dynamic dependencies. The LSTM layers learn from past observations to adjust the weights and biases so as to predict the real-time state. The output data are fed into the hidden layers of the Actor network and Actor target network, respectively, which output the action and the target action. To enhance the exploration capability of the Actor network, noise is added to its output. Thus, the action to be chosen is , where represents noise following a Gaussian distribution.
Moreover, attention mechanisms are integrated into the Critic network and Critic target network. The features from the hidden layer are input to the attention module. Through squeezing and excitation, the features of different channels are assigned distinct channel weights. By multiplying the features with these channel weights, they can be adaptively enhanced, and the Critic network can dynamically focus on the features that are more important to the current agent's decision-making, thereby enhancing the accuracy of decision-making. The Critic network outputs the state-action value function under the current state and the action chosen by the Actor network. The Critic target network outputs . The optimal actions generated by these networks are applied to the environment to obtain a reward , and the environment then transitions to a new state . The experience is then stored in the replay memory.
3.Comments:
Create a clear, standalone subsection for the LAMDDPG algorithm, detailing the state, action, reward, and network architecture with a clear diagram.
Reply:
Thank you for your valuable suggestion.
A new independent section, "5. Proposed LAMDDPG Algorithm", has been added, detailing the state, action, and reward. We have also redrawn a dedicated schematic diagram (Figure 4), clearly labeling the connections between the LSTM layer, the attention layer, and the Actor/Critic networks.
The specific details are as follows:
Figure 4. The framework of LAMDDPG
The state space, action space, and reward function are defined as follows.
State space: In this CR-NOMA-enabled EHSR network, the state space is defined as the channel state information associated with the PDs and the residual energy of . The system state at time slot can be denoted as
(23)
Action space: The agent selects data transmission or energy harvesting according to the current system state
(24)
Considering the extreme situation in which only transmits data during the whole time slot , the lower bound is
(25)
On the other hand, if only harvests energy during the whole time slot , the upper bound is
(26)
Thus, the range of values for the actions is very large, which brings instability to the network. The value of can be normalized as
(27)
where ; hence, the suitable action parameter can be constrained to improve the stability of the networks.
Reward function: When the agent selects an action in any time slot , it receives a corresponding reward, set as
(28)
where is the achievable data rate at .
4.Comments:
What is the exact architecture of the Actor and Critic networks? How many LSTM layers? What are their sizes? How is the attention mechanism specifically integrated into the Critic network? In fact, the provided Figure 4 is a generic diagram and not specific to this work.
Reply:
Thank you for your valuable suggestion.
We appreciate your careful reading of our manuscript. In response to your questions regarding the network architecture of our proposed LAMDDPG algorithm, we provide the following detailed clarification:
- Exact Architecture of the Actor and Critic Networks
We have reorganized and clarified the architecture of both the Actor and Critic networks (a brief code-style sketch is given at the end of this reply). The details are as follows:
- Actor Network Architecture
The Actor network takes the current state s as input and outputs a deterministic action a. Its architecture is:
Input Layer: State vector s with dimension (None, s_dim), where s_dim is the size of the state space.
LSTM Layer: A single LSTM layer with 64 hidden units. The input to this layer is reshaped to (1, None, s_dim) to fit the expected format of tf.nn.dynamic_rnn (timesteps, batch, features). The output of the LSTM layer is squeezed back to (None, 64).
Dense Layer 1: A fully connected layer with 64 units and ReLU activation.
Dense Layer 2: A fully connected layer with 64 units and Tanh activation.
Output Layer: A fully connected layer with a_dim units (where a_dim is the size of the action space) and Tanh activation. The output is then scaled by a_bound to produce the final action within the valid range [-a_bound, a_bound].
- Critic Network Architecture
The Critic network takes both the state s and action a as inputs and outputs the corresponding Q-value Q(s,a). Its architecture is:
Input Layers: State vector s with dimension (None, s_dim). Action vector a with dimension (None, a_dim).
Merged Dense Layer: The state and action vectors are first processed by separate linear transformations (without activation) and then summed together with a bias term. This merged result is passed through a ReLU activation. This layer has 64 units.
Dense Layer: A fully connected layer with 64 units and ReLU activation.
Squeeze-and-Excitation (SE) Block: This is the attention mechanism integrated into our Critic network. It adaptively recalibrates the feature responses from the previous dense layer. The SE block consists of: Squeeze: A global average pooling operation across the feature dimension. Excitation: Two fully connected layers. The first reduces the dimensionality to 64 / 16 = 4 (using ReLU), and the second reconstructs it back to 64 (using Sigmoid) to produce channel-wise attention weights. Scale: The original features are multiplied by these attention weights to emphasize important channels.
Output Layer: A fully connected layer with 1 unit and no activation function, which outputs the Q-value.
- We acknowledge your comment that Figure 4 is a generic diagram. We agree that it does not adequately represent the unique integration of LSTM and attention mechanisms in our LAMDDPG model.
To rectify this, we have created a new, dedicated Figure 4
Figure 4. The framework of LAMDDPG
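For reference, here is a minimal TF2/Keras-style sketch of the Actor and Critic described above. It is an approximate reconstruction (the original implementation uses TF1's tf.nn.dynamic_rnn); the helper names build_actor, build_critic, and se_block are ours, and the state-action merge in the Critic is simplified to a concatenation instead of the summed linear transforms described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_actor(s_dim, a_dim, a_bound):
    # state -> LSTM(64) -> Dense(64, ReLU) -> Dense(64, tanh) -> Dense(a_dim, tanh) * a_bound
    s = layers.Input(shape=(s_dim,))
    x = layers.Reshape((1, s_dim))(s)            # add a time axis for the LSTM layer
    x = layers.LSTM(64)(x)                       # single LSTM layer, 64 hidden units
    x = layers.Dense(64, activation='relu')(x)
    x = layers.Dense(64, activation='tanh')(x)
    a = layers.Dense(a_dim, activation='tanh')(x)
    a = layers.Lambda(lambda t: t * a_bound)(a)  # scale into [-a_bound, a_bound]
    return Model(s, a)

def se_block(h, units=64, reduction=16):
    # Squeeze-and-Excitation on a flat 64-dim feature vector: the squeeze is trivial here,
    # so only the excitation (Dense 64/16 -> Dense 64, sigmoid) and the scaling are shown.
    w = layers.Dense(units // reduction, activation='relu')(h)
    w = layers.Dense(units, activation='sigmoid')(w)
    return layers.Multiply()([h, w])             # channel-wise reweighting of the features

def build_critic(s_dim, a_dim):
    # (state, action) -> merged Dense(64, ReLU) -> Dense(64, ReLU) -> SE block -> Q(s, a)
    s = layers.Input(shape=(s_dim,))
    a = layers.Input(shape=(a_dim,))
    x = layers.Concatenate()([s, a])             # simplified merge of state and action
    x = layers.Dense(64, activation='relu')(x)
    x = layers.Dense(64, activation='relu')(x)
    x = se_block(x)                              # attention over the 64 hidden channels
    q = layers.Dense(1)(x)                       # Q-value, no activation
    return Model([s, a], q)
```

With 64 hidden units and a reduction ratio of 16, the excitation bottleneck has 64/16 = 4 units, matching the description above.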
5.Comments:
What is the size of the replay buffer? What is the noise process n? What are the specific hyperparameters for the LSTM and attention layers?
Reply:
Thank you for your valuable suggestion.
- Size of the Replay Buffer
The size of the replay buffer used in our LAMDDPG implementation is 10000 transitions.
- Noise Process n
We employed a Gaussian noise process. Below Figure 3, we have added this detail in the revised paper as follows:
Thus, the action to be chosen is , where represents the noise following Gaussian distribution.
- Specific Hyperparameters for LSTM and Attention Layers
LSTM Layer (Actor Network): number of hidden units: 64; number of layers: 1; no dropout; return sequences not used.
Attention Layer (Critic Network):
We integrated a Squeeze-and-Excitation (SE) Block as the attention mechanism in the Critic network. Its specific hyperparameters are: Reduction Ratio: 16. Activation Functions: ReLU.
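For concreteness, a minimal sketch combining these settings (a replay memory of 10000 transitions, mini-batches of 32, and Gaussian exploration noise added to the Actor output) is given below; the class ReplayMemory, the helper noisy_action, and the noise standard deviation sigma are illustrative assumptions of ours.

```python
import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Fixed-size experience replay memory storing (s, a, r, s_next) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next = map(np.asarray, zip(*batch))
        return s, a, r, s_next

def noisy_action(actor, s, a_bound, sigma=0.1):
    # Deterministic Actor output plus zero-mean Gaussian exploration noise,
    # clipped back into the valid action range (sigma is an assumed value).
    a = np.asarray(actor(s[None, :])).reshape(-1)
    return np.clip(a + np.random.normal(0.0, sigma, size=a.shape), -a_bound, a_bound)
```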
6.Comments:
It is stated that the DDPG baseline is from [27], but are the hyperparameters (learning rates, etc.) the same for all DRL algorithms to ensure a fair comparison? How were the Greedy and Random algorithms implemented exactly?
Reply:
Thank you for your valuable suggestion.
We acknowledge the importance of consistent hyperparameter settings for a fair comparison. To address this, we adopted the following strategy:
For the final comparison reported in the paper, the baseline DDPG, our LMDDPG, and LAMDDPG were run using the same set of core hyperparameters from Reference [27]. The detailed hyperparameters used for all algorithms in the final comparison are provided in Table 2 (Parameter settings).
Greedy Algorithm:
At each time step, the greedy agent selects the action that maximizes the immediate expected Q-value. The agent inputs the current state and a densely sampled set of actions from the continuous action space into the latest version of the Critic Network, and then selects the action that maximizes the action value Q(s,a).
Random Algorithm:
The random agent selects actions uniformly at random from the action space.
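For illustration, the two baselines could be sketched as follows (a hedged reconstruction of the described procedure, not necessarily the authors' exact code; num_samples, the one-dimensional action, and the uniform sampling range are our assumptions):

```python
import numpy as np

def greedy_action(critic, s, a_bound, num_samples=200):
    # Densely sample the continuous action space, evaluate Q(s, a) with the latest
    # Critic network, and return the candidate action with the highest Q-value.
    candidates = np.linspace(-a_bound, a_bound, num_samples, dtype=np.float32).reshape(-1, 1)
    states = np.repeat(s[None, :].astype(np.float32), num_samples, axis=0)
    q_values = np.asarray(critic([states, candidates])).reshape(-1)
    return candidates[np.argmax(q_values)]

def random_action(a_bound):
    # Uniformly sample a (one-dimensional) action from the action space.
    return np.random.uniform(-a_bound, a_bound, size=(1,))
```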
7.Comments:
Please, add a comprehensive subsection detailing the neural network architectures and all training parameters. Justify that all DRL baselines were tuned and compared fairly
Reply:
Thank you for your valuable suggestion, which is crucial for improving the rigor and reproducibility of our work. We have carefully addressed your request by supplementing a comprehensive subsection and justifying the fairness of DRL baseline comparisons. Here are the key revisions and details:
We have inserted a new subsection titled “5.2 Neural Network Architectures and Training Parameters” in the revised manuscript. The details are as follows
5.2. Neural Network Architectures and Training Parameters
In this subsection, we present the neural network architecture of the proposed LAMDDPG algorithm. In the LAMDDPG algorithm, both the Actor network and the Actor target network contain an input layer, an LSTM layer, two hidden layers, and an output layer. Both the Critic network and the Critic target network consist of an input layer, two hidden layers, an attention module (based on the Squeeze-and-Excitation module), and an output layer. Both the Actor and Critic networks adopt the Adam optimizer.
The input layer of the Actor (target) network takes the state as input, performing input reshaping to adapt to the input format required by the LSTM layer. The LSTM layer contains 64 units and outputs a vector with a dimension of (32, 64), which is fed into the first hidden layer. After ReLU activation, the first hidden layer outputs a vector of (32, 64), which is then passed to the second hidden layer. Following tanh activation, the second hidden layer outputs a (32, 64) vector that is fed into the output layer. After tanh activation, the output action is scaled to the range between the maximum and minimum magnitudes of the action (target action).
The Critic (target) network takes the state ( ) and action ( ) as inputs. These inputs undergo feature fusion in the first hidden layer, and after ReLU activation, a vector with a dimension of (32, 64) is output. This vector is fed into the second hidden layer, which outputs a feature vector of dimension (32, 64) following ReLU activation. Subsequently, the vector enters the attention module: first, average pooling is performed on the feature dimension to complete the squeezing operation; then, it passes through two fully connected layers, activated by ReLU and Sigmoid respectively, to generate channel-wise attention weights. The vector is then multiplied by the attention weights to obtain an output vector of (32, 64). Finally, the action value function ( ) is output after passing through the output layer.
8.Comments:
The results section merely describes the curves without providing a deep, critical analysis. E.g., why does LAMDDPG really perform better? The explanation "LSTM layers... enable LMDDPG to converge faster" is insufficient. What temporal dependencies is the LSTM capturing? How does the attention mechanism help the Critic, specifically?
Reply:
Thank you for your valuable suggestion.
6.3. Mechanism Analysis of Performance Improvement
Through ablation experiments (Figures 4–7), we compare the LMDDPG algorithm with an added LSTM layer, the LAMDDPG algorithm (integrating LSTM and attention mechanisms), and DDPG, Greedy, Random algorithms. The simulation results demonstrate that LAMDDPG achieves faster convergence and higher cumulative rewards across scenarios with varying numbers of PDs and different energy-harvesting models.
In our scenario, inherent dependencies exist between CSI and the remaining energy of EHSR. The introduced LSTM layer leverages its hidden state to encode historical data, enabling the agent to capture implicit states that are critical for decision-making. Thus LMDDPG outperforms DDPG, greedy, and random algorithms in reward accumulation.
By integrating attention module, the extracted features from hidden layers of Critic (target) network, are squeezed and activated by Sigmoid function, which mitigates the overestimation bias of Q-values. Meanwhile, by squeeze-excitation-scaling steps, the attention module adaptively assigns weights to the extracted features, which emphasizes those that contribute more to Q-value estimation while suppressing redundant ones. This enhances the accuracy of Q-value predictions, reducing the agent’s ineffective exploration, and thus accelerates algorithm convergence while boosting cumulative rewards.
9.Comments:
The claim that LAMDDPG achieves a "19% over LMDDPG" and "25% over DDPG" is made without stating if these differences are statistically significant. No confidence intervals or multiple run averages are shown.
Reply:
Thank you for your valuable suggestion.
To verify the effectiveness of the algorithm, we adopted the same environmental configuration as in Reference [27], and the details in this paper are as follows:
The channels remain unchanged in each experiment consisting of multiple episodes. Independent and identically distributed (i.i.d.) complex Gaussian random variables with a mean of zero and a unit variance are employed to simulate small-scale fading.
The experiments were independently repeated 30 times under the same initial conditions and hyperparameters to obtain the simulation results in Figure 5; thus, we claim that LAMDDPG achieves "19% over LMDDPG" and "25% over DDPG". In the text, we clearly state the repeated simulations as follows:
As illustrated in Figure 5, after independent and repeated experiments, the data rates achieved by the DDPG,
As depicted in Figure 6 (a) and Figure 6 (b), after independent and repeated experiments,
As depicted in Figure 7, after independent and repeated experiments,
As depicted in Figure 8 (a) and Figure 8 (b), after independent and repeated experiments,
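Regarding the quoted small-scale fading model, i.i.d. zero-mean, unit-variance complex Gaussian coefficients can be generated in a couple of lines (an illustrative sketch, not the authors' simulation code; N is an assumed number of realizations):

```python
import numpy as np

N = 10_000  # assumed number of small-scale fading realizations
# i.i.d. circularly symmetric complex Gaussian coefficients, zero mean, unit variance
h = (np.random.randn(N) + 1j * np.random.randn(N)) / np.sqrt(2)
print(np.mean(np.abs(h) ** 2))  # sample second moment, close to 1
```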
10.Comments:
The finding of an "optimal number of PDs" is interesting but not sufficiently analyzed. The explanation about interference is plausible but should be supported by additional evidence (e.g., showing the average action chosen by the agent as the number of PDs increases). The discussion must be significantly deepened. Explain the mechanisms behind the performance gains. Report results over multiple runs to show statistical significance.
11.Comments:
Please, provide more insightful analysis for the key findings, such as the optimal number of PDs.
Reply:
Thank you for your valuable suggestion.
Here, we combine our responses to the above two Comments.
As seen from Figure 9, after averaging the throughput over the simulations, we find that the system achieves favorable throughput when the number of PDs reaches 10. To further elaborate on this observation, we analyzed the action selections of the algorithm when the number of primary devices is 10 and 15.
We define the action in action space
Action space: The agent selects transmitting data or energy harvesting according to the current system state
(24)
Here the bound for is (0,100). 0 means that the there is no surplus energy, meaning all harvested energy, that is , is used for data transmission, so the SD can transmit data at a high power, which will yield a higher throughput.
100 means that the harvested energy is remain, SD tends to reduce the transmit power to conserve energy and mitigate interference to the PD, which will yield a higher throughput.
We have statistically analyzed the probability distribution of the agent's output actions when the number of PDs is 10 and 15.
We added the following results to the revised paper:
Figure 9. Rewards under different number of PDs
Figure 10. Action selection when PD=10
Figure 11. Action selection when PD=15
We revised the explanation in the revised paper as follows:
Furthermore, we deploy more PDs to investigate the maximum data rate under different numbers of PDs according to different EH models. As depicted in Figure 9, the average data rate first increases and then decreases with the number of PDs under all algorithms. This is because when the number of PDs exceeds two, the system allocates more time slots, thereby enhancing the data rate. However, when the number of PDs surpasses 10, the data rate declines sharply due to the strong inter-device interference caused by the increased number of PDs. As seen from Figure 10, when PD = 10, the SD is more inclined to select actions for data transmission rather than energy conservation, compared with the scenario where the number of PDs is 15. In other words, the action corresponds to the surplus energy, and the probability of the surplus energy being small is relatively high. When the number of PDs is 15, the probability of the surplus energy being large is relatively high, indicating that the SD tends to prioritize energy harvesting over data transmission to manage the increased interference, which further contributes to the decline in data rate.
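A sketch (illustrative only; the action logs below are synthetic stand-ins, not the paper's data) of how the probability distribution of the agent's output actions, i.e., the surplus energy in (0, 100), could be tabulated for a given number of PDs:

```python
import numpy as np

def action_distribution(actions, bins=10, value_range=(0.0, 100.0)):
    """Histogram of surplus-energy actions collected during evaluation,
    normalized so the bin masses sum to one (an empirical probability)."""
    counts, edges = np.histogram(actions, bins=bins, range=value_range)
    return counts / counts.sum(), edges

# usage: compare hypothetical action logs for PD = 10 and PD = 15
rng = np.random.default_rng(1)
actions_pd10 = rng.beta(1.5, 6.0, size=5000) * 100  # skewed toward small surplus
actions_pd15 = rng.beta(6.0, 1.5, size=5000) * 100  # skewed toward large surplus
p10, _ = action_distribution(actions_pd10)
p15, _ = action_distribution(actions_pd15)
print(np.round(p10, 3), np.round(p15, 3), sep="\n")
```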
12.Comments:
The notation is sometimes inconsistent. A table of notation would be very helpful.
Reply:
Thank you for your valuable suggestion.
The notation in the manuscript has been standardized: the subscript "t" has been uniformly replaced with "k", and a corresponding table (Table 1) is provided to summarize the main notations.
Table 1. Summary of Main Notations

| Notation | Description |
| M | Number of primary devices |
| | EH symbiotic radio SD |
| | Any given time slot |
| | Channel gain between BS and |
| | Channel gain between and |
| | Channel gain between and BS |
| | Time allocation factor |
| | Duration for every time slot |
| | Transmit power of |
| | Energy harvesting threshold |
| | Energy harvesting efficiency |
| | Residual energy in the battery of |
| | Maximum battery capacity of |
| | Transmit power of |
| | Maximum transmit power of |
| | Discount rate |
| | Energy surplus |
| | Principal branch of the Lambert W function |
| | Current state |
| | Parameters of the Actor and target Actor networks |
| | Noise |
| | Action |
| | Target action |
| | Features from the hidden layer of the Critic |
| | Parameters of the Critic and target Critic networks |
| | Output of the Critic network |
| | Output of the Critic target network |
| | System reward |
| | New state |
| | Experience tuple |
| | Weight |
| | Batch size |
| | Target value |
| | Update coefficient |
13.Comments:
Abstract and Conclusion sections are overly complex and repeat the methodology. They should be rewritten to clearly state the problem, the proposed solution, the key novel contribution, and the main outcome/finding.
Reply:
Thank you for your valuable suggestion.
We have rewritten the abstract and conclusion, with details as follows:
Abstract:
Massive Internet of Things (IoT) deployments face critical spectrum crowding and energy scarcity challenges. Energy harvesting (EH) symbiotic radio (SR), where secondary devices share spectrum with and harvest energy from non-orthogonal multiple access (NOMA)-based primary systems, offers a sustainable solution. We consider long-term throughput maximization in an EHSR network with a non-linear EH model. To solve this non-convex problem, we design a two-layered optimization algorithm combining convex optimization with a deep reinforcement learning (DRL) framework. The derived optimal power, time allocation factor, and time-varying environment state are fed into the proposed long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient algorithm, named LAMDDPG, to achieve the optimal long-term throughput. Simulation results demonstrate that, by equipping the Actor with an LSTM to capture temporal states and enhancing the Critic with a channel-wise attention mechanism, namely a Squeeze-and-Excitation block, for precise Q-evaluation, the LAMDDPG algorithm achieves a faster convergence rate and higher long-term throughput than the baseline algorithms. Moreover, we find the optimal number of PDs that maintains efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications.
Conclusion:
In this paper, we consider the long-term throughput maximization problem for an EHSR IoT device in a CR-NOMA-enabled IoT network comprising multiple primary IoT devices, a base station, and an EHSR IoT device. To be closer to practical applications, we adopt a piecewise-linear-function-based NLPM. We address this optimization problem by integrating convex optimization with the LAMDDPG algorithm. Experimental results demonstrate that the LSTM layer in the Actor network can predict channel state information from historical data, effectively alleviating the agent's partial observability problem, while the channel-attention SE block in the Critic network mitigates Q-value overestimation in DRL algorithms through its squeeze-excitation-scale operations. The synergy of these two mechanisms accelerates exploration, improves reward acquisition, and speeds up convergence. Moreover, we find the optimal number of PDs that maintains efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications. However, we assume ideal SIC in this EH-CR-NOMA symbiotic system. In future work, we will extend our research to non-ideal SIC scenarios and further explore improving other types of DRL algorithms (e.g., PPO, TD3) to address the throughput maximization problem in EH-CR-NOMA symbiotic networks with a nonlinear energy harvesting model.
14.Comments:
Reference formatting is inconsistent (e.g., some journal names are italicized, others are not). Ensure full consistency with MDPI's referencing style
Reply:
Thank you for your valuable suggestion.
We have standardized the spacing, punctuation, and line-break formats of all references in accordance with the target journal's requirements to ensure consistent and compliant formatting. The specific details are as follows (highlighted in red font in the revised paper):
- P. K. Donta, S. N. Srirama, T. Amgoth, C. S. R. Annavarapu. Survey on recent advances in IoT application layer protocols and machine learning scope for research directions. Digit. Commun. Netw. 2022, 8, 727–744. DOI: 10.1016/j.dcan.2021.10.004.
- J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, J. C. Zhang. What will 5G be? IEEE J. Sel. Areas Commun. 2014, 32, 1065–1082. DOI: 10.1109/JSAC.2014.2328098.
- B. Makki, K. Chitti, A. Behravan, M.-S. Alouini. A survey of NOMA: Current status and open research challenges. IEEE Open J. Commun. Soc. 2020, 1, 179–189.
- A. Kilzi, J. Farah, C. Abdel Nour, C. Douillard. Mutual successive interference cancellation strategies in NOMA for enhancing the spectral efficiency of CoMP systems. IEEE Trans. Commun. 2020, 68, 1213–1226. DOI: 10.1109/TCOMM.2019.2945781.
- R. Alhamad, H. Boujemâa. Optimal power allocation for CRN-NOMA systems with adaptive transmit power. Signal Image Video Process. 2020, 14, 1327–1334. DOI: 10.1007/s11760-020-01674-8.
- S. Abidrabbu, H. Arslan. Energy-Efficient Resource Allocation for 5G Cognitive Radio NOMA Using Game Theory. In 2021 IEEE Wireless Communications and Networking Conference (WCNC). Chengdu, China. 09-12 December 2022. DOI: 10.1109/ICCC56324.2022.10065916.
- Liu, Z. Ding, M. Elkashlan, H. V. Poor. Cooperative non-orthogonal multiple access with simultaneous wireless information and power transfer. IEEE J. Sel. Areas Commun. 2016, 34, 938–953. DOI: 10.1109/JSAC.2016.2549378.
- Song, X. Wang, Y. Liu, Z. Zhang. Joint Spectrum Resource Allocation in NOMA-based Cognitive Radio Network With SWIPT. IEEE Access 2019, 7, 89594–89603. DOI: 10.1109/ACCESS.2019.2940976.
20. C. Yang, W. Lu, G. Huang, L. Qian, B. Li, Y. Gong. Power Optimization in Two-way AF Relaying SWIPT based Cognitive Sensor Networks. In 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall). Victoria, BC, Canada. 18 November 2020 – 16 December 2020. DOI: 10.1109/VTC2020-Fall49728.2020.9348749.
21. X. Liu, K. Zheng, K. Chi, Y.-H. Zhu. Cooperative Spectrum Sensing Optimization in Energy-Harvesting Cognitive Radio Networks. IEEE Trans. Wireless Commun. 2020, 19, 7663–7676. DOI: 10.1109/TWC.2020.3015260.
24. O. O. Umeonwuka, B. S. Adejumobi, T. Shongwe. Deep Learning Algorithms for RF Energy Harvesting Cognitive IoT Devices: Applications, Challenges and Opportunities. In 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET). Prague, Czech Republic. 09 September 2022. DOI: 10.1109/ICECET55527.2022.9872992.
26. F. T. Al Rabee, A. Masadeh, S. Abdel-Razeq, H. Bany Salameh. Actor–Critic Reinforcement Learning for Throughput-Optimized Power Allocation in Energy Harvesting NOMA Relay-Assisted Networks. IEEE Open J. Commun. Soc. 2024, 5, 7941–7953. DOI: 10.1109/OJCOMS.2024.3514785.
27. Z. Ding, R. Schober, H. V. Poor. No-Pain No-Gain: DRL Assisted Optimization in Energy-Constrained CR-NOMA Networks. IEEE Trans. Commun. 2021, 69, 5917–5932. DOI: 10.1109/TCOMM.2021.3087624.
28. Z. Shi, X. Xie, H. Lu, H. Yang, J. Cai, Z. Ding. Deep Reinforcement Learning-Based Multidimensional Resource Management for Energy Harvesting Cognitive NOMA Communications. IEEE Trans. Commun. 2022, 70, 3110–3125. DOI: 10.1109/TCOMM.2021.3126626.
29. A. Ullah, S. Zeb, A. Mahmood, S. A. Hassan, M. Gidlund. Opportunistic CR-NOMA Transmissions for Zero-Energy Devices: A DRL-Driven Optimization Strategy. IEEE Wireless Commun. Lett. 2023, 12, 893–897. DOI: 10.1109/LWC.2023.3247962.
30. K. Du, X. Xie, Z. Shi, M. Li. Throughput maximization of EH-CRN-NOMA based on PPO. In 2023 International Conference on Inventive Computation Technologies (ICICT). Raleigh, United States. March 24-26, 2023. DOI: 10.1109/ICICT57646.2023.10133954.
31. F. Zhou, Z. Chu, Y. Wu, N. Al-Dhahir, P. Xiao. Enhancing PHY security of MISO NOMA SWIPT systems with a practical non-linear EH model. In 2018 IEEE International Conference on Communications Workshops (ICC Workshops). Kansas City, MO, USA. 20-24 May 2018. DOI: 10.1109/ICCW.2018.8403565.
32. D. Kumar, P. K. Singya, K. Choi, V. Bhatia. SWIPT enabled cooperative cognitive radio sensor network with non-linear power amplifier. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 884–896. DOI: 10.1109/TCCN.2023.3269511.
- K. Li, W. Ni, F. Dressler. LSTM-Characterized Deep Reinforcement Learning for Continuous Flight Control and Resource Allocation in UAV-Assisted Sensor Networks. IEEE Internet Things J. 2022, 9, 4179–4189. DOI: 10.1109/JIOT.2021.3102831.
- X. He, Y. Mao, Y. Liu, et al. Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks. Digit. Commun. Netw. 2024, 2, 109–116. DOI: 10.1016/j.dcan.2023.02.018.
- I. Ullah, S. K. Singh, D. Adhikari, et al. Multi-Agent Reinforcement Learning for task allocation in the Internet of Vehicles: Exploring benefits and paving the future. Swarm Evol. Comput. 2025, 94, 101878. DOI: 10.1016/j.swevo.2025.101878.
- M. Alhartomi, A. Salh, L. Audah, S. Alzahrani, A. Alzahmi. Enhancing Sustainable Edge Computing Offloading via Renewable Prediction for Energy Harvesting. IEEE Access 2024, 12, 74011–74023. DOI: 10.1109/ACCESS.2024.3404222.
- J. Choi, B.-J. Lee, B.-T. Zhang. Multi-focus Attention Network for Efficient Deep Reinforcement Learning. In Deep Reinforcement Learning: Frontiers and Challenges, AAAI Conference. Washington, DC, USA. 7–14 February 2023. DOI: 10.1609/aaai.v31i1.2402.
- X. Zhou, R. Zhang, C. K. Ho. Wireless Information and Power Transfer: Architecture Design and Rate-Energy Tradeoff. IEEE Trans. Wireless Commun. 2013, 61, 4754–4767. DOI: 10.1109/TCOMM.2013.13.120855.
- I. S. Gradshteyn, I. M. Ryzhik. Table of Integrals, Series and Products, 6th ed. New York, NY, USA: Academic, 2000.
15.Comments:
The manuscript must undergo a thorough, professional-level language edit before any further consideration. Every section needs to be rewritten for clarity, conciseness, and grammatical correctness.
Reply:
Thank you for your valuable suggestion.
We have reviewed all sections of the manuscript and reorganized the paper; the details are given in our response to Comment 1.
We have rewritten the abstract, the conclusion, and certain details of the algorithm implementation, supplemented relevant literature and corresponding analyses, and deleted redundant sentences throughout the paper. The details of the revised abstract and conclusion are given in our responses to Comments 7, 8, and 13.
The introduction has been streamlined to concisely present the research background, problem statement, key challenges, and our main contributions (highlighted in red font in the revised manuscript). The details are as follows:
After the sentence "thereby offering a sustainable approach to powering IoT networks," we have added:
Establishing an EH-enabled CR-NOMA IoT network, where symbiotic secondary devices (SDs) share the primary devices' (PDs) spectrum and harvest energy from the PDs' emitted signals, can greatly enhance the spectrum efficiency and sustainability of IoT applications.
The randomness of energy harvesting from a wireless environment complicates the acquisition of precise channel state information (CSI). Deep reinforcement learning (DRL) algorithms offer solutions by providing optimal spectrum and power allocation in such networks. However, the inherent nonlinearity of practical energy harvesting circuits, particularly those characterized by piecewise functions, presents challenges for DRL-based spectrum access and power allocation aimed at long-term throughput enhancement.
In this study, we explore an EH symbiotic radio (EHSR) IoT network in which the symbiotic SD is permitted to access the spectrum assigned to the PDs via CR and to harvest energy from the PDs' emitted signals under a nonlinear power model (NLPM). We determine the optimal power and time allocation factor for the symbiotic SD using a convex optimization tool. Then, we design a long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient (DDPG) algorithm, named LAMDDPG, to solve the long-term throughput maximization for the SD. Moreover, we find the optimal number of PDs that maintains efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The main idea of the paper is good, and the combination of energy-harvesting symbiotic radio, CR-NOMA, and a DRL method with LSTM and attention is relevant. But several parts of the paper feel unclear.
The biggest issue is the novelty: many existing works already use NOMA, energy harvesting, and DRL, so it would help if you clearly stated what is truly new here. Is the novelty the two-layer optimization, the non-linear EH model, or the way you combine LSTM and attention inside DDPG? This needs to be explained directly.
In the optimization section, there are many subproblems, and it becomes difficult to track how one problem leads to the next. A simple flow diagram or short summary would help readers understand the logic.
What exactly are the inputs and outputs of your DRL agent?
How long does training take, and what hardware did you use?
The system model section does not clearly explain how NOMA works, how SIC is applied, or how PD QoS is guaranteed.
Reference 24 and 34 list different papers but share the same DOI and page numbers, one of them is incorrect and must be corrected or removed.
Reference 28 repeats the author name “Z. Ding” twice, formatting error.
Reference 30 and Reference 26 have the exact same DOI (10.1109/OJCOMS.2024.3514785) even though one is a journal article and the other is a conference paper. This needs to be fixed.
Reference 31 contains a typo: “.2” appears in the middle of the entry and should be removed.
Several references have inconsistent spacing, punctuation, and line breaks. These need to be cleaned and formatted properly.
Author Response
Point-by-point response
We gratefully thank the editor and all reviewers for their constructive remarks and useful suggestions, which have significantly improved the quality of the manuscript. Each comment and suggested revision brought forward by the reviewers has been carefully considered and incorporated. Below, the reviewers' comments are addressed point by point and the corresponding revisions are indicated.
Reviewer 3:
The main idea of the paper is good, and the combination of energy-harvesting symbiotic radio, CR-NOMA, and a DRL method with LSTM and attention is relevant. But several parts of the paper feel unclear.
1.Comments:
The biggest issue is the novelty: many existing works already use NOMA, energy harvesting, and DRL, so it would help if you clearly stated what is truly new here. Is the novelty the two-layer optimization, the non-linear EH model, or the way you combine LSTM and attention inside DDPG? This needs to be explained directly.
Reply:
Thank you for your valuable suggestion.
Our novelty is first reflected in the scenario innovation: maximizing the long-term throughput of the secondary device (SD) in a CR-NOMA cognitive radio coexistence scenario where a piecewise-function-based nonlinear energy harvesting (EH) model is adopted.
The second novelty lies in the methodological innovation for solving the optimization problem: a hierarchical optimization approach is adopted, where the derived optimal power and time allocation factors via convex optimization are embedded into the deep reinforcement learning (DRL) framework to achieve the long-term optimal throughput of the secondary device (SD) under dynamically changing environments.
The third novelty is embodied in the proposed long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient algorithm, named LAMDDPG. By equipping the Actor with an LSTM to capture temporal states and enhancing the Critic with a channel-wise attention mechanism, namely a Squeeze-and-Excitation block, for precise Q-evaluation, the LAMDDPG algorithm achieves a faster convergence rate and higher long-term throughput than the baseline algorithms.
We have rewritten the abstract, and the revised version is as follows:
Massive Internet of Things (IoT) deployments face critical spectrum crowding and energy scarcity challenges. Energy harvesting (EH) symbiotic radio (SR), where secondary devices share spectrum with and harvest energy from non-orthogonal multiple access (NOMA)-based primary systems, offers a sustainable solution. We consider long-term throughput maximization in an EHSR network with a non-linear EH model. To solve this non-convex problem, we design a two-layered optimization algorithm combining convex optimization with a deep reinforcement learning (DRL) framework. The derived optimal power, time allocation factor, and time-varying environment state are fed into the proposed long short-term memory (LSTM)-attention-mechanism combined Deep Deterministic Policy Gradient algorithm, named LAMDDPG, to achieve the optimal long-term throughput. By equipping the Actor with an LSTM to capture temporal states and enhancing the Critic with a channel-wise attention mechanism, namely a Squeeze-and-Excitation block, for precise Q-evaluation, the LAMDDPG algorithm achieves a faster convergence rate and higher long-term throughput than the baseline algorithms. Moreover, we find the optimal number of PDs that maintains efficient network performance under the NLPM, which is highly significant for guiding practical EHSR applications.
2.Comments:
In the optimization section, there are many subproblems, and it becomes difficult to track how one problem leads to the next. A simple flow diagram or short summary would help readers understand the logic.
Reply:
Thank you for your valuable suggestion.
The optimization follows a two-layer structure: the first layer derives the closed-form optimal power and time allocation via convex optimization; the second layer then uses these expressions as deterministic policies within a DRL framework to handle the long-term dynamics. The coupling is resolved by substituting the first layer's solutions into the second layer's action space, making the problem tractable for LAMDDPG.
The details are as follows (highlighted in red font):
4. Problem Formulation and Decomposition
4.1. Problem Transformation
Optimization problem P1 is solved via a layered optimization approach; in this section, we present a logic flow diagram to illustrate this process.
Figure 3. Logic flow diagram for the two-layered optimization process
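The interplay between the two layers can also be sketched in code. In the toy Python sketch below, `closed_form_power_time`, the environment, and the agent interface are hypothetical placeholders standing in for the paper's derived closed-form expressions and training loop, not its actual equations:

```python
import numpy as np

rng = np.random.default_rng(0)

def closed_form_power_time(harvested_energy, surplus_energy, slot_len=1.0):
    """Hypothetical stand-in for the first (convex) layer: spend whatever the
    agent did not reserve as surplus, spread over the whole slot."""
    energy_for_tx = max(harvested_energy - surplus_energy, 0.0)
    return energy_for_tx / slot_len, slot_len      # (power, time allocation)

def run_episode(agent_act, n_slots=100, noise_power=1e-3):
    """Second (DRL) layer: the agent only picks the surplus energy per slot;
    the inner closed-form step converts it into power and time allocation."""
    total_rate = 0.0
    for _ in range(n_slots):
        channel_gain = rng.exponential(1.0)        # toy fading sample
        harvested = rng.uniform(0.5, 1.5)          # toy harvested energy
        surplus = agent_act(channel_gain, harvested)   # outer decision
        power, tau = closed_form_power_time(harvested, surplus)
        total_rate += tau * np.log2(1.0 + power * channel_gain / noise_power)
    return total_rate

# usage with a trivial agent that always reserves 20% of the harvested energy
print(run_episode(lambda g, e: 0.2 * e))
```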
3.Comments:
What exactly are the inputs and outputs of your DRL agent?
Reply:
Thank you for your valuable suggestion.
The input of the DRL agent is the state, and its output is the action. These are defined in our paper as follows.
The state space, action space, and reward function are defined as follows:
State space: in this CR-NOMA-enabled EHSR network, the state is defined as the channel state information associated with the PD and the residual energy of the SD. The system state at each time slot is denoted as
(23)
Action space: the agent selects data transmission or energy harvesting according to the current system state
(24)
Considering the extreme case in which the SD only transmits data for the whole time slot, the lower bound of the action is
(25)
On the other hand, if the SD only harvests energy for the whole time slot, the upper bound is
(26)
Thus the range of the action values is very large, which brings instability to the network. The action value can therefore be normalized as
(27)
where the normalization maps the action into a bounded interval; hence, the action parameter is constrained to a suitable range, improving the stability of the network.
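A small sketch of such a normalization step (the bounds and the target interval here are our assumptions for illustration; the exact mapping in Eq. (27) may differ):

```python
import numpy as np

def normalize_action(raw, lower, upper):
    """Map a raw action in [lower, upper] (cf. Eqs. (25)-(26)) to [-1, 1]."""
    return 2.0 * (raw - lower) / (upper - lower) - 1.0

def denormalize_action(norm, lower, upper):
    """Inverse mapping applied before the action is used in the environment."""
    return lower + (np.clip(norm, -1.0, 1.0) + 1.0) * (upper - lower) / 2.0

# usage with illustrative bounds of 0 and 100 for the surplus energy
print(normalize_action(25.0, 0.0, 100.0))    # -> -0.5
print(denormalize_action(-0.5, 0.0, 100.0))  # -> 25.0
```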
4.Comments:
How long does training take, and what hardware did you use?
Reply:
Thank you for your valuable suggestion.
Specifically, for each scenario illustrated in the simulation figures, the training time per run was 5 minutes. The simulation environment was as follows:
CPU: AMD Ryzen 7 8845H w/ Radeon 780M Graphics
GPU: NVIDIA GeForce RTX 4060 Laptop GPU
Memory: 32 GB DDR5-5600 RAM
Storage: 1 TB NVMe SSD (for storing training logs and model checkpoints)
Software Environment: PyTorch 2.3.1, CUDA 11.8, Python 3.8
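For reproducibility under the stated software stack, a small sketch (ours, not the authors' scripts) that fixes the random seeds, checks GPU availability, and measures the wall-clock time of a training run:

```python
import random
import time

import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.__version__, device)

start = time.perf_counter()
# train_lamddpg(device)  # placeholder for the actual training loop
elapsed = time.perf_counter() - start
print(f"training wall-clock time: {elapsed / 60:.1f} min")
```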
5.Comments:
The system model section does not clearly explain how NOMA works, how SIC is applied, or how PD QoS is guaranteed.
Reply:
Thank you for your valuable suggestion.
To address this concern, we have supplemented the implementation process of NOMA and the detailed workflow of successive interference cancellation (SIC) in the revised manuscript, and clarified the mechanisms that guarantee the quality of service (QoS) of the primary devices (PDs). The details are as follows (highlighted in red font in the revised paper):
Specifically, the PD is assigned the time slot to guarantee its data transmission priority. As depicted in Figure 2, the EHSR SD achieves non-orthogonal multiplexing by sharing one of the PDs' time slots to send information to the BS and simultaneously harvesting energy during this process, based on the underlay mode [40]. The BS utilizes successive interference cancellation (SIC) technology, aided by known channel state information (CSI) that includes both large-scale and multipath fading. The BS first decodes the PD's signal and subtracts it from the received mixed signal to decode the SD's signal. During any given time slot, the SD commences data transmission.
Here, the maximum battery capacity of the SD, the energy consumed by the SD for data transmission in each time slot, and the SD's transmit power together with its maximum value are defined. Based on formula (2), the SD's transmit power is dynamically constrained by its harvested energy and battery capacity. This energy constraint inherently limits the SD's interference to the co-channel PD, which guarantees the quality of service (QoS) for the PD.
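A numerical sketch (illustrative values and variable names are our assumptions) of the SIC order described above: the BS decodes the PD while treating the SD as interference, removes the PD's signal, and then decodes the SD against noise only:

```python
import numpy as np

def sic_rates(p_pd, g_pd, p_sd, g_sd, noise=1e-3):
    """Achievable rates (bits/s/Hz) when the BS decodes the PD first under SIC.
    The PD sees the SD's signal as interference; after cancellation the SD is
    decoded interference-free."""
    sinr_pd = p_pd * g_pd / (p_sd * g_sd + noise)
    rate_pd = np.log2(1.0 + sinr_pd)
    snr_sd = p_sd * g_sd / noise          # PD signal already removed
    rate_sd = np.log2(1.0 + snr_sd)
    return rate_pd, rate_sd

# usage: the SD's low, energy-constrained power keeps the PD's rate loss small
print(sic_rates(p_pd=1.0, g_pd=0.8, p_sd=0.05, g_sd=0.5))
```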
6.Comments:
Reference 24 and 34 list different papers but share the same DOI and page numbers, one of them is incorrect and must be corrected or removed.
Reply:
Thank you for your valuable suggestion.
We have verified the DOIs and page numbers of References 24 and 34 and corrected those of Reference 24. The specific details are as follows (highlighted in red font):
24. O. O. Umeonwuka, B. S. Adejumobi, T. Shongwe. Deep Learning Algorithms for RF Energy Harvesting Cognitive IoT Devices: Applications, Challenges and Opportunities. In 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET). Prague, Czech Republic. 09 September 2022. DOI: 10.1109/ICECET55527.2022.9872992.
7.Comments:
Reference 28 repeats the author name “Z. Ding” twice, formatting error.
Reply:
Thank you for your valuable suggestion.
We have removed the duplicated "Z. Ding" from Reference 28 to ensure a uniform author-list format. The specific details are as follows (highlighted in red font):
28. Z. Shi, X. Xie, H. Lu, H. Yang, J. Cai, Z. Ding. Deep Reinforcement Learning-Based Multidimensional Resource Management for Energy Harvesting Cognitive NOMA Communications. IEEE Trans. Commun. 2022, 70, 3110–3125. DOI: 10.1109/TCOMM.2021.3126626.
8.Comments:
Reference 30 and Reference 26 have the exact same DOI (10.1109/OJCOMS.2024.3514785) even though one is a journal article and the other is a conference paper. This needs to be fixed.
Reply:
Thank you for your valuable suggestion.
We have revised the DOI of Reference 30. The specific details are as follows (highlighted in red font):
30. K. Du, X. Xie, Z. Shi, M. Li. Throughput maximization of EH-CRN-NOMA based on PPO. In 2023 International Conference on Inventive Computation Technologies (ICICT). Raleigh, United States. March 24-26, 2023. DOI: 10.1109/ICICT57646.2023.10133954.
9.Comments:
Reference 31 contains a typo: “.2” appears in the middle of the entry and should be removed.
Reply:
Thank you for your valuable suggestion.
We have removed the incorrect ".2" character from the middle of Reference 31. The specific details are as follows (highlighted in red font):
31. F. Zhou, Z. Chu, Y. Wu, N. Al-Dhahir, P. Xiao. Enhancing PHY security of MISO NOMA SWIPT systems with a practical non-linear EH model. In 2018 IEEE International Conference on Communications Workshops (ICC Workshops). Kansas City, MO, USA. 20-24 May 2018. DOI: 10.1109/ICCW.2018.8403565.
10.Comments:
Several references have inconsistent spacing, punctuation, and line breaks. These need to be cleaned and formatted properly.
Reply:
Thank you for your valuable suggestion.
We have standardized the spacing, punctuation, and line-break formats of all references in accordance with the target journal's requirements to ensure consistent and compliant formatting. The specific details are as follows (highlighted in red font):
- P. K. Donta, S. N. Srirama, T. Amgoth, C. S. R. Annavarapu. Survey on recent advances in IoT application layer protocols and machine learning scope for research directions. Digit. Commun. Netw. 2022, 8, 727–744. DOI: 10.1016/j.dcan.2021.10.004.
- J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, J. C. Zhang. What will 5G be? IEEE J. Sel. Areas Commun. 2014, 32, 1065–1082. DOI: 10.1109/JSAC.2014.2328098.
- B. Makki, K. Chitti, A. Behravan, M.-S. Alouini. A survey of NOMA: Current status and open research challenges. IEEE Open J. Commun. Soc. 2020, 1, 179–189.
- A. Kilzi, J. Farah, C. Abdel Nour, C. Douillard. Mutual successive interference cancellation strategies in NOMA for enhancing the spectral efficiency of CoMP systems. IEEE Trans. Commun. 2020, 68, 1213–1226. DOI: 10.1109/TCOMM.2019.2945781.
- R. Alhamad, H. Boujemâa. Optimal power allocation for CRN-NOMA systems with adaptive transmit power. Signal Image Video Process. 2020, 14, 1327–1334. DOI: 10.1007/s11760-020-01674-8.
- S. Abidrabbu, H. Arslan. Energy-Efficient Resource Allocation for 5G Cognitive Radio NOMA Using Game Theory. In 2021 IEEE Wireless Communications and Networking Conference (WCNC). Chengdu, China. 09-12 December 2022. DOI: 10.1109/ICCC56324.2022.10065916.
- Liu, Z. Ding, M. Elkashlan, H. V. Poor. Cooperative non-orthogonal multiple access with simultaneous wireless information and power transfer. IEEE J. Sel. Areas Commun. 2016, 34, 938–953. DOI: 10.1109/JSAC.2016.2549378.
19. Z. Song, X. Wang, Y. Liu, Z. Zhang. Joint Spectrum Resource Allocation in NOMA-based Cognitive Radio Network With SWIPT. IEEE Access 2019, 7, 89594–89603. DOI: 10.1109/ACCESS.2019.2940976.
20. C. Yang, W. Lu, G. Huang, L. Qian, B. Li, Y. Gong. Power Optimization in Two-way AF Relaying SWIPT based Cognitive Sensor Networks. In 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall). Victoria, BC, Canada. 18 November 2020 – 16 December 2020. DOI: 10.1109/VTC2020-Fall49728.2020.9348749.
21. X. Liu, K. Zheng, K. Chi, Y.-H. Zhu. Cooperative Spectrum Sensing Optimization in Energy-Harvesting Cognitive Radio Networks. IEEE Trans. Wireless Commun. 2020, 19, 7663–7676. DOI: 10.1109/TWC.2020.3015260.
24. O. O. Umeonwuka, B. S. Adejumobi, T. Shongwe. Deep Learning Algorithms for RF Energy Harvesting Cognitive IoT Devices: Applications, Challenges and Opportunities. In 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET). Prague, Czech Republic. 09 September 2022. DOI: 10.1109/ICECET55527.2022.9872992.
26. F. T. Al Rabee, A. Masadeh, S. Abdel-Razeq, H. Bany Salameh. Actor–Critic Reinforcement Learning for Throughput-Optimized Power Allocation in Energy Harvesting NOMA Relay-Assisted Networks. IEEE Open J. Commun. Soc. 2024, 5, 7941–7953. DOI: 10.1109/OJCOMS.2024.3514785.
27. Z. Ding, R. Schober, H. V. Poor. No-Pain No-Gain: DRL Assisted Optimization in Energy-Constrained CR-NOMA Networks. IEEE Trans. Commun. 2021, 69, 5917–5932. DOI: 10.1109/TCOMM.2021.3087624.
28. Z. Shi, X. Xie, H. Lu, H. Yang, J. Cai, Z. Ding. Deep Reinforcement Learning-Based Multidimensional Resource Management for Energy Harvesting Cognitive NOMA Communications. IEEE Trans. Commun. 2022, 70, 3110–3125. DOI: 10.1109/TCOMM.2021.3126626.
29. A. Ullah, S. Zeb, A. Mahmood, S. A. Hassan, M. Gidlund. Opportunistic CR-NOMA Transmissions for Zero-Energy Devices: A DRL-Driven Optimization Strategy. IEEE Wireless Commun. Lett. 2023, 12, 893–897. DOI: 10.1109/LWC.2023.3247962.
30. K. Du, X. Xie, Z. Shi, M. Li. Throughput maximization of EH-CRN-NOMA based on PPO. In 2023 International Conference on Inventive Computation Technologies (ICICT). Raleigh, United States. March 24-26, 2023. DOI: 10.1109/ICICT57646.2023.10133954.
31. F. Zhou, Z. Chu, Y. Wu, N. Al-Dhahir, P. Xiao. Enhancing PHY security of MISO NOMA SWIPT systems with a practical non-linear EH model. In 2018 IEEE International Conference on Communications Workshops (ICC Workshops). Kansas City, MO, USA. 20-24 May 2018. DOI: 10.1109/ICCW.2018.8403565.
32. D. Kumar, P. K. Singya, K. Choi, V. Bhatia. SWIPT enabled cooperative cognitive radio sensor network with non-linear power amplifier. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 884–896. DOI: 10.1109/TCCN.2023.3269511.
- K. Li, W. Ni, F. Dressler. LSTM-Characterized Deep Reinforcement Learning for Continuous Flight Control and Resource Allocation in UAV-Assisted Sensor Networks. IEEE Internet Things J. 2022, 9, 4179–4189. DOI: 10.1109/JIOT.2021.3102831.
- X. He, Y. Mao, Y. Liu, et al. Channel assignment and power allocation for throughput improvement with PPO in B5G heterogeneous edge networks. Digit. Commun. Netw. 2024, 2, 109–116. DOI: 10.1016/j.dcan.2023.02.018.
- I. Ullah, S. K. Singh, D. Adhikari, et al. Multi-Agent Reinforcement Learning for task allocation in the Internet of Vehicles: Exploring benefits and paving the future. Swarm Evol. Comput. 2025, 94, 101878. DOI: 10.1016/j.swevo.2025.101878.
- M. Alhartomi, A. Salh, L. Audah, S. Alzahrani, A. Alzahmi. Enhancing Sustainable Edge Computing Offloading via Renewable Prediction for Energy Harvesting. IEEE Access 2024, 12, 74011–74023. DOI: 10.1109/ACCESS.2024.3404222.
- J. Choi, B.-J. Lee, B.-T. Zhang. Multi-focus Attention Network for Efficient Deep Reinforcement Learning. In Deep Reinforcement Learning: Frontiers and Challenges, AAAI Conference. Washington, DC, USA. 7–14 February 2023. DOI: 10.1609/aaai.v31i1.2402.
- X. Zhou, R. Zhang, C. K. Ho. Wireless Information and Power Transfer: Architecture Design and Rate-Energy Tradeoff. IEEE Trans. Wireless Commun. 2013, 61, 4754–4767. DOI: 10.1109/TCOMM.2013.13.120855.
43. I. S. Gradshteyn, I. M. Ryzhik. Table of Integrals, Series and Products, 6th ed. New York, NY, USA: Academic, 2000.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Dear authors,
This version can be accepted.
Author Response
Thank you.
Reviewer 2 Report
Comments and Suggestions for Authors
After reviewing the new version of the manuscript, in the reviewer's opinion the article can be considered for publication. However, some figures need improved quality, e.g., Figures 10 and 11.
Comments on the Quality of English Language
The English is fine and does not require any improvement.
Author Response
Thank you for your valuable suggestion.
We have converted Figures 10 and 11 into PDF vector graphics and attached them after the point-by-point response.
Author Response File:
Author Response.pdf