Research on Offloading and Resource Allocation for MEC with Energy Harvesting Based on Deep Reinforcement Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Please find below my comments on various sections of the manuscript. These comments are aimed at improving the clarity and completeness of the paper.
Comments from Section 1
Comment 1:
Lines 39-43
The sentence is long and difficult to follow. Consider breaking it down into smaller, more digestible parts.
Comment 2:
Line 53
What does "EH" stand for? If I had to guess, it would be "Energy Harvesting," but the acronym hasn’t been defined yet. Please clarify this earlier in the paper.
Comment 3:
Lines 81-89
Could you provide a clearer explanation of why both Q-learning and A3C are used, rather than just presenting A3C? Clarifying the reasoning behind this choice would help the reader better understand the methodology.
Comments from Section 2
Comment 4:
The related works section currently reads like a summary of prior research. To improve it, consider addressing the gaps in existing research and clearly explaining how this work fills those gaps. Justifying the significance of your work in comparison to previous studies would strengthen this section.
Comment 5:
The related works section currently presents a dense block of text. It would be beneficial to break it into subsections (e.g., "Offloading," "Energy Harvesting," "Learning") to enhance readability. The suggested subsection headings are just examples, so please feel free to adjust them to better reflect the content. This approach would help organize the material and make it easier for readers to navigate the different topics.
Comment 6:
Consider adding a table in the related works section to highlight the contributions of other studies and distinguish your work from existing literature. This would make it easier for readers to see at a glance what makes your work stand out.
Comments from Section 3
Comment 7:
Line 176
I assume that "MBSs" refers to “mobile base stations,” but the acronym has not been defined before. Please define it for clarity.
Comment 8:
Line 177
What does "MD" stand for? The acronym has not been defined before, so please clarify.
Comment 9:
Figure 1
The figure is not mentioned in the paper and has not been adequately described. It would be helpful to offer a detailed description of the figure’s purpose in the context of the system model. Additionally, the current figure could be improved. For example, adding numbered arrows or labeled steps would help guide the reader through the flow of information and enhance the effectiveness of the figure.
Comments from Section 5
Comment 10:
The MDP representation is packed and difficult to read. I recommend creating a table that describes each variable in terms of state, action, and reward. For example:
- g(t) = channel gains between mobile devices (MDs) and mobile base stations (MBSs)
- F = available MEC server computing resources
- λ(t) = data size of each arriving task
- H(t) = channel interference between MDs and MBSs
Comment 11:
Lines 456-485
The paper explains the A3C algorithm. However, it would benefit from a clearer focus on how these formulations apply to the specific MEC and energy harvesting context. For example, a discussion on reward design, policy, and value functions in this context would provide necessary clarity. If the algorithm follows the standard A3C framework, detailed derivations (e.g., actor-critic updates, partial derivatives) may not be necessary, but if any modifications were made, these should be clearly stated and justified.
There is also some inconsistency in notation. For instance, the lowercase theta is used to represent bandwidth allocation (as shown in Table 1) and later as parameters of the value network (Equation 30). This reuse of variable names could lead to confusion, so it should be revised for clarity.
Comments from Section 6
Comment 12:
Are all the simulation/experiment parameters in Section 6 taken from just one paper (Paper 34)? Clarifying this would help establish the source of the methodologies or results presented.
Comment 13:
A table of simulation/experiment parameters should be included to clarify the specific values and settings used in the study.
Comment 14:
A separate subsection or section should be created to detail the parameters used in the A3C algorithm. These parameters include the policy, horizon, number of epochs, and clipping parameter. Additionally, it is unclear whether the A3C implementation is based on the Stable Baselines library or was self-implemented. This needs to be clarified.
If these details are omitted, it may create transparency issues, making it difficult for readers to evaluate the methodology’s validity. The choice of parameters in A3C can significantly impact performance, and without them, the results may be difficult to interpret or reproduce.
Comment 15:
Figures 7 and 8
Figures 7 and 8 show cumulative rewards and the number of steps for the proposed A3C algorithm over different training rounds, respectively. However, these figures alone do not provide enough information to fully assess the effectiveness of the training process. Additional details should be provided, such as:
- Learning rate
- Batch size
- Number of workers/environments
- Neural network architecture and layer configuration
- Hardware setup used
Comment 16:
What is the time complexity of the proposed algorithm?
Author Response
Response to reviewers
We sincerely appreciate the reviewer’s consideration and constructive comments. The concerns raised by the reviewer are very helpful in ensuring the quality of this manuscript. We have addressed the concerns point by point in the following. We sincerely expect that our responses can clarify the ambiguities of this manuscript and successfully address the reviewer’s concerns.
Comments from Section 1
Comment 1:
Lines 39-43
The sentence is long and difficult to follow. Consider breaking it down into smaller, more digestible parts.
Response: Thanks for the reviewer’s suggestion on the readability of the sentences! Lines 39-43 of the original text do contain long sentences that are difficult to follow, so they have been broken down into short, logical sentences, as follows:
On page 2, line 39 to 43
Revision: However, mobile devices face strict battery capacity limitations. They cannot provide continuous energy supply for computational tasks. This energy constraint frequently leads to insufficient power for task processing. Even after offloading, these tasks may still be discarded due to resource shortages. Such interruptions significantly degrade processing efficiency and ultimately reduce service quality [8][9].
Comment 2:
Line 53
What does "EH" stand for? If I had to guess, it would be "Energy Harvesting," but the acronym hasn’t been defined yet. Please clarify this earlier in the paper.
Response: Dear reviewer, thank you for your careful attention to the omission of terminology definitions. You are absolutely correct that ‘EH’ is the abbreviation for ‘Energy Harvesting’, which appears for the first time in the abstract. We have added the full definition at the first occurrence in accordance with academic norms, as described below:
On page 1, line 1 to 2.
Revision: Mobile edge computing (MEC) systems empowered by energy harvesting EH significantly enhance sustainable computing capabilities for mobile devices.
Comment 3:
Lines 81-89
Could you provide a clearer explanation of why both Q-learning and A3C are used, rather than just presenting A3C? Clarifying the reasoning behind this choice would help the reader better understand the methodology.
Response: Thanks to the reviewer for the attention paid to the logic of the methodological choices! Lines 81-89 of the original text, when describing the use of Q-learning and A3C, do need to be more clearly articulated in terms of their positioning and complementary relationship. The following is an additional explanation and a concrete proposal for change:
On page 3, line 90 to 96.
Revision: In summary, Q-learning and A3C form a hierarchical solution framework: the former validates the RL approach in simplified settings, while the latter addresses the scalability and complexity challenges of real-world MEC systems. This combination is not redundant but strategically complementary, ensuring both theoretical rigor and practical utility. It demonstrates a systematic progression from foundational RL techniques to advanced deep RL methodologies, tailored to the problem’s dual needs of clarity in small-scale modeling and efficiency in large-scale deployment.
Comments from Section 2
Comment 4:
The related works section currently reads like a summary of prior research. To improve it, consider addressing the gaps in existing research and clearly explaining how this work fills those gaps. Justifying the significance of your work in comparison to previous studies would strengthen this section.
Response: We gratefully appreciate your comment. We have added a discussion of the gaps in existing research to the related work section.
On page 5, line 183 to 194.
Revision: Current work mainly focuses on static resource allocation or single-objective optimisation, but lacks joint optimisation for the problem of dynamic energy harvesting coupled with high-dimensional decision-making in multi-user multi-server scenarios. In addition, traditional model predictive control methods struggle to guarantee robustness in the face of time-varying channels and stochastic energy arrivals, while existing deep reinforcement learning methods do not effectively incorporate the Lyapunov optimisation framework to address energy causality constraints. In contrast, this study proposes for the first time a joint optimisation framework of hierarchical deep reinforcement learning and Lyapunov drift-penalty, which significantly reduces the complexity of the high-dimensional action space by decomposing the long-term optimisation into per-time-slot subproblems, while ensuring energy queue stability.
Comment 5:
The related works section currently presents a dense block of text. It would be beneficial to break it into subsections (e.g., "Offloading," "Energy Harvesting," "Learning") to enhance readability. The suggested subsection headings are just examples, so please feel free to adjust them to better reflect the content. This approach would help organize the material and make it easier for readers to navigate the different topics.
Response: We are grateful to the reviewer for the suggestion on the structure of the related work section. Segmenting dense text by topic can significantly improve readability. We have therefore divided the related work into three subsections: computation offloading and resource allocation; energy-harvesting-driven MEC systems; and reinforcement learning applications in MEC.
Comment 6:
Consider adding a table in the related works section to highlight the contributions of other studies and distinguish your work from existing literature. This would make it easier for readers to see at a glance what makes your work stand out.
Response: Dear reviewer, we sincerely thank you for your constructive comment on the literature review section. After careful discussion, we believe that the existing text (Section 2.3) has clearly defined the innovativeness of this paper through a point-by-point comparison (e.g., algorithmic architecture, optimisation objectives, experimental scenarios, etc.).
Comments from Section 3
Comment 7:
Line 176
I assume that "MBSs" refers to “mobile base stations,” but the acronym has not been defined before. Please define it for clarity.
Response: We gratefully appreciate your valuable suggestion. We have revised the text in Section 3.
On page 5, line 198 to 199.
Revision: The system model is shown in Fig.1, which consists of mobile devices, Macro Base Stations (MBSs), and MEC servers. In this system mode…。
Comment 8:
Line 177
What does "MD" stand for? The acronym has not been defined before, so please clarify.
Response: We gratefully appreciate your comment. The acronym MD first appears in the abstract, and we have modified it as follows:
On page 1, line 1 to 2.
Revision: Mobile edge computing (MEC) systems empowered by energy harvesting (EH) significantly enhance sustainable computing capabilities for mobile devices (MDs).
Comment 9:
Figure 1
The figure is not mentioned in the paper and has not been adequately described. It would be helpful to offer a detailed description of the figure’s purpose in the context of the system model. Additionally, the current figure could be improved. For example, adding numbered arrows or labeled steps would help guide the reader through the flow of information and enhance the effectiveness of the figure.
Response: We gratefully appreciate your valuable suggestion. Regarding the issues you pointed out, the figure is in fact described and cited in the paper, specifically in the system model of Section 3 (page 5). There we explain the content of the figure in more detail in the context of the system model, with the aim of helping readers understand the relationship between the figure and the system model.
Comments from Section 5
Comment 10:
The MDP representation is packed and difficult to read. I recommend creating a table that describes each variable in terms of state, action, and reward. For example:
- g(t) = channel gains between mobile devices (MDs) and mobile base stations (MBSs)
- F = available MEC server computing resources
- λ(t) = data size of each arriving task
- H(t) = channel interference between MDs and MBSs
Response: We gratefully appreciate your valuable suggestion. We have added a description of the variables in Table 1.
Comment 11:
Lines 456-485
The paper explains the A3C algorithm. However, it would benefit from a clearer focus on how these formulations apply to the specific MEC and energy harvesting context. For example, a discussion on reward design, policy, and value functions in this context would provide necessary clarity. If the algorithm follows the standard A3C framework, detailed derivations (e.g., actor-critic updates, partial derivatives) may not be necessary, but if any modifications were made, these should be clearly stated and justified.
There is also some inconsistency in notation. For instance, the lowercase theta is used to represent bandwidth allocation (as shown in Table 1) and later as parameters of the value network (Equation 30). This reuse of variable names could lead to confusion, so it should be revised for clarity.
Response: We thank the reviewer for the valuable comments, and we are keenly aware of the importance of more clearly articulating how the A3C algorithm is applied to MEC and energy harvesting environments, as well as resolving symbol inconsistencies in the paper. Below are the responses to these questions and the specific modifications:
In the reward function, the first term denotes the instantaneous cost weighted through the Lyapunov framework; it includes factors such as task completion delay and overall energy consumption, and guides the algorithm to focus on the immediate performance of the system during optimisation. The second term is related to the virtual energy queue and is used to balance energy consumption across time slots to ensure long-term stable operation of the system. In MEC and energy harvesting environments, such a reward design motivates the agent to satisfy task requirements while using energy rationally to minimise the total system cost.
We have changed the notation for bandwidth allocation. Wherever the lowercase theta appeared in the text to indicate a bandwidth allocation, it has been replaced with a distinct symbol to avoid conflict with the value-network parameters.
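To make the reward structure described above concrete, the short Python sketch below illustrates one plausible way to compute such a Lyapunov-weighted per-slot reward. The function and variable names (compute_reward, delay_cost, energy_cost, drop_penalty, virtual_queue, energy_used, V) are illustrative assumptions and do not correspond to the exact symbols or code used in the manuscript.

```python
def compute_reward(delay_cost, energy_cost, drop_penalty,
                   virtual_queue, energy_used, V):
    """Illustrative per-slot reward under a Lyapunov drift-plus-penalty design.

    Assumption (not taken from the manuscript): the immediate cost is a
    weighted sum of delay, energy, and task-drop penalties, and the virtual
    energy queue couples current energy usage to long-term queue stability.
    """
    # Immediate cost term, scaled by the Lyapunov control parameter V
    immediate_cost = V * (delay_cost + energy_cost + drop_penalty)
    # Queue-coupling term: discourages spending energy when the virtual queue is large
    queue_term = virtual_queue * energy_used
    # The reward is the negative of the per-slot drift-plus-penalty objective
    return -(immediate_cost + queue_term)
```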
Comments from Section 6
Comment 12:
Are all the simulation/experiment parameters in Section 6 taken from just one paper (Paper 34)? Clarifying this would help establish the source of the methodologies or results presented.
Response: Thank you for your interest in the source of the experimental parameters. The simulation/experiment parameters in Section 6 are not taken only from paper [34]; they are drawn from a variety of sources. Some of the parameters are referenced from paper [34] because of its authority and representativeness in the research field: for example, for the energy harvesting parameters and channel power gain, paper [34] provides reasonable distributions, which we adopt to make the experimental environment closer to the actual situation. However, many other parameters are determined based on our research objectives and the characteristics of our system model.
When determining the parameters, we also referred to a large amount of related literature and carefully adjusted each parameter by considering the parameter settings in different studies and the requirements of the actual scenario. In addition, for some key parameters, such as computing-resource-related parameters, transmission power, and time slot length, we determined their values through several pre-experiments and theoretical analyses to ensure that the performance of the proposed method can be effectively evaluated and that the experimental results are reliable and convincing.
Comment 13:
A table of simulation/experiment parameters should be included to clarify the specific values and settings used in the study.
Response: Many thanks for the reviewer's comment; we have added a new Table 2 listing the names and values of the experimental parameters in the paper.
Comment 14:
A separate subsection or section should be created to detail the parameters used in the A3C algorithm. These parameters include the policy, horizon, number of epochs, and clipping parameter. Additionally, it is unclear whether the A3C implementation is based on the Stable Baselines library or was self-implemented. This needs to be clarified.
If these details are omitted, it may create transparency issues, making it difficult for readers to evaluate the methodology’s validity. The choice of parameters in A3C can significantly impact performance, and without them, the results may be difficult to interpret or reproduce.
Response: We gratefully thank the reviewer for the precious time spent making constructive remarks.
In this study, we design a neural network architecture with specific configurations for the Actor and Critic networks. The Actor network takes an input layer corresponding to the state-space dimension. It consists of two hidden layers, each containing 256 neurons with the ReLU activation function. The output layer corresponds to the action-space dimension, where continuous actions pertain to resource allocation parameters and discrete actions refer to offloading decisions. The probabilities of discrete actions are processed using the Softmax function.
The Critic network shares the first two hidden layers with the Actor network. Its output layer is a scalar value function that evaluates the state value. Regarding hyperparameters, we set the initial learning rate to 0.001 and utilize the RMSProp optimizer with momentum. A discount factor is employed to balance short-term and long-term rewards. An entropy regularization term is introduced to encourage policy exploration and prevent premature convergence. To avoid gradient explosion and stabilize the training process, gradient clipping is used. We use 8 asynchronous threads to explore different environment samples in parallel, thereby accelerating the training process. The training consists of a number of epochs, each containing 1000 time-slot iterations.
Our implementation framework is based on PyTorch 2.0, and we refrain from relying on third-party libraries such as Stable Baselines to ensure that the algorithm logic is fully consistent with the paper's description. The key modules include environmental interaction, where each asynchronous thread independently simulates the dynamic behavior of the MEC system, including energy harvesting, channel variations, and task arrivals, to generate state-action-reward samples. Additionally, for parameter synchronization, the global network aggregates the gradient updates from each thread every 50 steps to avoid parameter divergence during asynchronous training.
The selection of parameters is well-founded. The neural network structure draws inspiration from the successful configurations of the classic A3C algorithm in continuous control tasks. We conduct ablation experiments to verify the impact of the hidden layer dimensions on the convergence speed. The gradient clipping and entropy regularization parameters are determined through cross-validation to strike a balance between exploration efficiency and policy stability. The number of asynchronous threads is set to 8, which effectively balances computational resource utilization and training stability, as an excessive number of threads may lead to parameter update conflicts.
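As a point of reference, the PyTorch sketch below reflects the architecture described above: two shared 256-unit ReLU hidden layers, a Softmax head for discrete offloading decisions, a Gaussian (mean/standard deviation) head for continuous resource-allocation actions, and a scalar value head. Names such as state_dim, n_offload_modes, and n_cont_actions are placeholders rather than values from the paper, and this is a minimal sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch of the shared actor-critic network described in the response."""

    def __init__(self, state_dim: int, n_offload_modes: int, n_cont_actions: int):
        super().__init__()
        # Two shared hidden layers with 256 neurons and ReLU activations
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Discrete head: logits over offloading modes (Softmax applied in forward)
        self.offload_head = nn.Linear(256, n_offload_modes)
        # Continuous head: mean for resource-allocation actions, with a learned log-std
        self.mu_head = nn.Linear(256, n_cont_actions)
        self.log_std = nn.Parameter(torch.zeros(n_cont_actions))
        # Critic head: scalar state value
        self.value_head = nn.Linear(256, 1)

    def forward(self, state: torch.Tensor):
        h = self.backbone(state)
        offload_probs = torch.softmax(self.offload_head(h), dim=-1)
        mu = self.mu_head(h)
        std = self.log_std.exp().expand_as(mu)
        value = self.value_head(h)
        return offload_probs, mu, std, value
```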
Comment 15:
Figures 7 and 8
Figures 7 and 8 show cumulative rewards and the number of steps for the proposed A3C algorithm over different training rounds, respectively. However, these figures alone do not provide enough information to fully assess the effectiveness of the training process. Additional details should be provided, such as:
- Learning rate
- Batch size
- Number of workers/environments
- Neural network architecture and layer configuration
- Hardware setup used
Response: We totally understand the reviewer’s concern and thank you for your rigorous comment.
In the training process of the A3C algorithm, we adopt a dynamic learning rate strategy. The initial learning rate is set to 0.001 and, as training progresses, it is decayed by a factor of 0.95 every 100 training rounds. This setting helps the algorithm quickly explore the environment in the early stages of training and converge more accurately to the optimal policy in the later stages. We set the batch size to 64. The choice of batch size has a significant impact on training: larger batch sizes make parameter updates more stable but increase memory requirements and training time, while smaller batch sizes speed up training but may lead to larger fluctuations during the process. The A3C algorithm uses an asynchronous training mechanism to improve training efficiency; we set up 8 worker threads, each of which explores and learns in an independent instance of the environment. These worker threads interact with the environment simultaneously to collect experience samples and update the global neural network parameters. In our experiments, we used an NVIDIA GeForce RTX 3090 GPU with an Intel Core i9-12900K CPU to run the A3C algorithm, with PyTorch as the deep learning framework for the implementation. This hardware configuration meets the computational requirements of the A3C algorithm during large-scale training and ensures efficient training. At the same time, we allocate and optimise hardware resources reasonably to avoid resource bottlenecks and to ensure the accuracy and reliability of the experimental results.
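For illustration, the following minimal PyTorch sketch shows the dynamic learning-rate schedule described above (initial rate 0.001, multiplied by 0.95 every 100 training rounds) together with an RMSProp optimizer. The placeholder model and dummy loss are assumptions used only to make the snippet self-contained; they are not the authors' training code.

```python
import torch

model = torch.nn.Linear(16, 4)  # placeholder network standing in for the A3C model
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
# Decay the learning rate by a factor of 0.95 every 100 training rounds
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)

for episode in range(1000):
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss in place of the A3C loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advances the decay schedule once per training round
```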
Comment 16:
What is the time complexity of the proposed algorithm?
Response: Thank you for your comment.
The time complexity of the Q-learning algorithm is jointly determined by the state space, the action space, and the number of training rounds: O(E × Tmax × |S| × |A|), where |S| is the number of states, |A| is the number of actions, E is the number of training rounds, and Tmax is the maximum number of time steps per round. This complexity grows exponentially with the state dimension D. The time complexity of A3C depends on three key factors: the number of parallel threads (N), the number of time steps per update (T), and the computational complexity of the neural network. Its overall time complexity can be expressed as O(N × T × d), where d denotes the computational cost of a single neural network update.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript addresses the pertinent and challenging problem of joint computation offloading and resource allocation in energy harvesting-enabled Mobile Edge Computing (MEC) systems. The authors rightly identify the complexities arising from stochastic energy arrivals, dynamic channel conditions, and the need to balance multiple objectives (delay, energy consumption, task completion). The proposed approach, combining Lyapunov optimization for constraint handling and Deep Reinforcement Learning (DRL) for adaptive decision-making in a model-free setting, is a methodologically sound direction for tackling such stochastic optimization problems. The formulation progresses logically from the system model to the optimization problem, its transformation via Lyapunov theory, and finally to the application of Q-learning and Asynchronous Advantage Actor-Critic (A3C) algorithms. The inclusion of A3C to address the scalability limitations of Q-learning is appropriate. The simulation results generally support the viability of the DRL-based approach compared to baseline scenarios. However, several aspects require clarification and further development to strengthen the contribution and ensure reproducibility.
- The manuscript should more clearly delineate how this specific formulation, MDP design (state/action/reward), or the integration of these techniques offers a distinct advantage or addresses a gap unfilled by prior work cited (e.g., [31-35]). Is it the specific handling of task discarding, the particular structure of the hierarchical DRL, or the joint optimization scope?
- The action space A (Eq. 26 and line 381) includes the harvested energy ei(t). Typically, energy harvesting is an environmental process, not an action chosen by the agent. The agent observes the harvested energy (which influences the state, e.g., battery level Bi(t)), and decides how to use energy, not how much is harvested. If the agent can control aspects of harvesting (e.g., activate/deactivate harvesting), this needs to be explicitly modeled in the system and justified. Otherwise, ei(t) should likely be part of the state transition dynamics, not the action space.
- The reward function r(t) (Eq. 27) is defined as the negative of the per-slot objective derived from the Lyapunov drift-plus-penalty expression (P3). This is standard practice. However, the task discard cost Ci(t) component within G(t) (used in P3 and thus the reward) is not explicitly defined. Is it a fixed penalty, or does it depend on task characteristics?
- The action space involves discrete decisions (offloading mode) and potentially continuous decisions (resource allocation like fi,l(t), pi,m(t), θi,m(t)). The paper standardly describes the A3C framework but doesn't specify how this mixed action space is handled. Are continuous actions discretized, or is a distribution (like Gaussian) used for continuous parameters within the policy network output?
- The results compare the “Proposed scheme” against “Local computing,” “MEC servers computing,” and “Dynamic offloading.” The “Proposed scheme” appears based on DRL (Q-learning/A3C), but the “Dynamic offloading” baseline is ill-defined. Is it a heuristic, a simpler rule-based policy, or the Model Predictive Control (MPC) mentioned in the abstract? This needs explicit clarification. If comparing against MPC (as the abstract suggests), the specifics of the MPC implementation (prediction horizon, model used) should be provided.
- The results section (Figs 3-6) shows the performance of the “Proposed scheme”. Was this generated using Q-learning or A3C? The text mentions Q-learning (line 537), but Algorithm 2 details A3C, and A3C convergence is shown (Figs 7-8). Clarify which algorithm's results are presented in the performance comparison plots. If both were run, comparing their performance would be valuable.
- There seems to be a slight inconsistency in task generation description. Line 184 mentions task generation following a Bernoulli process (implying arrival/no arrival), while the task tuple definition A = <λi(t), τi,d(t)> uses λi(t) as the amount (data size) of the arrived task, suggesting variable sizes rather than just arrival events. Please clarify the exact task arrival and size generation process.
- The performance heavily relies on the weighting factors (αi, βi, γi) and the Lyapunov control parameter V. While the results show performance under specific (presumably chosen) weights, a sensitivity analysis showing how performance changes with different weightings would significantly strengthen the evaluation, demonstrating robustness and providing insights into tuning these crucial parameters for different QoS priorities.
Author Response
Comments and Suggestions for Authors
This manuscript addresses the pertinent and challenging problem of joint computation offloading and resource allocation in energy harvesting-enabled Mobile Edge Computing (MEC) systems. The authors rightly identify the complexities arising from stochastic energy arrivals, dynamic channel conditions, and the need to balance multiple objectives (delay, energy consumption, task completion). The proposed approach, combining Lyapunov optimization for constraint handling and Deep Reinforcement Learning (DRL) for adaptive decision-making in a model-free setting, is a methodologically sound direction for tackling such stochastic optimization problems. The formulation progresses logically from the system model to the optimization problem, its transformation via Lyapunov theory, and finally to the application of Q-learning and Asynchronous Advantage Actor-Critic (A3C) algorithms. The inclusion of A3C to address the scalability limitations of Q-learning is appropriate. The simulation results generally support the viability of the DRL-based approach compared to baseline scenarios. However, several aspects require clarification and further development to strengthen the contribution and ensure reproducibility.
The authors would like to thank the reviewer for the time and kind comments. In the following, we will address the concerns point by point, and we sincerely expect that our responses can clarify the ambiguity of this manuscript and successfully address the reviewer’s concerns.
- The manuscript should more clearly delineate how this specific formulation, MDP design (state/action/reward), or the integration of these techniques offers a distinct advantage or addresses a gap unfilled by prior work cited (e.g., [31-35]). Is it the specific handling of task discarding, the particular structure of the hierarchical DRL, or the joint optimization scope?
Response: In terms of technology integration, this paper has significant and distinctive advantages and effectively addresses several shortcomings of existing studies. Regarding the decoupling of the energy causality constraint, existing studies such as [28][30] suffer from decision dependence across time slots due to energy constraints, which limits system performance and flexibility. In this paper, by introducing the virtual energy queue B̃i(t) = Bi(t) − θi and combining it with Lyapunov optimisation, the long-term coupling problem is decomposed into time-slot-level subproblems. This provides a new perspective for energy management, achieves flexible and efficient energy allocation, ensures system stability, resolves the decision-dependence problem, and improves overall system performance. Regarding the hierarchical DRL structure, a traditional single DRL method has high complexity and low efficiency when handling a high-dimensional continuous action space together with discrete decision-making. In this paper, Q-learning and A3C are integrated: Q-learning is responsible for discrete offloading decisions, enabling fast and accurate decision-making, while A3C focuses on high-dimensional continuous resource allocation and reduces complexity through distributed learning; the two work together to improve decision-making efficiency. The convergence results in Figs. 7-8 show that the approach outperforms traditional methods in convergence speed and stability, achieving more efficient resource allocation and system optimisation. In addition, the MDP framework is designed with this technology integration in mind: the state space captures environmental changes in real time to provide accurate decision-making information, the action space forms a mixed integer-continuous action space to enhance the flexibility of the system, and the reward function realises the synergistic optimisation of multiple objectives. In summary, the advantages of this paper's technology integration lie in the decoupling of the energy causality constraint, the hierarchical DRL structure, and the organic fusion and synergistic optimisation of multiple techniques, which are valuable for overcoming existing problems and filling research gaps, and also provide useful references for subsequent related research.
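For readability, the virtual-queue construction and drift-plus-penalty objective referred to above can be written in the standard Lyapunov-optimisation form sketched below; the notation follows the usual textbook convention and may differ slightly from the symbols used in the manuscript.

```latex
% Virtual energy queue obtained by shifting the battery level by a perturbation \theta_i
\tilde{B}_i(t) = B_i(t) - \theta_i
% Quadratic Lyapunov function and one-slot conditional drift
L(t) = \tfrac{1}{2} \sum_i \tilde{B}_i(t)^2, \qquad
\Delta(t) = \mathbb{E}\left[ L(t+1) - L(t) \mid \tilde{B}(t) \right]
% Per-slot drift-plus-penalty objective minimised in each time slot,
% with \Phi(t) the instantaneous system cost and V the control parameter
\Delta(t) + V \, \mathbb{E}\left[ \Phi(t) \mid \tilde{B}(t) \right]
```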
- The action space A (Eq. 26 and line 381) includes the harvested energy ei(t). Typically, energy harvesting is an environmental process, not an action chosen by the agent. The agent observes the harvested energy (which influences the state, e.g., battery level Bi(t)), and decides how to use energy, not how much is harvested. If the agent can control aspects of harvesting (e.g., activate/deactivate harvesting), this needs to be explicitly modeled in the system and justified. Otherwise, ei(t) should likely be part of the state transition dynamics, not the action space.
Response: We thank the reviewers for their precise corrections on the definition of action space! We have moved the energy harvesting quantity ei(t) from the action space to the state space according to the reinforcement learning paradigm and practical physical implications, ensuring that the model conforms to the basic framework of agent-controllable action-environmental feedback state.
- The reward function r(t) (Eq. 27) is defined as the negative of the per-slot objective derived from the Lyapunov drift-plus-penalty expression (P3). This is standard practice. However, the task discard cost Ci(t) component within G(t) (used in P3 and thus the reward) is not explicitly defined. Is it a fixed penalty, or does it depend on task characteristics?
Response: We thank the reviewer for the attention to the definition of the task discarding cost Ci(t). This cost component does need to be more clearly articulated in the paper. The following is the revised definition and design logic, which ensures the rationality and interpretability of the reward function; we have added the relevant content and definition in subsection 3.3:
On page 8, line 287 to 298
Revision:
3.3 Task drop model
When the remaining energy of an MD is insufficient to support local computation of the tasks generated in the current time interval t or their offloading to the edge servers at the MBS for processing, or when the channel state information from the MD to the MBS is unstable during task offloading and deep channel fading makes it difficult to offload the task successfully, the computation tasks generated in the current time interval are discarded. Since discarded tasks affect the MDs' task processing, we impose a penalty on each discarded task. The penalty cost of an MD for a discarded computation task in time interval t is given as follows:
where the corresponding coefficient denotes the penalty cost of each discarded task of the MD.
- The action space involves discrete decisions (offloading mode) and potentially continuous decisions (resource allocation like fi,l(t), pi,m(t), θi,m(t)). The paper standardly describes the A3C framework but doesn't specify how this mixed action space is handled. Are continuous actions discretized, or is a distribution (like Gaussian) used for continuous parameters within the policy network output?
Response: We thank the reviewer for pointing out the unclear description of how the A3C framework handles hybrid action spaces in the paper. In this study, for the hybrid action space containing discrete decisions (offloading modes) and continuous decisions (resource allocation), we adopt the following approach:
For discrete offloading mode decisions, we use the Softmax function to output the probability distribution over offloading modes. Assuming there are n offloading modes, the policy network output is passed through the Softmax function to obtain an n-dimensional probability vector πθ(x∣s), where x denotes the offloading mode, s denotes the current state, and θ is a parameter of the policy network. At each time step, sampling is performed based on this probability distribution to obtain a specific offloading mode decision.
For continuous resource allocation actions, e.g. fi,l(t) and pi,m(t), we use a Gaussian distribution to model the continuous outputs of the policy network. The policy network outputs the mean μ and standard deviation σ of the continuous actions, and specific continuous action values are then sampled from the Gaussian distribution N(μ, σ²). At the same time, to ensure that the action values lie within a reasonable range, we clip the sampled action values.
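A minimal Python sketch of the sampling scheme just described, assuming the policy network outputs Softmax probabilities for the discrete offloading mode and a Gaussian mean/standard deviation for each continuous resource-allocation variable; the function name, clipping bounds, and tensor shapes are illustrative assumptions rather than the paper's actual settings.

```python
import torch
from torch.distributions import Categorical, Normal

def sample_mixed_action(offload_probs, mu, std, cont_low=0.0, cont_high=1.0):
    """Sample a hybrid action: a discrete offloading mode plus clipped continuous allocations."""
    # Discrete part: sample an offloading mode from the Softmax distribution
    mode_dist = Categorical(probs=offload_probs)
    offload_mode = mode_dist.sample()
    # Continuous part: sample resource allocations from N(mu, std^2), then clip to a valid range
    cont_dist = Normal(mu, std)
    cont_action = cont_dist.sample().clamp(cont_low, cont_high)
    # Log-probabilities of both parts are summed when forming the policy-gradient loss
    log_prob = mode_dist.log_prob(offload_mode) + cont_dist.log_prob(cont_action).sum(-1)
    return offload_mode, cont_action, log_prob
```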
- The results compare the “Proposed scheme” against “Local computing,” “MEC servers computing,” and “Dynamic offloading.” The “Proposed scheme” appears based on DRL (Q-learning/A3C), but the “Dynamic offloading” baseline is ill-defined. Is it a heuristic, a simpler rule-based policy, or the Model Predictive Control (MPC) mentioned in the abstract? This needs explicit clarification. If comparing against MPC (as the abstract suggests), the specifics of the MPC implementation (prediction horizon, model used) should be provided.
Response: We thank the reviewer for the careful correction regarding the baseline definition! To clarify, the 'dynamic offloading' baseline is defined in this paper as a simple heuristic rule-based policy rather than Model Predictive Control (MPC).
- The results section (Figs 3-6) shows the performance of the “Proposed scheme”. Was this generated using Q-learning or A3C? The text mentions Q-learning (line 537), but Algorithm 2 details A3C, and A3C convergence is shown (Figs 7-8). Clarify which algorithm's results are presented in the performance comparison plots. If both were run, comparing their performance would be valuable.
Response: We thank the reviewer for the careful correction regarding the attribution of the algorithms in the results section! After verification, the 'proposed scheme' in the original results section uniformly refers to the A3C algorithm; the experimental results of Q-learning and A3C were not clearly differentiated before, which led to confusion. We have made a clear distinction in the experimental section and added Q-learning as a baseline for comparative analysis.
- There seems to be a slight inconsistency in task generation description. Line 184 mentions task generation following a Bernoulli process (implying arrival/no arrival), while the task tuple definition A = <λi(t), τi,d(t)> uses λi(t) as the amount (data size) of the arrived task, suggesting variable sizes rather than just arrival events. Please clarify the exact task arrival and size generation process.
Response: We thank the reviewer for pointing out the ambiguity in the presentation of the task generation process. Task generation in the paper consists of two independent steps: the task-arrival event and the generation of task attributes. The statement that the generation of computational tasks follows a Bernoulli process refers only to the binary event of whether a task arrives or not, not to the generation of data sizes. The data size λi(t) is an attribute of the arrived task, and its distribution is independent of the Bernoulli process. Below is the specific correction and its location in the revised paper:
The arrival events of computational tasks obey a Bernoulli process, i.e., within each time interval t, the task of user i arrives with a given probability and does not arrive otherwise.
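The two-step task-generation process described above can be sketched as follows. This is a hypothetical illustration: the arrival probability, data-size range, and deadline range are placeholders, not the simulation settings actually used in the paper.

```python
import random

def generate_task(arrival_prob=0.6, size_low_bits=2e5, size_high_bits=1e6):
    """Step 1: Bernoulli arrival event; Step 2: independently sampled task attributes."""
    if random.random() >= arrival_prob:   # no task arrives in this time slot
        return None
    data_size = random.uniform(size_low_bits, size_high_bits)  # lambda_i(t): task data size
    deadline_slots = random.randint(1, 5)                       # tau_{i,d}(t): completion deadline
    return {"data_size_bits": data_size, "deadline_slots": deadline_slots}
```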
- The performance heavily relies on the weighting factors (αi, βi, γi) and the Lyapunov control parameter V. While the results show performance under specific (presumably chosen) weights, a sensitivity analysis showing how performance changes with different weightings would significantly strengthen the evaluation, demonstrating robustness and providing insights into tuning these crucial parameters for different QoS priorities.
Response: We gratefully appreciate your valuable suggestion. Thank you for your attention to the weighting factors (αi, βi, γi) and the Lyapunov parameter V, which are indeed the core tuning variables for system optimisation. Although the manuscript does not contain dedicated sensitivity analysis experiments, we indirectly show the impact of parameter variations on performance through the design of the theoretical framework, the interpretation of the physical meanings of the parameters, and the analysis of scenario-based strategies. By adjusting αi, the system dynamically trades off between local computation and offloading decisions; for example, when αi increases, the algorithm tends to reduce high-energy offloading transmissions, corresponding to lower energy consumption. When βi increases, the computation ratio of the MEC server increases, exploiting its stronger computing power to reduce task completion latency and reflecting adaptation to low-latency services. γi affects the task-discard policy: when γi = 0, the system may discard more tasks due to lack of energy (as shown in Fig. 3, the high discard rate of the 'dynamic offloading' baseline can be indirectly inferred), whereas when γi > 0 the algorithm reduces the number of discarded tasks by reserving energy or adjusting the offloading policy.
Although these analyses do not quantify comparative data under different weights, the moderating effect of the weighting factors on system behaviour is implied by the strategy differences in typical scenarios, which provides qualitative guidance for adjusting the parameters according to QoS demands in practical applications.
As a trade-off factor between the immediate cost Φ(t) and long-term queue stability (the Lyapunov drift Δ(t)), a larger V means the algorithm focuses more on the long-term stability of the energy queue Bi(t). For example, in scenarios with large fluctuations in energy harvesting, increasing V motivates the algorithm to reduce immediate high-energy decisions, thus avoiding task discarding due to battery depletion.
Although the performance under different values of V is not directly tested, the paper demonstrates robustness under random energy arrivals through the combination of the A3C algorithm with the Lyapunov framework: the system recovers better than the baseline approaches in the case of energy dips, and the negative correlation between the energy harvesting power and the task cost in Fig. 4 hints at the effectiveness of the energy management strategy.
Through the interpretation of the physical meaning of the parameters in the theoretical modelling stage, the logic of policy regulation in the algorithm design, and the implied trends in the experimental results, the paper has indirectly demonstrated the impact of the weighting factors and V on performance. The current argumentation focuses on the core innovation points and meets the journal's requirements for research depth. Thank you for your suggestions; we will further explore parameter sensitivity in future work to provide more detailed guidance for practical deployment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
1. The manuscript would benefit from a clearer explanation regarding the decision-making entities within the proposed framework. While it is stated that users can choose between local computation, offloading, or task discarding, it remains unclear whether such decisions are made independently or through information obtained from the server. It would be helpful to specify whether users obtain server-side information via a control channel.
2. The manuscript lacks a clear distinction between the proposed method and the prior works discussed in the related work section.
3. Although full-duplex techniques are mentioned, the rationale for their inclusion is not explicitly discussed. The manuscript does not address technical issues associated with full-duplex communication.
4. It would be better to provide a rationale for setting the path loss factor to a fixed value of -4.
5. The trade-off control parameter V, which plays a critical role in Lyapunov optimization, is not sufficiently explained. The authors are encouraged to provide a detailed discussion on how this parameter is set and its impact on performance.
6. While the authors claim that the A3C algorithm is simpler, faster, and more robust than other DRL methods, the manuscript does not include any comparative performance evaluations to substantiate this claim.
Author Response
We sincerely appreciate the reviewer’s consideration and constructive comments. The concerns raised by the reviewer are very helpful in ensuring the quality of this manuscript. We have addressed the concerns point by point in the following. We sincerely expect that our responses can clarify the ambiguities of this manuscript and successfully address the reviewer’s concerns.
Comments and Suggestions for Authors
- The manuscript would benefit from a clearer explanation regarding the decision-making entities within the proposed framework. While it is stated that users can choose between local computation, offloading, or task discarding, it remains unclear whether such decisions are made independently or through information obtained from the server. It would be helpful to specify whether users obtain server-side information via a control channel.
Response: Thank you very much for your valuable comments on our manuscript. We fully agree with you that clarifying the behavioural patterns of decision-making entities and whether they rely on server information for decision-making is essential for understanding the proposed framework. In order to enhance the clarity and logic of the manuscript, we will make specific changes and additions in the following sections:
3.1 Computing Model
Each MD independently makes a decision on how to process its computational tasks. However, these decisions are not entirely independent. MDs can obtain certain information from the MEC servers via a control channel. This information includes, but is not limited to, the current computational load of the MEC server, the available bandwidth for offloading, and the expected delay in processing tasks on the server. Based on this information, MDs dynamically choose between local computation, offloading to the MEC server, or discarding the task.
5.1 MDP Framework
Within the MDP framework, it is important to note the information interaction between MDs and MEC servers. MDs, as agents, can obtain state information from the environment, which includes not only local parameters such as battery level and computation capacity, but also remote parameters provided by MEC servers via a control channel. This control channel facilitates the sharing of critical information needed for making informed offloading and resource allocation decisions.
- The manuscript lacks a clear distinction between the proposed method and the prior works discussed in the related work section.
Response: We gratefully appreciate your comment. We have added a discussion of the gaps in existing research to the related work section.
Revision: Current work mainly focuses on static resource allocation or single-objective optimisation, but lacks joint optimisation for the problem of dynamic energy harvesting coupled with high-dimensional decision-making in multi-user multi-server scenarios. In addition, traditional model predictive control methods struggle to guarantee robustness in the face of time-varying channels and stochastic energy arrivals, while existing deep reinforcement learning methods do not effectively incorporate the Lyapunov optimisation framework to address energy causality constraints. In contrast, this study proposes for the first time a joint optimisation framework of hierarchical deep reinforcement learning (HDRL) and Lyapunov drift-penalty, which significantly reduces the complexity of the high-dimensional action space by decomposing the long-term optimisation into per-time-slot subproblems, while ensuring energy queue stability.
- Although full-duplex techniques are mentioned, the rationale for their inclusion is not explicitly discussed. The manuscript does not address technical issues associated with full-duplex communication.
Response: Thanks to the reviewer for pointing out the lack of discussion of full-duplex technology. The paper mentions ‘Full Duplex / Half Duplex Energy Harvesting Technology’ in the system model but does not expand on the underlying principles and technical implications. We have therefore removed the full-duplex technique.
- It would be better to provide a rationale for setting the path loss factor to a fixed value of -4.
Response: We sincerely appreciate the useful suggestion regarding the path-loss factor setting! The path-loss factor is a core parameter of the communication model, and its value directly affects the calculation of channel gain and transmission power. The chosen value of -4 is a typical path-loss exponent for an urban microcell scenario.
- The trade-off control parameter V, which plays a critical role in Lyapunov optimization, is not sufficiently explained. The authors are encouraged to provide a detailed discussion on how this parameter is set and its impact on performance.
Response: Thank you for your valuable comments. In response to your point about the insufficient explanation of the trade-off control parameter V in Lyapunov optimisation, we have made the following additions and modifications in the revised draft:
A detailed discussion of the parameter V has been added to the paragraph following Lemma 1:
The parameter V is used in the Lyapunov framework to balance the trade-off between queue stability and the optimisation objective. Specifically, a larger value of V is more inclined to minimise the long-term average cost but may sacrifice queue stability, while a smaller value of V prioritises the stability of the virtual energy queue but may reduce the convergence efficiency of the optimisation objective. By tuning V, the system can achieve a flexible trade-off between performance and dynamic resource management.
A theoretical analysis of the range of values of V and its boundary conditions is added:
According to Lyapunov optimisation theory, the value of V must satisfy 0 < V < ∞, but its upper limit is constrained by the stability condition of the energy queue. For example, when V is too large, the drift term of the virtual queue may exceed the stable range, so that the energy constraint is no longer satisfied. Therefore, a reasonable range of V needs to be determined by experiment or theoretical derivation in practical settings.
The practical significance of parameter V is summarised in the conclusion:
The reasonable setting of parameter V is crucial to system performance. Future research can incorporate online adaptive methods to dynamically adjust V to cope with the complex demands of time-varying environments.
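For completeness, the standard drift-plus-penalty analysis underlying this discussion yields the well-known performance/backlog trade-off, stated below in its generic form; the constant C depends on the specific system model, and the exact statement for this manuscript would follow from its own Lemma 1.

```latex
% Time-average cost is within O(1/V) of the optimum \Phi^{*},
% at the price of an O(V) bound on the time-average virtual-queue backlog
\bar{\Phi} \;\le\; \Phi^{*} + \frac{C}{V},
\qquad
\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\!\left[ \tilde{B}_i(t) \right] = O(V)
```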
- While the authors claim that the A3C algorithm is simpler, faster, and more robust than other DRL methods, the manuscript does not include any comparative performance evaluations to substantiate this claim.
Response: We sincerely thank you for your valuable comments on the paper. We agree with you on the importance of a cross-sectional comparison of algorithm performance. Regarding the A3C algorithm, it has a significant advantage in dealing with hybrid action spaces. A3C utilises a single neural network to process discrete offloading decisions through a Softmax layer, which outputs the action probabilities; for continuous resource allocations such as CPU frequency and transmission power, the mean and standard deviation are generated directly with the help of a Gaussian distribution. This design avoids the complex process of designing discrete and continuous action processing modules separately, as in DQN/DDPG.
Although no direct experimental comparisons are made in the paper, we provide indirect support for the advantages of the A3C algorithm in several ways. Compared with synchronous algorithms such as PPO, the asynchronous architecture of A3C greatly reduces the computational complexity. In addition, the distributed nature of the MEC system is naturally suited to the asynchronous architecture of A3C, which further highlights the advantages of A3C in this research scenario.
Through theoretical analyses of the simplicity of the A3C algorithmic architecture, the efficiency of the training mechanism, and the robustness of environment adaptation, combined with the descriptions of algorithmic complexity, memory occupation, and scenario characteristics in the paper, we have argued for the advantages of A3C. These arguments are based not only on general knowledge of the field and the consensus in the literature, but are also closely integrated with the design logic of the model to ensure the reasonableness of the conclusions.
We are well aware of the importance of comparative experiments, and will make it a priority in our future research to supplement the comparative experiments of different DRL methods to further quantify the performance differences of the algorithms and provide a more solid experimental basis for the research results. Thank you again for your suggestion, which is of great significance for us to improve the quality of our research!
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I appreciate the authors’ efforts in addressing the initial review comments through their revisions and responses. Below, I provide feedback on several points that require further attention, along with suggestions to enhance the manuscript’s clarity and rigor.
Comment 1
While Lyapunov-based methods are referenced in both the contributions and cited work (Paper 30), the foundational source for the specific Lyapunov optimization framework applied in this study requires explicit citation. [Line 188]
Comment 2
While the authors' textual comparisons in Section 2.3 provide useful insights, the current presentation lacks explicit, structured differentiation between cited works and this study's contributions, as noted by the reviewer. To address this effectively, both textual and tabular enhancements are strongly recommended to improve clarity and reader accessibility:
- Textual Revisions
The textual revisions can adopt direct comparative statements after discussing each prior work, as exemplified below:
- While [Author] achieved [X] through [method A], our work advances this by [Y] through [method B], specifically addressing [gap Z].
- Unlike previous approaches focusing on [single objective X], our framework introduces [multi-objective Y] with [novel constraint Z].
This approach ensures readers immediately recognize how the study advances the field, rather than inferring distinctions from dispersed textual comparisons.
- Tabular Comparison
A summary table should be introduced for prior works against this study’s contributions. For instance, a six-column table could be structured as follows:
- Column 1: Existing works (Author/Year)
- Column 2: Methodology
- Column 3: Optimization Objectives
- Column 4: Experimental Scenarios
- Column 5: Gaps/Limitations
- Column 6: Our Novel Contributions
These changes will significantly strengthen the paper’s positioning within the existing literature while addressing the original intent of Comment 6 in round 1 of the review.
Comment 3
The current caption for Figure 1, which simply states "System Model," is insufficiently descriptive. For improved clarity and to aid reader understanding, it is recommended to expand the caption to include a brief but informative description of the figure’s content. This will help readers quickly grasp the purpose and components of the system model without having to refer extensively to the main text.
Comment 4
The use of "intelligentsia" in sentence (Lines 466-468) introduces potential ambiguity. While the metaphorical extension to technical decision-making entity is creative, for academic papers, which follow standard academic English conventions, this usage would likely be considered non-normative unless explicitly defined.
Comment 5
The response to Comment 12 from round 1 of the review should explicitly differentiate between simulation parameters taken from Paper 34 and those determined through pre-experimental testing and theoretical analyses in this study. This clarification is critical to ensure readers can accurately assess the methodological rigor of the parameter selection process and its impact on the validity of the results.
Comment 6
What is meant by sentence “The selection of parameters is well founded.” [Line 608]
Comment 7
The response to Comment 15 in round 1 of review regarding the A3C algorithm’s training process and the system used for training should be revised and explicitly incorporated into the manuscript. Including these details will ensure readers understand both the methodological implementation (e.g., adopt a dynamic learning rate strategy for A3C, initial learning rate, learning rate decay, and batch size) and the computational environment (e.g., hardware specifications, NVIDIA GeForce RTX 3090 GPUs with Intel Core i9-12900K CPUs), which are critical for reproducibility and comparative analysis.
Comment 8
The response to Comment 16 from round 1 of review regarding the proposed algorithm's time complexity should be explicitly incorporated into the manuscript. Including this analysis will provide readers with critical insights into the computational efficiency and scalability of your methodology, ensuring transparency and reproducibility.
Author Response
I appreciate the authors’ efforts in addressing the initial review comments through their revisions and responses. Below, I provide feedback on several points that require further attention, along with suggestions to enhance the manuscript’s clarity and rigor.
The authors would like to thank the reviewer for their time and kind comments. In the following, we address the concerns point by point, and we sincerely hope that our responses clarify the ambiguities in this manuscript and successfully address the reviewer’s concerns.
Comment 1
While Lyapunov-based methods are referenced in both the contributions and cited work (Paper 30), the foundational source for the specific Lyapunov optimization framework applied in this study requires explicit citation. [Line 188]
Response: Thank you for reviewing our manuscript and providing valuable comments. In this study we did apply a specific Lyapunov optimization framework, and we apologize for not clearly citing its foundational source in the text. We have revised line 188 to add an accurate citation to the classic literature that details the theory of the Lyapunov optimization framework we apply, which makes our research more rigorous and complete in terms of theoretical references.
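For reference, the framework in question is the standard drift-plus-penalty approach from that literature; in generic notation (not necessarily the manuscript's exact symbols or queue definitions), each time slot minimizes an upper bound on

\[ \Delta(\Theta(t)) + V\,\mathbb{E}\!\left[\,C(t)\mid \Theta(t)\right], \]

where \(\Theta(t)\) collects the virtual (e.g., energy/battery) queues, \(\Delta(\Theta(t)) = \mathbb{E}\!\left[L(\Theta(t+1)) - L(\Theta(t)) \mid \Theta(t)\right]\) is the one-slot drift of the quadratic Lyapunov function \(L(\Theta(t)) = \tfrac{1}{2}\sum_i \Theta_i(t)^2\), \(C(t)\) is the per-slot system cost, and \(V>0\) trades off queue stability against cost minimization.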
Comment 2
While the authors' textual comparisons in Section 2.3 provide useful insights, the current presentation lacks explicit, structured differentiation between cited works and this study's contributions as noted by the reviewer. To address this effectively, both textual and tabular enhancements are strongly recommended to improve clarity and reader accessibility:
- Textual Revisions
The textual revisions can adopt direct comparative statements after discussing each prior work, as exemplified below:
- While [Author] achieved [X] through [method A], our work advances this by [Y] through [method B], specifically addressing [gap Z].
- Unlike previous approaches focusing on [single objective X], our framework introduces [multi-objective Y] with [novel constraint Z].
This approach ensures readers immediately recognize how the study advances the field, rather than inferring distinctions from dispersed textual comparisons.
- Tabular Comparison
A summary table should be introduced for prior works against this study’s contributions. For instance, a six-column table could be structured as follows:
- Column 1: Existing works (Author/Year)
- Column 2: Methodology
- Column 3: Optimization Objectives
- Column 4: Experimental Scenarios
- Column 5: Gaps/Limitations
- Column 6: Our Novel Contributions
These changes will significantly strengthen the paper’s positioning within the existing literature while addressing the original intent of Comment 6 in round 1 of the review.
Response: We greatly appreciate your comment. We have made the suggested textual and tabular enhancements to improve clarity and accessibility for readers. For example, the revised text now includes the following comparative statements:
(1) However, these studies lack a comprehensive view of integrating energy harvesting with MEC in multi-user, multi-server scenarios while handling multiple cost components. Our paper fills this gap by designing an EH-based MEC system, using a Lyapunov-based architecture, and applying Q-learning to optimize resource allocation and offloading decisions more effectively.
(2) Prior work on EH in MEC includes diverse approaches such as hybrid energy models and Lyapunov-based optimization, but it lacks comprehensive handling of multi-user, multi-server scenarios with nonlinear EH while jointly optimizing resources and minimizing the total cost. Our paper bridges this gap by designing an EH-based MEC system with nonlinear EH and using a Lyapunov-based architecture with Q-learning over MDP decisions to optimize resource allocation and offloading and minimize the total cost more effectively.
(3) Previous MEC research with EH has used AI methods for optimization. Yet, at the scale of many users and servers, it fails to comprehensively integrate nonlinear EH, multi-user multi-server scenarios, and total cost minimization. Our study bridges this gap by designing an EH-based MEC system with nonlinear EH. Utilizing a Lyapunov-based architecture and Q-learning over MDP decisions, we optimize resource allocation and offloading while effectively reducing the total cost, an area previously unaddressed.
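For context, the nonlinear EH behaviour referred to above is commonly captured in the literature by a sigmoidal saturation model of the following form (generic notation; the manuscript's exact formulation and parameter values may differ):

\[ E_h(t) = \frac{\dfrac{M}{1+e^{-a\left(P_{\mathrm{in}}(t)-b\right)}} - M\,\Omega}{1-\Omega}, \qquad \Omega = \frac{1}{1+e^{ab}}, \]

where \(P_{\mathrm{in}}(t)\) is the received RF power, \(M\) is the maximum harvestable power at saturation, and \(a, b\) are circuit-dependent parameters.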
Comment 3
The current caption for Figure 1, which simply states "System Model," is insufficiently descriptive. For improved clarity and to aid reader understanding, it is recommended to expand the caption to include a brief but informative description of the figure’s content. This will help readers quickly grasp the purpose and components of the system model without having to refer extensively to the main text.
Response: Thank you very much for your valuable suggestion. We agree that the caption of Figure 1 was insufficiently descriptive. We have expanded and improved it so that readers can more quickly grasp the purpose and components of the system model without referring extensively to the main text.
Revision: The caption of Figure 1 has been changed to “Offloading Model for Mobile Edge Computing Based on Nonlinear Energy Harvesting”.
Comment 4
The use of "intelligentsia" in sentence (Lines 466-468) introduces potential ambiguity. While the metaphorical extension to technical decision-making entity is creative, for academic papers, which follow standard academic English conventions, this usage would likely be considered non-normative unless explicitly defined.
Response: Thank you for highlighting this critical terminological issue. We fully agree with your observation regarding the unconventional use of "intelligentsia" in the technical context of mobile edge computing (MEC) and energy harvesting systems. To address this concern, we have revised the text as follows:
Revised Sentence (Lines 466-468):
"In MEC and energy harvesting environments, such a rewarding design can motivate intelligent agents to satisfy task requirements while rationally utilizing energy to minimize the total system cost."
Comment 5
The response to Comment 12 from round 1 of the review should explicitly differentiate between simulation parameters taken from Paper 34 and those determined through pre-experimental testing and theoretical analyses in this study. This clarification is critical to ensure readers can accurately assess the methodological rigor of the parameter selection process and its impact on the validity of the results.
Response: Thank you very much for your valuable comments. We fully agree on the importance of clearly distinguishing the sources of the parameters so that readers can assess the methodological rigor and the validity of the results. In the manuscript, we have carefully sorted the simulation parameters and now clearly distinguish those taken from Paper 34 from those determined in this study through pre-experimental tests and theoretical analyses. For the parameters taken from Paper 34, we have added an explicit citation in the relevant section.
Comment 6
What is meant by sentence “The selection of parameters is well founded.” [Line 608]
Response: Thank you for raising this important point. We agree that the original phrasing lacks specificity and may inadvertently suggest incomplete justification for parameter selection. To address this concern, we have removed the statement "The selection of parameters is well founded" [previously Line 608].
Comment 7
The response to Comment 15 in round 1 of review regarding the A3C algorithm’s training process and the system used for training should be revised and explicitly incorporated into the manuscript. Including these details will ensure readers understand both the methodological implementation (e.g., adopt a dynamic learning rate strategy for A3C, initial learning rate, learning rate decay, and batch size) and the computational environment (e.g., hardware specifications, NVIDIA GeForce RTX 3090 GPUs with Intel Core i9-12900K CPUs), which are critical for reproducibility and comparative analysis.
Response: Thank you very much for your valuable comments. Following your suggestion, the details of the A3C training process (the dynamic learning rate strategy, initial learning rate, learning rate decay, and batch size) and the computational environment used for training have been explicitly added to the manuscript.
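As an illustration of what such a dynamic learning-rate strategy can look like (placeholder values and a stand-in network, assuming PyTorch; these are not the settings reported in the manuscript):

    import torch
    import torch.nn as nn

    policy = nn.Linear(10, 4)                                    # stand-in for the actor-critic network
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)   # initial learning rate (placeholder)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)  # per-episode decay

    batch_size = 64
    for episode in range(3):
        loss = policy(torch.randn(batch_size, 10)).pow(2).mean()  # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()   # decays the learning rate after each episode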
Comment 8
The response to Comment 16 from round 1 of review regarding the proposed algorithm's time complexity should be explicitly incorporated into the manuscript. Including this analysis will provide readers with critical insights into the computational efficiency and scalability of your methodology, ensuring transparency and reproducibility.
Response: Thank you very much for your valuable comments. We are keenly aware of the importance of including an analysis of the proposed algorithm's time complexity in the manuscript to enhance the transparency and reproducibility of the study. Following your suggestion, we have explicitly incorporated the time complexity analysis from our response to Comment 16 into the manuscript.
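For reference, a generic per-training cost bound for an A3C actor-critic built from fully connected layers (illustrative only; the manuscript's own analysis and notation may differ) is

\[ \mathcal{O}\!\left( K \cdot T \cdot \sum_{l=1}^{L} n_{l-1}\, n_l \right), \]

where \(K\) is the number of asynchronous workers, \(T\) the number of interaction steps per worker, and \(n_l\) the width of layer \(l\), since each step requires one forward and one backward pass whose cost is dominated by the layer-wise matrix multiplications.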
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed my concerns.
Author Response
We sincerely appreciate the reviewer’s consideration and constructive comments. The concerns raised by the reviewer are very helpful in ensuring the quality of this manuscript.