Article

Research on Offloading and Resource Allocation for MEC with Energy Harvesting Based on Deep Reinforcement Learning

1 School of Electrical and Information Engineering, Hunan University, Changsha 410082, China
2 School of Electronics and Information Engineering, Jinggangshan University, Ji'an 343009, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(10), 1911; https://doi.org/10.3390/electronics14101911
Submission received: 28 March 2025 / Revised: 30 April 2025 / Accepted: 6 May 2025 / Published: 8 May 2025

Abstract:
Mobile edge computing (MEC) systems empowered by energy harvesting (EH) significantly enhance sustainable computing capabilities for mobile devices (MDs). This paper investigates a multi-user multi-server MEC network, in which energy-constrained users dynamically harvest ambient energy and flexibly allocate resources among local computation, task offloading, and intentional task discarding. We formulate a stochastic optimization problem aiming to minimize the time-averaged weighted sum of execution delay, energy consumption, and task discard penalty. To address the energy causality constraints and temporal coupling effects, we develop a Lyapunov optimization-based drift-plus-penalty framework that decomposes the long-term optimization into sequential per-time-slot subproblems. Furthermore, to overcome the curse of dimensionality in high-dimensional action spaces, we propose hierarchical deep reinforcement learning (DRL) solutions incorporating both Q-learning with experience replay and asynchronous advantage actor–critic (A3C) architectures. Extensive simulations demonstrate that our DRL-driven approach achieves lower costs than conventional model predictive control methods, while maintaining robust performance under stochastic energy arrivals and channel variations.

1. Introduction

The increasing number of smart devices and mobile terminals generates a large amount of data to be processed, which also poses a significant challenge to the operation of mobile devices and network systems [1]. The limited computing power and data storage capacity of mobile devices in the processing of application data can lead to high energy consumption and delay in the completion of computing tasks [2]. For computationally intensive real-time applications, computing tasks often cannot be completed within strict time constraints due to the limited computing resources and battery capacity of mobile devices. This not only fails to meet the requirements of ultra-low latency applications but also leads to task abandonment when continuous execution is infeasible, thereby degrading device performance [3].
To address these challenges, MEC emerges as a promising paradigm that deploys computational and storage resources in close proximity to mobile devices and network systems [4]. MEC offloading enables mobile devices to achieve computational agility, low latency, and energy efficiency [5]. Compared with cloud computing, MEC servers reduce task offloading latency and minimize both task transmission and computational energy consumption by being physically near end-users during task execution [6]. Massive tasks generated by mobile devices can be offloaded to nearby MEC servers for processing, alleviating the computational burden on devices and addressing resource limitations [7].
MEC offloading allows resource-constrained mobile devices to exploit MEC server computing resources [8]. However, mobile devices face strict battery capacity limitations and cannot provide a continuous energy supply for computational tasks. This energy constraint frequently leads to insufficient power for task processing; even after offloading, tasks may still be discarded due to resource shortages. Such interruptions significantly degrade processing efficiency and ultimately reduce service quality [9,10]. Although the power supply problem can be mitigated with high-capacity batteries and uninterrupted charging, this approach increases hardware costs and is infeasible for mobile devices in certain scenarios [11]. To overcome these bottlenecks, energy harvesting technology has emerged as a promising green MEC candidate that can sustainably extend the task processing of battery capacity-constrained mobile devices [12] and help ensure that tasks are completed within the allowed time. Mobile devices can harvest energy from solar power, wind power, human motion, and many other sources [9]. The collected energy can be used by the mobile device for local task processing or for offloading computing tasks to MEC servers or the cloud. However, due to the dynamic changes and diversity of energy harvesting, computational offloading, and user channel state information in EH-enabled MEC network systems, existing methods struggle to handle these complexities, which motivates the use of deep reinforcement learning for adaptive decision making. In addition, the differing demands that users' computational applications place on the MEC network make the resource allocation problem even more complex [13].
The resource allocation problem in the time-varying EH-enabled MEC scenario with multiple users and multiple MEC servers involves a high parameter dimensionality, and accurate a priori statistical information is difficult to obtain, which makes resource allocation increasingly complex [14] and poses a challenge for the energy management of users in MEC [3]. Machine learning methods such as deep learning and deep reinforcement learning provide a novel solution to problems that arise in time-varying environments, such as dynamic offloading and unstable energy harvesting in wireless communication systems [15,16]. The complexity of the solution increases significantly with the growing number of users and computing servers in the MEC network, making it difficult to apply at a large scale [17]. Deep reinforcement learning can achieve long-term optimization goals without a priori knowledge by continuously interacting with the dynamically changing environment at the edge of the MEC network, learning from observed state transitions and thereby enhancing network edge intelligence [18]. Therefore, existing research leverages artificial intelligence approaches in MEC, giving network edge intelligence techniques unique advantages in solving computational offloading and resource allocation problems for dynamic systems.
This paper addresses the joint optimization of computation offloading decisions, resource allocation, and energy harvesting in MEC systems. The goal is to ensure quality of service (QoS) for users while balancing energy consumption, latency, and task discard costs. Computational tasks generated by users are discarded when they are deemed intractable or do not satisfy the constraints, thus yielding task discard costs. The objective is to minimize the overall weighted sum of delay, energy consumption, and discard cost for the users in the MEC network system. We formulate the corresponding problem as a mixed-integer nonconvex optimization problem and first propose a value-based iterative reinforcement learning method, namely Q-learning, to obtain an optimal joint scheme for resource allocation and energy harvesting. However, as the number of users increases, the state and action spaces of the system, as well as the data dimensions, grow exponentially, which poses a huge challenge for Q-table maintenance [16]. To address this issue, we further propose a deep reinforcement learning-based method, A3C, which can effectively handle dynamic and complex problems. In summary, Q-learning and A3C form a hierarchical solution framework. The former validates the RL approach in simplified settings, while the latter addresses the scalability and complexity challenges of real-world MEC systems. This combination is not redundant but strategically complementary, ensuring both theoretical rigor and practical utility. It demonstrates a systematic progression from foundational RL techniques to advanced deep RL methodologies, tailored to the problem's dual needs of clarity in small-scale modeling and efficiency in large-scale deployment.
The main contributions of this paper can be summarized as follows:
  • We design an MEC network system based on energy harvesting, adopting nonlinear energy harvesting techniques. This system focuses on the study of wireless and computing resource allocation as well as offloading decision optimization in multi-user and multiple MEC server computing scenarios. Users can choose among local computing, offloading computing, and task discarding modes. The main problem is formulated as minimizing the total cost composed of the energy consumption, delay, and task discarding cost. Meanwhile, we jointly optimize bandwidth allocation, power allocation, computing resources, local CPU frequency, and energy harvesting.
  • Due to the time-varying nature of the user's energy harvesting in this EH-enabled MEC system, a Lyapunov-based architecture is employed to design a drift-plus-penalty function. This leads to a problem that minimizes the total cost formed by weighting the delay, energy consumption, and task discard cost.
  • Subsequently, we model the problem as an MDP by designing the state and action spaces and the reward function, and we propose a Q-learning approach to solve the resulting resource allocation and offloading optimization problem.
  • To address the prohibitive growth of the Q-table, we further propose an A3C-based method and design the corresponding A3C algorithm.

2. Related Work

Research on computational offloading techniques in MEC has attracted extensive attention from industry and academia. Related studies mainly focus on improving energy efficiency, optimizing resource allocation, and reducing latency. The main related works are listed in Table 1.

2.1. Computational Offloading and Resource Allocation Study

For latency- and resource-constrained mobile devices, an adaptive service offloading scheme is conceived for MEC with maximum revenue and maximum service utilization [32]. The authors in [19] studied the resource management optimization problem of offloading transmissions in an mmWave MEC system, considering the data transmission requirements of both communication-oriented and computing-oriented users. They transformed the multi-objective optimization problem (MOP) into a single-objective optimization problem (SOP) and proposed a three-stage iterative resource allocation algorithm under the premise of ensuring Pareto optimality. The authors in [20] designed a computation offloading scheme with wireless charging. This scheme optimizes data division, time allocation, transmission power, and energy beamforming using a new nested algorithm to minimize the total system energy consumption under power and delay constraints. The authors in [21] designed a device-to-device collaborative MEC-based system and proposed a user mobility-aware task offloading architecture. This architecture jointly addresses task allocation and power optimization under user mobility, distributed resources, task attributes, and user device energy constraints, with the goal of minimizing task latency. To resolve the trade-off between delay and energy efficiency, the authors in [22] studied an online offloading and resource allocation algorithm based on Lyapunov optimization theory that achieves an [ O(1/V), O(V) ] trade-off between energy efficiency (EE) and service delay.
However, these studies lack a comprehensive view on integrating energy harvesting with MEC in multi-user multi-server scenarios while handling multiple costs. Our paper fills this gap by designing an EH-based MEC system, using a Lyapunov architecture; we apply Q-learning and the A3C algorithm to optimize resource allocation and offloading decisions more effectively.

2.2. Energy Harvesting-Driven MEC Systems

Energy harvesting has been extensively studied as a technology that can continuously harvest various types of energy to support the continuous operation of user devices. EH has been integrated into MEC network systems to support user devices in long-term computational offloading. In contrast to the previous studies, the authors in [23] deploy energy harvesting devices on edge servers to form a hybrid green energy–grid supply model. In [9], the authors combine social network sensing and energy harvesting techniques and propose a dynamic computational offloading scheme that uses a queuing theory approach to model the energy consumption and delay performance of the system, and they utilize game theory to address the interactions within the designed community. In [24], the authors further proposed a task allocation strategy for an EH-based MEC system, constructing a task queue and using Lyapunov optimization to design an online dynamic task allocation algorithm that effectively balances device energy consumption and execution latency and enhances system performance. The researchers in [25] proposed an energy harvesting scheme in which devices first collect energy from the base station and then utilize the harvested energy for partial or full data transmission back to the same base station. By jointly optimizing the stopping rule for energy harvesting and the allocation of fading blocks during data transmission, they maximize the average achievable data rate. To solve the coupling among the causal energy constraint, the offloading ratio, and the allocation of resources in the computation offloading process, local and remote computation models and energy harvesting models are established in [26], and the Lyapunov method is used to find the optimal solution for multi-user partial offloading.
Prior work on EH in MEC includes diverse approaches like hybrid energy models and Lyapunov-based optimizations, but lacks comprehensive handling of multi-user multi-server scenarios with nonlinear EH while optimizing resources and minimizing total costs. Our paper bridges this gap, designing an EH-based MEC system with nonlinear EH and using Lyapunov architecture and Q-learning via MDP decisions to optimize resource and offload allocation and minimize total costs more effectively.

2.3. Reinforcement Learning in MEC

Although the above research works consider some resource allocation and optimization problems in EH-supported MEC computation offloading, the complexity of the problem grows rapidly when the numbers of users and servers in the MEC network system become large, which makes it worthwhile to study approaches that combine artificial intelligence methods with the MEC network system. In [27], a deep reinforcement learning framework based on an actor–critic learning structure was proposed because traditional numerical optimization methods have difficulty addressing the strong coupling between combinatorial offloading decisions and task execution. In [28], in an MEC system with blockchain support, an asynchronous learning method was used to solve the optimization problem of secure data sharing. In [29], the dynamic resource management problem of power control and computational resource allocation in industrial IoT MEC is investigated using deep reinforcement learning. The researchers in [30] proposed a deep reinforcement learning algorithm to jointly optimize the partial computation offloading strategy and channel resource allocation in a dynamic network environment with time-varying channels. In [33], a time-attentive deterministic policy gradient-based algorithm is proposed. This algorithm differs from traditional reinforcement learning methods and is used to jointly optimize the computational offloading and resource allocation problems in MEC. The researchers in [31] proposed a new framework that combines Lyapunov optimization and deep reinforcement learning to solve the coupled problem of binary offloading and system resource allocation decisions across different time frames in an MEC network system with time-varying channels and randomly arriving user tasks.
Previous MEC research with EH used AI methods for optimization. Yet, with large-scale users and servers, it fails to comprehensively integrate nonlinear EH, multi-user multi-server scenarios, and total cost minimization. Our study bridges this gap by designing an EH-based MEC system with nonlinear EH. Utilizing a Lyapunov-based architecture and Q-learning via MDP decisions, we optimize resource and offload allocation while effectively cutting the total cost, an area previously unaddressed.
Current work mainly focuses on static resource allocation or single-objective optimization but lacks joint optimization for dynamic energy harvesting coupled with high-dimensional decision making in multi-user multi-server scenarios. In addition, traditional model predictive control methods struggle to guarantee robustness in the face of time-varying channels and stochastic energy arrivals, while existing deep reinforcement learning methods do not effectively incorporate the Lyapunov optimization framework to address energy causality constraints [34,35]. In contrast, this study proposes, for the first time, a joint optimization framework combining hierarchical deep reinforcement learning with a Lyapunov drift-plus-penalty design, which significantly reduces the complexity of the high-dimensional action space by decomposing the long-term optimization into per-slot subproblems while ensuring energy queue stability.

3. System Model

Based on the above study, we propose a novel and more comprehensive study of resource allocation and offloading strategies based on the dynamic offloading of user tasks using energy harvesting techniques. The system model is shown in Figure 1 and consists of mobile devices, Macro Base Stations (MBSs), and MEC servers. In this system model, we assume that there are a number of MDs, denoted as N = {1, 2, …, i, …, N}, with i ∈ N. Each MD is equipped with an EH component, which harvests ambient energy from solar, wind, and other green energy sources. Meanwhile, MEC servers with powerful processing capabilities are integrated into the MBSs of the edge network. The set of MEC servers is defined as M = {1, 2, …, m, …, M}, with m ∈ M. Each mobile device is able to select the associated MBS based on distance and channel state information. Computational tasks can be offloaded to the MEC server for processing. The arrival of computational tasks follows a Bernoulli process [26], i.e., within each time interval t, a task of user i arrives with a given probability (denoted p_i = 1) and does not arrive otherwise (p_i = 0). The computational tasks of MDs can be processed locally. However, since the execution of computation-intensive applications is constrained by the computing capacity of MDs, it is feasible to offload the computational tasks to MEC servers for computing. As a result, the quality of service and user experience can be further improved [36]. A computational task can also be discarded when the channel state is unstable or when the task processing delay cannot be satisfied.
In this paper, we adopt a time slot model where the length of each time slot is τ and t ∈ {0, 1, …, T} denotes the set of time slots. We presume that the channel states within each time interval are independent of each other and remain stable, but vary over different time slots [36]. In time interval t, each mobile device has a task request, and we represent this computational task using a tuple A = <λ_i(t), τ_{i,d}(t)> consisting of two parameters. Here, λ_i(t) denotes the data size of the computation task arriving at time slot t. Without loss of generality, λ_i(t) < λ_i^{max}(t), where λ_i^{max}(t) is the maximum task size at time slot t, and τ_{i,d}(t) denotes the maximum delay allowed for completing the task. We focus on delay-sensitive applications whose execution duration is not greater than the time slot length, i.e., τ_{i,d}(t) ≤ τ. For ease of reference, we list the key notations of the proposed system in Table 2.

3.1. Computing Model

Each computation task can be performed locally on the mobile device or offloaded to an MEC server at the remote network edge for computing. Meanwhile, both computation methods may be infeasible when the mobile device has inadequate energy or the task processing latency cannot be satisfied; in this case, the computation task may be discarded. Each MD independently decides how to process its computational tasks. However, these decisions are not made in isolation: MDs can obtain certain information from the MEC servers via a control channel, including, but not limited to, the current computational load of the MEC server, the available bandwidth for offloading, and the expected delay for processing tasks on the server. Based on this information, MDs dynamically choose between local computation, offloading to the MEC server, or discarding the task. We define the computation task offloading decision vector of the proposed EH-enabled MEC system in time slot t as D_i(t) = {x_{i,l}(t), x_{i,m}(t), x_{i,d}(t)}, with each element taking a value in {0, 1}, where x_{i,l}(t) = 1 indicates that the computation task of user i is executed locally, x_{i,m}(t) = 1 indicates that the computation task will be offloaded to the MEC server for execution, and x_{i,d}(t) = 1 indicates that the computation task is discarded in time slot t. The computation offloading decision should satisfy the following constraint:
x i , l ( t ) + x i , m ( t ) + x i , d ( t ) = 1

3.1.1. Local Computation

In this study, we employ a dynamic voltage scaling technique to allocate computational resources when the user chooses to process the computational task locally. In practice, each MD has a different computational processing power and energy consumption per CPU cycle. We denote f_{i,l}(t) as the computing capacity of MD i at time slot t, with f_{i,l}(t) ≤ f_i^{max}. The delay of the local computing task can be expressed as
T_{i,l}(t) = c_i λ_i(t) / f_{i,l}(t)
where c i refers to the computation intensity in CPU cycles per bit [22].
Correspondingly, the energy consumption of locally calculating during time slot t is
E_{i,l}(t) = τ ω f_{i,l}^{3}(t)
where ω represents the effective switched capacitance depending on the chip architecture of the device itself.
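For concreteness, the following Python sketch evaluates the two expressions above for a single MD in one time slot; the function names and the numerical values are illustrative assumptions rather than parameters taken from the paper.

def local_delay(c_i, lam_i, f_il):
    # T_{i,l}(t) = c_i * lambda_i(t) / f_{i,l}(t): required CPU cycles divided by CPU frequency
    return c_i * lam_i / f_il

def local_energy(tau, omega, f_il):
    # E_{i,l}(t) = tau * omega * f_{i,l}(t)^3: dynamic-voltage-scaling energy over one slot
    return tau * omega * f_il ** 3

# Illustrative values: 1000 cycles/bit, a 5e5-bit task, a 1 GHz local CPU, a 1 s slot
print(local_delay(1e3, 5e5, 1e9))        # 0.5 s
print(local_energy(1.0, 1e-28, 1e9))     # 0.1 J with omega = 1e-28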

3.1.2. MEC Server Computation

When the computational tasks of MD i are offloaded to the associated MEC server m for processing, we assume that the computational capacity of the MEC server is limited and the maximum computational capacity is set to be F m .
T_{i,m}(t) = c_i λ_i(t) / f_{i,m}(t)
The energy consumption of computing using MEC server m can be given as [16]
E_{i,m}(t) = p_m c_i λ_i(t) / f_{i,m}(t)
where p m denotes the power of the MEC server when it processes the task.
In addition, when the computational task of MD i is processed on the MEC server, MD i is also waiting for the results. We assume that MD i is in standby mode during the waiting period, which will also consume a certain amount of energy. We set the power in the standby state as p i , i d ( t ) . Thus, the energy consumption in the standby mode can be expressed as
E_{i,id}(t) = p_{i,id}(t) c_i λ_i(t) / f_{i,m}(t)

3.2. Communication Model

Under the coverage of MBSs, the computation tasks of MDs are constrained by the local computational capacity, the surplus battery energy, and the maximum tolerated delay, which makes local computation difficult to execute. In this case, the task request can be offloaded to the associated MEC server for processing. During the process of offloading computation tasks, the coverages of MBSs will overlap, which will lead to signal interference. When the computation tasks of multiple MDs are offloaded to the same MBS simultaneously, we consider using dynamic channel allocation technology to allocate channel bandwidth according to the demands of the MDs. Then, the computation task offload rate between MD i and MBS m can be expressed as
r_{i,m}(t) = ω_{im}(t) B_{i,m} log_2( 1 + p_{i,m}(t) g_{i,m}(t) / ( I + ω_{im}(t) B_{i,m} H_{j,m}(t) ) ),
where H_{j,m}(t) = ∑_{j ∈ N, j ≠ i} p_{j,m}(t) g_{j,m}(t) represents the interference from other users j at MBS m, g_{i,m}(t) = d_{i,m}^{-σ}(t), where d_{i,m}(t) denotes the distance between MD i and MBS m, and the path loss factor is set as σ = 4.
The transmission delay and energy consumption of computation task offloading from MD i to MBS m can be shown, respectively, as
T_{i,m}^{r}(t) = λ_i(t) / r_{i,m}(t)
E_{i,m}^{r}(t) = p_{i,m}(t) λ_i(t) / r_{i,m}(t)
We express the overall computation delay and energy consumption of MD i in time slot t, respectively, as follows:
T_i(t) = x_{i,l}(t) T_{i,l}(t) + x_{i,m}(t) [ T_{i,m}^{c}(t) + T_{i,m}^{r}(t) ],
E_i(t) = x_{i,l}(t) E_{i,l}(t) + x_{i,m}(t) [ E_{i,m}^{r}(t) + E_{i,m}^{c}(t) + E_{i,id}(t) ].
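To make the communication model concrete, the short Python sketch below evaluates the offload rate expression above and combines the per-mode delay and energy terms into the totals T_i(t) and E_i(t); all argument names are illustrative, and the helper assumes the one-hot offloading decision of the constraint in Section 3.1.

import math

def offload_rate(w_im, B_im, p_im, g_im, I, H_jm):
    # r_{i,m}(t) = w*B*log2(1 + p*g / (I + w*B*H)): Shannon-type rate with interference
    return w_im * B_im * math.log2(1 + p_im * g_im / (I + w_im * B_im * H_jm))

def totals(x, T_l, E_l, T_mc, T_mr, E_mr, E_mc, E_id):
    # x = (x_l, x_m, x_d) is the one-hot offloading decision of an MD
    x_l, x_m, x_d = x
    T_i = x_l * T_l + x_m * (T_mc + T_mr)           # overall delay of MD i
    E_i = x_l * E_l + x_m * (E_mr + E_mc + E_id)    # overall energy of MD i
    return T_i, E_i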

3.3. Task Drop Model

When the remaining energy of an MD is insufficient to support the task generated in the current time interval t, either for local computation or for offloading to the edge server of the MBS, or when the channel state between the MD and the MBS is unstable during task offloading and deep channel fading makes successful offloading difficult, the computation task generated in the current time interval will be discarded. Since discarded tasks degrade the MDs' task processing, we impose a penalty on each discarded task. The penalty cost of an MD for a discarded computation task in time interval t is given as follows:
C i ( t ) = ψ i x i , d ( t ) λ i ( t ) τ
where ψ i denotes the penalty cost of each discarded task of MD i.

3.4. Energy Harvesting Model

Within the proposed MEC offloading network system model based on energy harvesting technology, MDs are able to capture renewable energy to support task execution. To better describe the energy harvesting process, we model it as a serial energy packet arrival process that follows a Poisson process with an average arrival rate e_i(t). Furthermore, e_i(t) ≤ e_i^{m}(t), where e_i^{m}(t) is referred to as the maximum energy arrival rate. The renewable energy process is stochastic and intermittent in nature. The energy packets are independently and identically distributed in different time slots. The harvested energy is stored in the battery of each MD and can be used for local computation or computation task offloading in the next time slot. We denote B_i(t) as the battery energy level at the beginning of time slot t. Without loss of generality, B_i(t) < ∞, ∀t ∈ T.
The total energy consumption should be no more than the battery level at time slot t and satisfy the following condition:
E_i(t) ≤ B_i(t) < ∞, ∀t ∈ T.
Thus, the battery energy level of MD i evolves as below:
B_i(t + 1) = B_i(t) - E_i(t) + e_i(t), ∀t ∈ T.
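The battery dynamics can be summarized by the small helper below, which enforces the energy causality condition before applying the update; the function name and the assertion style are illustrative choices rather than part of the paper's formulation.

def battery_update(B_t, E_t, e_t):
    # Energy causality: the energy spent in slot t may not exceed the current battery level
    assert E_t <= B_t, "decision violates the energy causality constraint"
    # B_i(t+1) = B_i(t) - E_i(t) + e_i(t)
    return B_t - E_t + e_t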

4. Problem Formulation

In this subsection, we consider the constraints of energy consumption, execution delay, queue stability, and quality of service. Based on these constraints, we formulate a weighted sum that aims to minimize the long-term average energy consumption, execution latency, and penalty cost of task dropping. At the beginning of each time slot t, each MD i needs to decide which computation method to employ for the arrived computation tasks, i.e., local computation, MEC server computation, or discarding. In order to minimize the overall weighted sum, we optimize the offloading decision x(t) = {x_{i,l}(t), x_{i,m}(t), x_{i,d}(t)}, the allocation of MD i's local computing resources f_{i,l}(t), the channel bandwidth allocation ω_{im}(t) between MD i and MBS m, the transmission power allocation p_{i,m}(t), the computing resources f_{i,m}(t) allocated by MEC server m for MD i, and the harvested energy e_i(t). To ensure the QoS of the proposed system, the minimum task transfer rate between MDs and MBSs is set as R_m. For simplicity, the optimization variable vector is defined as Ω(t) ≜ {x(t), f_{i,l}(t), f_{i,m}(t), p_i(t), ω_{im}(t), e_i(t)}. Therefore, the formulated problem is shown below:
P1:  min_{Ω(t)}  (1/T) ∑_{t=1}^{T} [ α_i E_i(t) + β_i T_i(t) + γ_i C_i(t) ]
s.t.
(15)  x_{i,l}(t) + x_{i,m}(t) + x_{i,d}(t) = 1,
(16)  ∑_{i=1}^{N} x_{i,m}(t) r_{i,m}(t) ≥ R_m,
(17)  f_{i,l}(t) ≤ f_i^{max},
(18)  p_{i,m}(t) ≤ p_i,
(19)  ∑_{m=1}^{M} x_{i,m}(t) f_{i,m}(t) ≤ F_m,
(20)  ∑_{m=1}^{M} x_{i,m}(t) ω_{im}(t) ≤ 1,
(21)  x_{i,l}(t) T_{i,l}(t) + x_{i,m}(t) [ T_{i,m}^{c}(t) + T_{i,m}^{r}(t) ] ≤ T_i^{max},
(22)  x_{i,l}(t) E_{i,l}(t) + x_{i,m}(t) [ E_{i,m}^{r}(t) + E_{i,id}(t) ] < B_i(t),
(23)  B_i(t + 1) = B_i(t) - E_i(t) + e_i(t), ∀i ∈ N, ∀t ∈ T,
where {α_i, β_i, γ_i} represents the weight parameters of energy consumption, delay, and discarding cost in the whole system, respectively. Constraint (15) represents the decision condition of the computing task: each MD can only choose one task processing method at the beginning of a time slot. Constraint (16) guarantees the QoS during computational task offloading, requiring the total task offloading rate of all MDs associated with MBS m to be no less than R_m. Constraints (17) and (18), respectively, bound the computational resources allocated for the local computation of MDs and the transmission power used for offloaded computation. Constraint (19) indicates that the computational resources allocated by MBS m to the associated MDs cannot exceed F_m. Constraint (20) indicates that the sum of the bandwidth allocation ratios of MBS m to MDs does not exceed 1. Constraint (21) signifies that the computational task of MD i must be completed within the maximum allowable delay T_i^{max}. Moreover, constraint (22) implies that in time slot t, the energy consumed by MD i for local computation or computation offloading is less than the current battery energy level B_i(t). Constraint (23) describes the evolution of the battery energy level in time slot t + 1.
However, energy constraint (22) couples the offloading decisions of MDs across time slots, which means the offloading decisions in different time slots are interrelated and thus complicates the optimization of the decision strategies. Similarly to [3,9], we introduce a nonzero lower bound E_i^{min}(t) and a valid upper bound E_i^{max}(t) as the minimum and maximum battery-charging energy in each time slot, respectively. Furthermore, to deal with the coupling caused by the energy constraint, we can eliminate the coupling effect generated by the battery in each time slot and optimize the system performance by relaxing the energy constraint. Thus, problem P1 can be modified as
P2:  min_{Ω(t)}  (1/T) ∑_{t=1}^{T} [ α_i E_i(t) + β_i T_i(t) + γ_i C_i(t) ]
s.t. (15)–(21), (23),
(24b)  E_i(t) ∈ {0} ∪ [ E_i^{min}(t), E_i^{max}(t) ].
Based on the investigation in [26], we find that, due to the energy causality constraint, when the residual battery energy of an MD is insufficient, the computation offloading decision in the current time slot will affect the offloading decision in the next time interval. As a result, coupled energy causality constraints exist between different time slots, which leads to complicated offloading decisions and resource allocation. Similarly to [3,36], we introduce a virtual energy queue and a perturbation parameter for the battery of each MD.
Definition 1.
To handle the energy causality constraint that couples decisions across slots, we introduce the virtual energy queue B̃_i(t) = B_i(t) - θ_i, where B̃_i(t) indicates the actual battery energy that the MD is able to consume.
Lemma 1.
θ_i is the perturbation parameter introduced for the battery of MD i, and its value satisfies the following boundary condition:
θ_i ≥ Ẽ_i^{max} + V ψ_i / Ẽ_i^{min}
We consider a model where V (0 < V < ∞) is a nonnegative control parameter (in joules, J) used to weigh the importance of the energy consumption for local and offloading computation. Ẽ_i^{max} represents the maximum energy consumption:
Ẽ_i^{max} = min{ max{ E_{i,l}^{p}(t), E_{i,m}^{r}(t) + E_{i,m}^{c}(t) }, E_i^{max}(t) }
In time slot t, we define the virtual energy queue vector of the system as B̃(t) ≜ [ B̃_1(t), B̃_2(t), …, B̃_N(t) ], and it is necessary to ensure the stability of the virtual queue. So, we define a quadratic Lyapunov function on the virtual queues as shown below:
L(B̃(t)) = (1/2) ∑_{i=1}^{N} B̃_i(t)^2,   L(B̃(t + 1)) = (1/2) ∑_{i=1}^{N} B̃_i(t + 1)^2
Then, we can introduce a one-step conditional Lyapunov cost function that pushes the quadratic Lyapunov function to a bounded level and obtains a stable virtual queue:
Δ(B̃(t)) ≜ E{ L(B̃(t + 1)) - L(B̃(t)) | B̃(t) }.
The drift-plus-penalty function can be defined as
Δ_V(B̃(t)) = Δ(B̃(t)) + V E[ G(t) | B̃(t) ],
where V ( V > 0 ) is a control parameter.
Parameter V is used in the Lyapunov framework to balance the trade-off between queue stability and optimization objectives. Specifically, a larger value of V will be more inclined to minimize the long-term average cost but may sacrifice queue stability, while a smaller value of V prioritizes the stability of the virtual energy queue but may reduce the convergence efficiency of the optimization objective. Adjusting V, the system can achieve a flexible trade-off between performance and dynamic resource management.
Lemma 2.
For any feasible decision Ω(t) at any time slot, the drift-plus-penalty function satisfies the following condition:
Δ_V(B̃(t)) ≤ V E{ ∑_{i=1}^{N} [ α_i E_i(t) + β_i T_i(t) + γ_i C_i(t) ] | B̃(t) } + C + E{ ∑_{i=1}^{N} B̃_i(t) ( e_i(t) - E_i(t) ) | B̃(t) }
where C = (1/2) ∑_{i=1}^{N} [ E_i^{max}(t)^2 + (Ẽ_i^{max})^2 ].
Proof. 
From Equation (27), we can obtain that
L(B̃(t + 1)) - L(B̃(t)) = (1/2) ∑_{i=1}^{N} [ ( B̃_i(t) - E_i(t) + e_i(t) )^2 - B̃_i(t)^2 ] ≤ (1/2) ∑_{i=1}^{N} [ (E_i^{max})^2 + (Ẽ_i^{max})^2 ] + ∑_{i=1}^{N} B̃_i(t) ( e_i(t) - E_i(t) ),
and let C = (1/2) ∑_{i=1}^{N} [ E_i^{max}(t)^2 + (Ẽ_i^{max})^2 ].    □
According to the Lyapunov optimization method, we can convert the objective problem into an upper bound function optimized mainly by Equation (30). Based on this, problem P2 can be transformed into a series of optimization problems for each time interval. In these intervals, the battery capacity level of MD i can be continuously maintained stable under the influence of perturbation parameter θ i . Here, we neglect the constant term in the upper bound of the drift-plus-penalty function. By doing so, we can transform long-term average optimization problem P2 into a series of optimization problems per time slot, which simplifies the problem-solving process and makes it more tractable. The new problem is shown as problem P3 below:
P3:  min_{Ω(t)}  V G(t) + ∑_{i=1}^{N} B̃_i(t) [ e_i(t) - E_i(t) ]
s.t. (15)–(21), (23), (24b)
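The per-slot decomposition can be read as the following control loop: at every slot the controller only needs the current virtual queues and the observed channel/task/energy state, solves P3 with any per-slot solver (the Q-learning and A3C agents of Section 5 in this paper), and then lets the batteries evolve. The sketch below is a schematic outline under that assumption; env and solve_p3 are hypothetical placeholders, not components defined in the paper.

def online_control_loop(T, theta, solve_p3, env):
    B = env.initial_battery()                         # B_i(0) for every MD
    for t in range(T):
        B_tilde = [b - th for b, th in zip(B, theta)] # virtual queues B~_i(t) = B_i(t) - theta_i
        state = env.observe(t)                        # g(t), F, lambda(t), H(t), e(t)
        decision = solve_p3(state, B_tilde)           # minimize V*G(t) + sum_i B~_i(t)(e_i(t) - E_i(t))
        E_used, e_harvested = env.apply(decision, t)  # execute offloading / local computing / dropping
        B = [b - Eu + eh for b, Eu, eh in zip(B, E_used, e_harvested)]  # battery evolution
    return B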

5. Problem Solution

In this section, we investigate how to solve problem P3, which involves the optimization of energy harvesting, the offloading decision strategy, and resource allocation. Our goal is to obtain the optimal offloading decision, the allocation of transmission power, the allocation of bandwidth between MDs and MBSs, and the allocation of computing resources for both local computation and the MEC servers. Problem P3 is a mixed-integer optimization problem, since the computation offloading decision is an integer variable while the other optimization variables are continuous; we therefore employ a deep reinforcement learning-based approach to address it. We formulate the system as a discrete Markov decision process (MDP). An MDP is a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision-maker. Here, we aim to maximize the system payoff through this MDP model. Since the state transition probabilities and system rewards cannot be predicted in advance, we propose a model-free DRL-based scheme to solve the MDP problem. We adopt a four-tuple <S, A, P, R> to represent it, where S represents the state space, A denotes the action space, P indicates the state transition probability, and R refers to the system reward.

5.1. MDP Framework

Nearly all decision-controlled programming problems can be characterized as MDPs [37]. In general, an MDP can be solved using linear programming or dynamic programming methods. Within the MDP framework, it is important to note the information interaction between MDs and MEC servers. MDs, as agents, can obtain state information from the environment, which includes not only local parameters such as battery level and computation capacity but also remote parameters provided by MEC servers via a control channel. This control channel facilitates the sharing of the critical information needed for informed offloading and resource allocation decisions. Herein, we use RL-based methods to deal with these challenging problems. We first enumerate the critical elements of the MDP as follows:
  • State: In the current time slot t, we define the state of the system as s(t) ∈ S, including the channel gains g(t) = {g_{1,1}(t), …, g_{i,m}(t), …, g_{N,M}(t)} between MDs and MBSs; the available MEC server computing resources F = {F_1, …, F_m, …, F_M}; the data size of each arriving task, λ(t) = {λ_1(t), …, λ_i(t), …, λ_N(t)}; the channel interference received between MDs and MBSs, H(t) = {H_{1,1}(t), …, H_{i,m}(t), …, H_{N,M}(t)}; and the harvested energy, E(t) = {e_1(t), e_2(t), …, e_i(t), …, e_N(t)}. Thus, we describe the system state vector at time slot t as
    s ( t ) = ( g ( t ) , F , λ ( t ) , H ( t ) , E ( t ) )
  • Action: Each agent needs to independently take an action based on the current state s(t) in time slot t. The action is defined as a_i(t) ∈ A and includes computational resource allocation, wireless resource allocation, offloading decisions, and energy harvesting, i.e., the available computing capacity f_{i,l}(t) of the MD, the computational resource allocation of the MEC server f_{i,m}(t), the transmission power of the MD p_{i,m}(t), the bandwidth allocation between MD i and MBS m, ω_{i,m}(t), the harvested energy e_i(t), and the computation offloading decision x(t) = {x_{i,l}(t), x_{i,m}(t), x_{i,d}(t)}. Hence, we define the action as a tuple:
    A = ( f i , l ( t ) , f i , m ( t ) , p i , m ( t ) , ω i m ( t ) , e i ( t ) , x ( t ) )
  • Reward: After taking a feasible action, MDs can obtain a corresponding reward r ( t ) R for the action a i ( t ) taken from the environment based on the current state s ( t ) . In this investigation, we focus on formulating a multiobjective optimization problem between the task completion delay, overall energy consumption, and the cost of task discard. Then, we transform the primal problem into an optimization problem based on the Lyapunov framework by optimizing the upper bound of the Lyapunov drift-plus-penalty function to solve the corresponding resource allocation, energy harvesting, and offloading decision schemes. It is notable that the reward value of each agent should fulfill the constraints so as to guarantee the validity of the computing results. The reward function is shown below:
    r(t) = Φ(t), if the constraints are satisfied; r(t) = 0, otherwise,
    where Φ(t) = V G(t) + ∑_{i=1}^{N} B̃_i(t) [ e_i(t) - E_i(t) ].
Here, V G(t) denotes the instantaneous cost weighted through the Lyapunov framework, which includes factors such as the task completion delay and overall energy consumption; it guides the algorithm to focus on the immediate performance of the system during optimization. The term ∑_{i=1}^{N} B̃_i(t) [ e_i(t) - E_i(t) ] is related to the virtual energy queue and balances the energy consumption across different time slots to ensure the long-term stable operation of the system. In MEC and energy harvesting environments, such a reward design motivates the agent to satisfy the task requirements while rationally utilizing energy so as to minimize the total system cost.
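A compact sketch of this reward is given below. It gates the per-slot Lyapunov objective Φ(t) on constraint feasibility as described above; the negation of Φ(t) is our assumption so that a reward-maximizing agent minimizes the cost, and the argument names are illustrative.

def reward(constraints_ok, V, G_t, B_tilde, e, E):
    if not constraints_ok:
        return 0.0                                    # infeasible action: no reward
    # Phi(t) = V*G(t) + sum_i B~_i(t) * (e_i(t) - E_i(t))
    phi = V * G_t + sum(b * (ei - Ei) for b, ei, Ei in zip(B_tilde, e, E))
    return -phi                                       # assumed sign: maximizing reward minimizes Phi(t)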
In accordance with the MDP model given above, the optimization problem can be transformed into maximizing the long-term cumulative reward obtained by the MDs through an optimal control strategy. We define the control policy of the proposed system as π : S → A, which indicates a mapping from state s(t) to action a(t). Each agent determines an action a(t) = {f_{i,l}(t), f_{i,m}(t), p_{i,m}(t), ω_{im}(t), e_i(t)} ∈ A by following a strategy π based on the observed environmental state s(t) = {g(t), F, λ(t), H(t)} ∈ S.
In this study, the action contains discrete decisions (offloading patterns) and continuous decisions (resource allocation). For this hybrid action, we use the following approach to deal with it.
For the discrete offloading mode decisions, the output of the policy network passes through a Softmax function to obtain the probability distribution π_θ(x|s) over the offloading modes, where x denotes the offloading mode, s denotes the current state, and θ is a parameter of the policy network. At each time step, the specific offloading mode decision is obtained by sampling from this probability distribution.
For continuous resource allocation actions (f_{i,l}(t), f_{i,m}(t), p_{i,m}(t), ω_{im}(t), e_i(t)), the policy network outputs the mean μ and standard deviation σ of the continuous actions and then samples from a Gaussian distribution N(μ, σ^2) to obtain specific continuous action values. To ensure that the action values are within a reasonable range, we clip the sampled action values, i.e., a = clip(a, a_{min}, a_{max}), where a is the sampled action value, and a_{min} and a_{max} are the minimum and maximum values of the action, respectively.
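The hybrid discrete/continuous action handling described above can be sketched in PyTorch as follows. The class and layer names are illustrative, and the state-independent learned log-standard-deviation is an assumption; the paper only states that the policy network outputs a mean and a standard deviation.

import torch
import torch.nn as nn

class HybridPolicyHead(nn.Module):
    def __init__(self, hidden_dim, n_modes, n_cont):
        super().__init__()
        self.mode_logits = nn.Linear(hidden_dim, n_modes)  # logits over offloading modes
        self.mu = nn.Linear(hidden_dim, n_cont)            # means of continuous allocations
        self.log_std = nn.Parameter(torch.zeros(n_cont))   # assumed state-independent std

    def forward(self, h, a_min, a_max):
        mode_probs = torch.softmax(self.mode_logits(h), dim=-1)
        mode = torch.distributions.Categorical(mode_probs).sample()      # discrete decision
        cont = torch.distributions.Normal(self.mu(h), self.log_std.exp()).sample()
        cont = torch.clamp(cont, a_min, a_max)             # a = clip(a, a_min, a_max)
        return mode, cont, mode_probs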

5.2. Q-Learning Algorithm

Q-learning is considered a model-free form of reinforcement learning and can be seen as an asynchronous dynamic programming methodology [38]. The Q-learning algorithm employs a value iteration method to replace the Q-value, and then the optimal action policy is obtained based on the computed Q-value. The Q-learning algorithm contains an agent, action, state and reward. Specifically, in the implementation of Q-learning, the agent first selects an action a ( t ) , constructs a state–action pair ( s ( t ) , a ( t ) ) based on state s ( t ) , and calculates the Q-value under the current state–action pair based on the Q-function. Subsequently, the agent will receive a reward from the current environment and then the agent transfers the state from s ( t ) to s ( t + 1 ) . The agent computes the Q-value and updates the Q-table continuously, based on the maximum Q-value in each state–action pair so as to determine the optimal policy.
Herein, we design a Q-learning-based approach to solve the MDP decision problem. The Q-value computed from the state–action pairs in each time slot, which is estimated using the Q-learning method, will be stored in the Q-table, and then the Q-table is updated with the value iteration method. The Q-value can be expressed as
Q(s(t), a(t)) = (1 - α) Q(s(t), a(t)) + α [ R(s(t), a(t)) + γ max_{a(t+1)} Q(s(t + 1), a(t + 1)) ]
Q(s(t), a(t)) denotes the Q-value calculated from the state–action pair in time slot t, where 0 < α < 1 indicates the learning rate, R(s(t), a(t)) denotes the reward obtained by the agent for the state–action pair, and γ is a discount factor.
Each agent gradually learns to update its actions through continuous observation of the system environment so as to obtain the optimal offloading decisions and resource allocation, and this process minimizes the overall cost of the system without requiring information about other users. Each agent develops a corresponding strategy through continuous interactive learning and takes the computed Q-value as the long-term cumulative reward. Initially, in the implementation of the Q-learning algorithm (Algorithm 1), the Q-values of the state–action pairs in the Q-matrix are sparsely and randomly initialized, and the values in the Q-matrix accumulate as the learning proceeds. In addition, we use a χ-greedy policy for action selection in order to achieve the exploration–exploitation trade-off during the Q-learning process: the agent selects the action maximizing the Q-value with probability 1 - χ and a random action with probability χ. The χ-greedy rule is expressed as follows:
a(t) = random action, with probability χ;   a(t) = arg max_a Q(s(t), a), with probability 1 - χ.
Each agent receives the corresponding reward r(s, a) after performing action a(t) and then transitions to state s(t + 1). The Q-value matrix of the state–action pairs is updated according to Equation (37).
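As a minimal illustration of the χ-greedy selection and the update in Equation (37), the following tabular Q-learning sketch uses a dictionary-backed Q-table; the data structures and function names are illustrative.

import random
from collections import defaultdict

Q = defaultdict(float)   # Q-table over (state, action) pairs, zero-initialized

def choose_action(state, actions, chi):
    # chi-greedy: explore with probability chi, otherwise exploit the best known action
    if random.random() < chi:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions, alpha, gamma):
    # Eq. (37): Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s', a') ]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)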
The time complexity of the Q-learning algorithm (Algorithm 1) is jointly determined by the state space, the action space, and the number of training rounds. The time complexity is O(E · T_max · |S| · |A|), where |S| is the number of states, |A| is the number of actions, E is the number of training rounds, and T_max is the maximum number of time steps per round. This complexity grows exponentially with the dimensionality of the state and action spaces.
Algorithm 1 Computational offloading and resource allocation algorithm based on Q-learning
1: Input: state space S, action space A, learning rate α, discount factor γ.
2: Output: Q-value for each state–action pair.
3: Initialization: set Q(s, a) arbitrarily for all s ∈ S, a ∈ A.
4: for each episode do
5:     Initialize state s(t);
6:     while T_episode ≤ T_max do
7:         Draw a random number Φ for the χ-greedy selection in the current state s(t);
8:         if Φ ≤ ε then
9:             Randomly select an action a(t);
10:        else
11:            Set a(t) = arg max_a Q(s(t), a);
12:        end if
13:        Perform action a(t), obtain the reward r(t), and observe the next state s(t + 1);
14:        Update Q(s(t), a(t)) = Q(s(t), a(t)) + α [ r(s(t), a(t)) + γ max Q(s(t + 1), a(t + 1)) - Q(s(t), a(t)) ];
15:        Set s(t) = s(t + 1);
16:        T_episode = T_episode + 1;
17:    end while
18: end for

5.3. A3C Algorithm

Although the best policy can be obtained by constantly updating and recording the Q-values in the Q-table, it is difficult to effectively maintain the Q-table when the space of the feasible action states of the agent becomes larger and larger. This will increase the dimensionality of the Q-table data matrix and pose a great challenge to Q-value data storage, search, and renewal.
The A3C algorithm is one of the DRL algorithms that are simpler, faster, and more robust than other DRL algorithms such as actor–critic learning algorithms and deep deterministic policy gradient (DDPG) algorithms [39]. A3C leverages the value-based and policy-based methods for both discrete and continuous actions [40,41]. Using multiple CPU threads on a single machine, A3C can efficiently obtain optimal policies using an asynchronous actor learner.
As depicted in Algorithm 2, the A3C algorithm consists of multiple actor learners and a global network. In particular, asynchronous multithreading is employed to implement multiple actors. All training weights obtained during the process are stored in the global network. At the beginning of training, the global network transmits the relevant parameters to these actor learners simultaneously. The parameters trained by the actor learners are uploaded to the global network. After a period of time, the global network sends updated parameters to the actor–critic to ensure that they can share a common strategy.
The A3C algorithm maintains a policy π(a_t|s_t; θ), which gives a distribution over actions a_t, and a value function V(s_t; θ_v), which evaluates how good a given state is. The policy and the value function are parameterized by θ and θ_v, respectively, and are approximated using a single convolutional neural network: the policy estimate is output by a softmax layer and the value estimate by a linear layer. The agents in the A3C algorithm update the policy using the advantage function and update the value function using the estimated return.
The environment shifts from state s_t to state s_{t+1} with a certain probability, and each agent receives a reward r_t. The state value function of A3C can be defined as
V(s_t; θ_v) = E[ G_t | s = s_t, π ] = E[ ∑_{k=0}^{∞} γ^k r(t + k) | s = s_t, π ]
where G_t = ∑_{k=0}^{∞} γ^k r(t + k) denotes the discounted return, i.e., the cumulative discounted reward starting from state s_t, and γ ∈ [0, 1] is the discount factor, which indicates how future rewards affect the current state value.
Using the k-step return, the discounted return is given by
R_t(θ_v) = ∑_{i=0}^{k-1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v),
where k is bounded above by the maximum value t_max, r_{t+i} is the immediate reward, and γ ∈ (0, 1] is the discount factor.
Based on Equation (40), the advantage function can be written as
A(a_t, s_t; θ, θ_v) = R_t(θ_v) - V(s_t; θ_v)
The policy loss function is defined as
f_π(θ) = log π(a_t | s_t; θ) ( R_t - V(s_t; θ_v) ) + β H(π(s_t; θ))
H ( π ( s t ; θ ) ) is an entropy term utilized to encourage exploration during the training process so as to prevent possible premature convergence. Parameter β is used to control the strength of the regularization of the entropy, which facilitates the trade-off between exploration and exploitation.
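In an implementation, the advantage, policy loss, and value loss above are usually combined into a single differentiable objective; the PyTorch sketch below shows one common way to do so (the sign is flipped because optimizers minimize), with beta as the entropy coefficient. This is a sketch of standard A3C practice, not code taken from the paper.

import torch

def a3c_losses(log_prob, value, R, entropy, beta):
    # advantage A = R_t - V(s_t); detached in the policy term so it only drives the actor
    advantage = R - value
    policy_loss = -(log_prob * advantage.detach() + beta * entropy)   # maximize logpi*A + beta*H
    value_loss = advantage.pow(2)                                     # (R_t - V(s_t))^2 for the critic
    return policy_loss.mean(), value_loss.mean()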
To update the actor, we take the partial derivative of the policy loss f_π(θ) with respect to θ, which yields the following equation:
∇_θ f_π(θ) = ∇_θ log π(a_t | s_t; θ) ( R_t - V(s_t; θ_v) ) + β ∇_θ H(π(s_t; θ))
The actor can be updated using the following equation:
dθ ← dθ + ∇_θ log π(a_t | s_t; θ) ( R_t - V(s_t; θ_v) ) + β ∇_θ H(π(s_t; θ))
Similarly, taking the partial derivative of the value loss f_v(θ_v) with respect to θ_v yields the following equation:
∇_{θ_v} f_v(θ_v) = 2 ( R_t - V(s_t; θ_v) ) ∇_{θ_v} V(s_t; θ_v)
The critics are updated on the basis of the following cumulative gradient:
dθ_v ← dθ_v + 2 ( R_t - V(s_t; θ_v) ) ∇_{θ_v} V(s_t; θ_v)
We use the RMSProp algorithm to minimize the loss function, and the gradient estimate is calculated as
g = α g + (1 - α) Δθ^2,
where α denotes the momentum, and Δθ denotes the accumulated gradient of the loss function.
The parameters are then updated based on the following RMSProp gradient descent step:
θ ← θ - η Δθ / √(g + ε)
where η denotes the learning rate, and ε is a small positive value.
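A manual RMSProp step matching the two update rules above can be written as follows; in practice, torch.optim.RMSprop provides the same behavior, and the default values shown here are assumptions.

import torch

def rmsprop_step(param, grad, g, lr=1e-4, alpha=0.9, eps=1e-8):
    # g <- alpha * g + (1 - alpha) * grad^2
    g.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
    # theta <- theta - lr * grad / sqrt(g + eps)
    param.data.addcdiv_(grad, g.add(eps).sqrt(), value=-lr)
    return g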
Algorithm 2 presents the A3C-based joint optimization of the computation offloading decision, energy harvesting time allocation, and mode selection. The time complexity of A3C depends on three key factors: the number of parallel threads N, the number of time steps per update T, and the computational complexity of the neural network. The overall time complexity can be expressed as O(N · T · d), where d denotes the per-step computational cost of the network.
Algorithm 2 Joint optimization of the computation offloading decision, energy harvesting time allocation, and mode selection
1: Initialize: Set the global network parameters of the actor network and the critic network as θ and θ_v, respectively.
2: Set the thread-specific (local) network parameters of the actor network and the critic network as θ′ and θ′_v, respectively.
3: Set T = 0, t = 1, and the hyperparameters η, T_max, t_max.
4: Iteration:
5: while T < T_max do
6:     for w = 1 to W do
7:         Synchronize the local parameters: θ′ = θ, θ′_v = θ_v;
8:         Obtain the system state s(t);
9:         for t < t_max do
10:            Perform the action a(t) under the policy π(a_t | s_t; θ′);
11:            Get the reward r(t) and the new state s(t + 1);
12:        end for
13:        Set R = 0 for a terminal state and R = V(s_t, θ′_v) for a non-terminal state;
14:        for t = t_max, …, 1 do
15:            R = r(t) + γ R;
16:            Accumulate the gradient with respect to θ′ based on (44);
17:            Accumulate the gradient with respect to θ′_v based on (46);
18:        end for
19:        Update θ and θ_v based on (47);
20:        T = T + 1;
21:    end for
22: end while

5.4. A3C Algorithm Core Parameter Settings and the Training Process

In this study, we designed a neural network architecture with specific configurations for the actor and critic networks. The actor network has an input layer whose size corresponds to the state-space dimension (S = |g, F, λ, H| = 512). It consists of two hidden layers, each containing 256 neurons with the ReLU activation function. The output layer represents the action-space dimension, where continuous actions pertain to resource allocation parameters and discrete actions refer to offloading decisions. The probabilities of the discrete actions are produced by a Softmax function.
The critic network shares the first two hidden layers with the actor network. Its output layer is a scalar value function that evaluates the state value. Regarding hyperparameters, we set the learning rate η = 1 × 10^{-4}, utilizing the RMSProp optimizer with a momentum parameter α = 0.9. The discount factor γ = 0.99 is employed to balance short-term and long-term rewards. An entropy regularization parameter β = 0.01 is introduced to encourage policy exploration and prevent premature convergence. To avoid gradient explosion and stabilize the training process, a gradient clipping parameter of clip_norm = 0.5 is used. We use W = 8 asynchronous threads to explore different environmental samples in parallel, thereby accelerating the training process. The training consists of T_max = 5000 epochs, with each epoch containing 1000 time slot iterations.
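Under the configuration just described, a compact PyTorch definition of the shared actor–critic network could look as follows; the class name, the assumption of three offloading modes, the five continuous allocation variables, and the optimizer line at the end are illustrative choices, not code from the paper.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=512, hidden=256, n_modes=3, n_cont=5):
        super().__init__()
        self.shared = nn.Sequential(                    # two shared 256-unit ReLU layers
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mode_head = nn.Linear(hidden, n_modes)     # x_{i,l}, x_{i,m}, x_{i,d}
        self.mu_head = nn.Linear(hidden, n_cont)        # f_l, f_m, p, omega, e
        self.log_std = nn.Parameter(torch.zeros(n_cont))
        self.value_head = nn.Linear(hidden, 1)          # critic V(s)

    def forward(self, s):
        h = self.shared(s)
        mode_probs = torch.softmax(self.mode_head(h), dim=-1)
        return mode_probs, self.mu_head(h), self.log_std.exp(), self.value_head(h)

net = ActorCritic()
opt = torch.optim.RMSprop(net.parameters(), lr=1e-4, alpha=0.9)  # alpha is the RMSProp smoothing constant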
Our implementation framework is based on PyTorch 2.0, and we refrain from relying on third-party libraries such as Stable Baselines to ensure that the algorithm logic is fully consistent with the research description. The key modules include environmental interaction, where each asynchronous thread independently simulates the dynamic behavior of the MEC system, including energy harvesting, channel variations, and task arrivals, to generate state–action–reward samples. Additionally, for parameter synchronization, the global network aggregates the gradient updates from each thread every 50 steps to avoid parameter divergence during asynchronous training. The selection of parameters is well founded. The neural network structure draws inspiration from the successful configurations of the classic A3C algorithm in continuous control tasks. We performed ablation experiments to verify the impact of the dimensions of the hidden layer on the convergence speed. The gradient clipping and entropy regularization parameters were determined through cross-validation to strike a balance between exploration efficiency and policy stability. The number of asynchronous threads was set to 8, which effectively balances computational resource utilization and training stability, as an excessive number of threads may lead to parameter update conflicts.
In the training process of the A3C algorithm, we adopted a dynamic learning rate strategy: the initial learning rate was set to 0.001 and decayed by a factor of 0.95 every 100 training rounds. This schedule helps the algorithm explore the environment quickly in the early stages of training and converge more accurately to the optimal policy in later stages. We set the batch size to 64. The batch size has a significant impact on training effectiveness: larger batches make parameter updates more stable but increase memory requirements and training time, whereas smaller batches speed up training but may cause larger fluctuations during training. The A3C algorithm uses an asynchronous training mechanism to improve training efficiency; we set up eight worker threads, each learning in an independent instance of the environment. These worker threads interact with the environment in parallel to collect experience samples and update the global neural network parameters. The experiments were run on an NVIDIA GeForce RTX 3090 GPU with an Intel Core i9-12900K CPU, with PyTorch as the deep learning framework. This hardware configuration meets the computational requirements of large-scale A3C training, and we allocated and optimized the hardware resources carefully to avoid bottlenecks and to ensure the accuracy and reliability of the experimental results.
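The decay schedule described in this paragraph maps directly onto PyTorch's built-in StepLR scheduler; the sketch below is a minimal illustration with a placeholder network and optimizer rather than the full training loop.

import torch
import torch.nn as nn

model = nn.Linear(512, 10)                                     # placeholder network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)   # initial learning rate 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)

for episode in range(5000):                                    # T_max = 5000 training rounds
    # ... one training round of 1000 time-slot iterations, with optimizer.step() calls ...
    scheduler.step()                                           # lr *= 0.95 every 100 rounds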

6. Results

In this study, simulation experiments were conducted to verify the feasibility and superiority of the proposed method. The simulation parameters are listed in Table 3.
Figure 2 depicts the average task computation cost as a function of the distance between the user and the MBS for local and edge-server computing, under different task arrival rates and weighting factors. As Figure 2 shows, the cost of processing tasks on the user device remains constant as the distance increases, whereas the total processing cost of offloading to the edge server grows continuously with distance, since offloading incurs task transmission energy, server computation energy, and the idle-state energy of the user device while the tasks are being processed. When the user is close to the base station, offloading to the edge server costs less than local computation because the edge server at the base station has powerful computational resources. As the distance increases, however, the user's tasks consume more energy during offloading, making offloading more expensive than local processing. In addition, a higher task arrival rate means more computational tasks to process and therefore a higher cost for the user. For the same task arrival rate, as the delay weighting factor decreases and the energy-consumption weighting factor increases, the total task cost decreases in both the local and offloaded computation modes, which indicates that continuous optimization is required to reduce costs in scenarios with stringent latency requirements.
Figure 3 analyzes the task discard ratio of the different computation offloading modes with respect to the distance between the user and the base station. As shown in Figure 3, the task discard ratio of local computing does not change with this distance; in the local computing mode it depends on the computing capability of the user device, the task delay, the energy consumption, and the battery energy. When the user is close to the base station, the discard ratio of local computing is higher than that of the other offloading modes, because at short distances the transmission energy and delay required by offloading are low, making offloading more efficient and cost-effective than local computation. As the distance increases, however, changes in the channel state during offloading lead to greater path loss and interference, and the delay and energy consumption of task offloading also increase, which may make the user's computational tasks impossible to complete and therefore cause them to be discarded. Compared with edge computing and the dynamic offloading mode, the Q-learning-based strategy proposed in this study selects states and actions according to factors such as the offloading process, the transmission distance of the user's tasks, and the channel state information, maximizing the reward of the user device and reducing the proportion of tasks discarded due to long-distance transmission.
Figure 4 and Figure 5 depict the relationship between the energy harvesting power and the task discard percentage of user devices that employ energy harvesting.
As shown in Figure 4, an increase in energy harvesting power leads to a decrease in the cost of processing user computational tasks, because a higher power enables the user device to gather more energy per unit time and renewable energy incurs no execution cost. When the energy harvesting power is low, the limited energy is hard to utilize fully, resulting in a high task discard rate. As the harvested energy increases, the discard rate declines, but only slowly, because the Q-learning algorithm cannot quickly adapt to the complex state changes brought about by energy variations.
In contrast, the A3C algorithm, with its well-crafted neural network architecture, can rapidly respond to fluctuations in the energy harvesting power. Its input layer takes in state information such as the energy harvesting power and the task data volume, the hidden layers extract deep features, and the output layer generates appropriate action strategies. When energy is scarce, A3C prioritizes critical tasks and allocates energy rationally, maintaining a relatively low discard rate. When energy is abundant, it flexibly uses the surplus energy to further reduce the discard rate, demonstrating excellent adaptability and decision-making capabilities.
Figure 5 reveals that, similarly to the task execution cost, the proportion of discarded tasks for different user-task computation offloading modes drops as the energy harvesting power rises. A higher power means the user device can obtain more renewable energy, providing more energy for task computation and offloading, effectively reducing the discard proportion caused by energy shortages.
The A3C algorithm’s asynchronous multi-threaded architecture allows it to efficiently process high-dimensional state information. It can dynamically adjust policies and allocate resources based on the task arrival rate, keeping the average latency low even at high arrival rates. Overall, A3C has a lower average latency than Q-learning across all task arrival rates, with a more pronounced advantage at high rates, highlighting its superiority in handling dynamic loads.
Figure 6 shows the relationship between the user's computational task arrival rate and the task discard ratio. As Figure 6 shows, the task discard ratio increases with the task arrival rate. When local computing is selected, the user device has limited computing resources and energy, so as the task arrival rate increases, a larger percentage of computational tasks is discarded. Compared with local computing, the offloading mode discards a smaller percentage of tasks because it has more powerful communication and computational resources. The strategy proposed in this study keeps the task discard ratio lower as the task arrival rate increases, because Q-learning dynamically adjusts the resource allocation and offloading strategy according to the surrounding environment.
The A3C algorithm dynamically adjusts the computational resource allocation and task offloading strategy according to the number of users in the experiment. When the number of users is small, it makes reasonable use of local resources to reduce energy consumption; when the number of users increases, it finds a better balance between local computation and offloading, so that the system energy consumption increases only gradually.
From Figure 6, it can also be seen that the A3C algorithm consumes less energy than the Q-learning algorithm for different numbers of users. In particular, once the number of users reaches a certain scale, the advantage of the A3C algorithm in controlling energy consumption becomes obvious, which verifies its effectiveness and superiority in large-scale user scenarios: it controls energy consumption more effectively while guaranteeing system performance.
Figure 7 and Figure 8 examine the cumulative reward and the number of steps of the proposed A3C algorithm over different training rounds. As shown in Figure 7, the cumulative reward per round increases as training proceeds and finally converges to a relatively stable value, indicating good convergence of the algorithm. Figure 8 shows that, as the training rounds increase, the number of steps the A3C algorithm needs to reach the target decreases and finally converges to a stable value, indicating that the optimal strategy has been reached.

7. Conclusions

In this paper, we consider an MEC offloading network system with energy harvesting support. Time is divided into slots during which each mobile device can dynamically harvest energy and store it in its battery. The user's computational tasks can be computed locally or at the edge server, and tasks that cannot be completed are discarded. The optimization objective is the weighted sum of the processing delay, the energy consumption, and the cost of discarding computational tasks. Because energy harvesting couples adjacent time slots, the Lyapunov framework is used to construct a drift-plus-penalty function. The rational setting of the parameter V is crucial to system performance, and future research could incorporate online adaptive methods to dynamically adjust V to cope with complex demands in time-varying environments. Finally, we propose Q-learning- and A3C-based reinforcement learning algorithms to jointly optimize the computation offloading decisions, resource allocation, and energy harvesting.

Author Contributions

Conceptualization, J.C.; methodology, J.M.; software, J.M.; validation, J.C. and Q.F.; formal analysis, W.T.; investigation, W.L. and C.G.; resources, Q.F.; data curation, J.C.; writing—original draft preparation, J.C. and C.G.; writing—review and editing, J.M.; supervision, W.T.; project administration, Q.Z.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2024 Jiangxi Provincial Department of Education Scientific and Technological Research Project, grant numbers GJJ2401518, GJJ2401507; 2024 Jiangxi Provincial Natural Science Foundation, grant number 20242BAB20128; 2022 Ji’an Science and Technology Program Special-Digital Economy Category, grant number 20222-151857; 2022 Jiangxi Province Natural Science Foundation, grant number 20224BAB205025; Jiangxi Province Major Academic and Technical Leader Training Program 20243BCE51123.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bebortta, S.; Senapati, D.; Panigrahi, C.; Pati, B. Adaptive Performance Modeling Framework for QoS-Aware Offloading in MEC-Based IIoT Systems. IEEE Internet Things J. 2022, 9, 10162–10171.
  2. Mach, P.; Becvar, Z. Mobile Edge Computing: A Survey on Architecture and Computation Offloading. IEEE Commun. Surv. Tutor. 2017, 19, 1628–1656.
  3. Hu, H.; Wang, Q.; Hu, R.; Zhu, Z. Mobility-Aware Offloading and Resource Allocation in a MEC-Enabled IoT Network with Energy Harvesting. IEEE Internet Things J. 2021, 8, 17541–17556.
  4. Ale, L.; Zhang, N.; Fang, X.; Chen, X.; Wu, S.; Li, L. Delay-Aware and Energy-Efficient Computation Offloading in Mobile-Edge Computing Using Deep Reinforcement Learning. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 881–892.
  5. Jiao, X.; Chen, Y.; Chen, Y.; Guo, X.; Zhu, W. SIC-Enabled Intelligent Online Task Concurrent Offloading for Wireless Powered MEC. IEEE Internet Things J. 2024, 11, 22684–22696.
  6. Guo, K.; Gao, R.; Xia, W.; Quek, T. Online Learning Based Computation Offloading in MEC Systems with Communication and Computation Dynamics. IEEE Trans. Commun. 2021, 69, 1147–1162.
  7. Yang, G.; Hou, L.; He, X.; He, D.; Chan, S.; Guizani, M. Offloading Time Optimization via Markov Decision Process in Mobile-Edge Computing. IEEE Internet Things J. 2021, 8, 2483–2493.
  8. Khalid, R.; Shah, Z.; Naeem, M.; Ali, A.; Al-Fuqaha, A.; Ejaz, W. Computational Efficiency Maximization for UAV-Assisted MEC Networks with Energy Harvesting in Disaster Scenarios. IEEE Internet Things J. 2024, 11, 9004–9018.
  9. Liu, L.; Chang, Z.; Guo, X. Socially Aware Dynamic Computation Offloading Scheme for Fog Computing System with Energy Harvesting Devices. IEEE Internet Things J. 2018, 5, 1869–1879.
  10. Chu, W.; Jia, X.; Yu, Z.; Lui, J.; Lin, Y. Joint Service Caching, Resource Allocation and Task Offloading for MEC-Based Networks: A Multi-Layer Optimization Approach. IEEE Trans. Mob. Comput. 2024, 23, 2958–2975.
  11. Bi, S.; Ho, C.; Zhang, R. Wireless powered communication: Opportunities and challenges. IEEE Commun. Mag. 2015, 53, 117–125.
  12. Chang, Z.; Wang, Z.; Guo, X.; Yang, C.; Han, Z.; Ristaniemi, T. Distributed Resource Allocation for Energy Efficiency in OFDMA Multicell Networks with Wireless Power Transfer. IEEE J. Sel. Areas Commun. 2019, 37, 345–356.
  13. Dai, Y.; Zhang, K.; Maharjan, S.; Zhang, Y. Edge Intelligence for Energy-Efficient Computation Offloading and Resource Allocation in 5G Beyond. IEEE Trans. Veh. Technol. 2020, 69, 12175–12186.
  14. Rodrigues, T.; Suto, K.; Nishiyama, H.; Liu, J.; Kato, N. Machine Learning Meets Computation and Communication Control in Evolving Edge and Cloud: Challenges and Future Perspective. IEEE Commun. Surv. Tutor. 2020, 22, 38–67.
  15. Guo, Y.; Zhao, R.; Lai, S.; Fan, L.; Lei, X.; Karagiannidis, G. Distributed Machine Learning for Multiuser Mobile Edge Computing Systems. IEEE J. Sel. Top. Signal Process. 2022, 16, 460–473.
  16. Zhou, H.; Jiang, K.; Liu, X.; Li, X.; Leung, V. Deep Reinforcement Learning for Energy-Efficient Computation Offloading in Mobile-Edge Computing. IEEE Internet Things J. 2022, 9, 1517–1530.
  17. Jiang, F.; Dong, L.; Wang, K.; Yang, K.; Pan, C. Distributed Resource Scheduling for Large-Scale MEC Systems: A Multiagent Ensemble Deep Reinforcement Learning with Imitation Acceleration. IEEE Internet Things J. 2022, 9, 6597–6610.
  18. Liu, T.; Ni, S.; Li, X.; Zhu, Y.; Kong, L.; Yang, Y. Deep Reinforcement Learning Based Approach for Online Service Placement and Computation Resource Allocation in Edge Computing. IEEE Trans. Mob. Comput. 2023, 22, 3870–3881.
  19. Zhao, Z.; Shi, J.; Li, Z.; Si, J.; Xiao, P.; Tafazolli, R. Multiobjective Resource Allocation for mmWave MEC Offloading Under Competition of Communication and Computing Tasks. IEEE Internet Things J. 2022, 9, 8707–8719.
  20. Malik, R.; Vu, M. Energy-Efficient Joint Wireless Charging and Computation Offloading in MEC Systems. IEEE J. Sel. Top. Signal Process. 2021, 15, 1110–1126.
  21. Saleem, U.; Liu, Y.; Jangsher, S.; Li, Y.; Jiang, T. Mobility-Aware Joint Task Scheduling and Resource Allocation for Cooperative Mobile Edge Computing. IEEE Trans. Wirel. Commun. 2021, 20, 360–374.
  22. Hu, H.; Song, W.; Wang, Q.; Hu, R.; Zhu, H. Energy Efficiency and Delay Tradeoff in an MEC-Enabled Mobile IoT Network. IEEE Internet Things J. 2022, 9, 15942–15956.
  23. Chen, Y.; Zhao, F.; Chen, X.; Wu, Y. Efficient Multi-Vehicle Task Offloading for Mobile Edge Computing in 6G Networks. IEEE Trans. Veh. Technol. 2022, 71, 4584–4595.
  24. Yin, L.; Guo, S.; Jiang, Q. Joint Task Allocation and Computation Offloading in Mobile Edge Computing with Energy Harvesting. IEEE Internet Things J. 2024, 11, 38441–38454.
  25. Gu, Q.; Jian, Y.; Wang, G.; Fan, R.; Jiang, H.; Zhong, Z. Mobile Edge Computing via Wireless Power Transfer Over Multiple Fading Blocks: An Optimal Stopping Approach. IEEE Trans. Veh. Technol. 2020, 69, 10348–10361.
  26. Guo, M.; Wang, W.; Huang, X.; Chen, Y.; Zhang, L.; Chen, L. Lyapunov-Based Partial Computation Offloading for Multiple Mobile Devices Enabled by Harvested Energy in MEC. IEEE Internet Things J. 2022, 9, 9025–9035.
  27. Yan, J.; Bi, S.; Zhang, Y.J.A. Offloading and Resource Allocation with General Task Graph in Mobile Edge Computing: A Deep Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2020, 19, 5404–5419.
  28. Liu, L.; Feng, J.; Pei, Q.; Chen, C.; Dong, M. Blockchain-Enabled Secure Data Sharing Scheme in Mobile-Edge Computing: An Asynchronous Advantage Actor–Critic Learning Approach. IEEE Internet Things J. 2021, 8, 2342–2353.
  29. Chen, Y.; Liu, Z.; Zhang, Y.; Wu, Y.; Chen, X.; Zhao, L. Deep Reinforcement Learning-Based Dynamic Resource Management for Mobile Edge Computing in Industrial Internet of Things. IEEE Trans. Ind. Inform. 2021, 17, 4925–4934.
  30. Tuong, V.; Truong, P.; Nguyen, T.; Noh, W.; Cho, S. Partial Computation Offloading in NOMA-Assisted Mobile-Edge Computing Systems Using Deep Reinforcement Learning. IEEE Internet Things J. 2021, 8, 13196–13208.
  31. Bi, S.; Huang, L.; Wang, H.; Zhang, Y. Lyapunov-Guided Deep Reinforcement Learning for Stable Online Computation Offloading in Mobile-Edge Computing Networks. IEEE Trans. Wirel. Commun. 2021, 20, 7519–7537.
  32. Samanta, A.; Chang, Z. Adaptive Service Offloading for Revenue Maximization in Mobile Edge Computing with Delay-Constraint. IEEE Internet Things J. 2019, 6, 3864–3872.
  33. Chen, J.; Xing, H.; Xiao, Z.; Xu, L.; Tao, T. A DRL Agent for Jointly Optimizing Computation Offloading and Resource Allocation in MEC. IEEE Internet Things J. 2021, 8, 17508–17524.
  34. Chen, Y.; Xu, J.; Wu, Y.; Gao, J.; Zhao, L. Dynamic Task Offloading and Resource Allocation for NOMA-Aided Mobile Edge Computing: An Energy Efficient Design. IEEE Trans. Serv. Comput. 2024, 17, 1492–1503.
  35. Li, H.; Chen, Y.; Li, K.; Yang, Y.; Huang, J. Dynamic Energy-Efficient Computation Offloading in NOMA-Enabled Air–Ground-Integrated Edge Computing. IEEE Internet Things J. 2024, 11, 37617–37629.
  36. Mao, Y.; Zhang, J.; Letaief, K. Dynamic Computation Offloading for Mobile-Edge Computing with Energy Harvesting Devices. IEEE J. Sel. Areas Commun. 2016, 34, 3590–3605.
  37. Liu, Q.; Zhang, H.; Zhang, X.; Yuan, D. Joint Service Caching, Communication and Computing Resource Allocation in Collaborative MEC Systems: A DRL-Based Two-Timescale Approach. IEEE Trans. Wirel. Commun. 2024, 23, 15493–15506.
  38. Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
  39. Sun, W.; Zhao, Y.; Ma, W.; Guo, B.; Xu, L.; Duong, T. Accelerating Convergence of Federated Learning in MEC with Dynamic Community. IEEE Trans. Mob. Comput. 2024, 23, 1769–1784.
  40. Ye, X.; Li, M.; Si, P.; Yang, R.; Wang, Z.; Zhang, Y. Collaborative and Intelligent Resource Optimization for Computing and Caching in IoV with Blockchain and MEC Using A3C Approach. IEEE Trans. Veh. Technol. 2023, 72, 1449–1463.
  41. Sun, M.; Xu, X.; Han, S.; Zheng, H.; Tao, X.; Zhang, P. Secure Computation Offloading for Device-Collaborative MEC Networks: A DRL-Based Approach. IEEE Trans. Veh. Technol. 2023, 72, 4887–4903.
  42. Tong, Z.; Cai, J.; Mei, J.; Li, K.; Li, K. Dynamic Energy-Saving Offloading Strategy Guided by Lyapunov Optimization for IoT Devices. IEEE Internet Things J. 2022, 9, 19903–19915.
Figure 1. Offloading model for mobile edge computing based on nonlinear energy harvesting.
Figure 2. Average task cost vs. distance between user and MBS.
Figure 3. Energy harvesting power vs. ratio of dropping tasks (1).
Figure 4. Energy harvesting power vs. ratio of dropping tasks (2).
Figure 5. Power of EH vs. average task cost.
Figure 6. Task arrival rate vs. ratio of dropping tasks.
Figure 7. Episode vs. cumulative reward.
Figure 8. Episodes vs. steps.
Table 1. Main comparison of related work.
Existing Works | Methodology | Optimization Objectives | Experimental Scenarios | Constraints Considered
[19] | Multiobjective optimization problem | Achieve maximum revenue and maximum service utilization | Mobile devices with latency and resource constraints | Task delay
[20] | Nested algorithm | Minimize the total energy consumption of the system | Computation offloading with wireless charging | Task delay
[21] | Genetic algorithm, heuristic algorithm | Minimize task delay | Mobile perception scene | Delay constraint, power constraint
[22] | Lyapunov optimization theory | Minimize the long-term average energy efficiency of the network | A multi-user and multi-server MEC Internet of Things system | Computing resource and power resource constraints
[23,24,25,26] | Lyapunov optimization theory | Minimize system cost | Multi-server MEC system | Energy constraint, task queue stability constraint
[27] | Deep reinforcement learning framework | Minimize the energy time cost (ETC) of MD | Multi-task MEC system | Delay constraints
[28] | A3C deep reinforcement learning algorithm | Minimize the energy consumption of the MEC system and maximize the throughput of the blockchain system | Data sharing in MEC | Delay constraints
[29] | DDPG | Minimize the long-term average delay of all computing tasks | Dynamic resource management in MEC | Power constraints, energy consumption constraints, computing resource constraints
[30] | The ACDQN algorithm of deep reinforcement learning | Minimize the weighted sum of energy consumption and delay | NOMA-assisted MEC system | Channel resource constraints
[31] | Lyapunov-guided deep reinforcement learning | Maximize the weighted sum calculation rate | Multi-user MEC | Offloading decision constraints, resource allocation constraints
Table 2. Summary of key notations.
Notation | Meaning
U | The set of users in the system
n | The set of mobile devices
m | The set of MBSs
t | The time slot
τ | The length of each time slot
λ_i(t) | The data size of the arrived computation task
τ_{i,d}(t) | The maximum delay for completing the computation task
D_i(t) | The set of computation task offloading decision strategies
x_{i,l}(t) | The local computing decision
x_{i,m}(t) | The decision to offload from MD i to the MBS m server for processing
x_{i,d}(t) | The decision to discard the computation task
f_{i,l}(t) | The computation resource allocated for local computation at MD i
f_{i,m}(t) | The computation resource allocated by MBS server m to MD i
ω_{i,m}(t) | The bandwidth allocation between MD i and MBS m
p_{i,m}(t) | The power allocation of MD i for task offloading
e_i(t) | The harvested energy of MD i in time slot t
B_i(t) | The battery capacity of MD i in time slot t
α_i | The weighting parameter of energy consumption
β | The weighting parameter of the delay of task completion
γ_i | The weighting parameter of the cost of task discarding
g(t) | The channel gains between MDs and MBSs
F | The available MEC server computing resources
H(t) | The channel interference between MDs and MBSs
Table 3. Simulation parameters.
Notation | Value
ω | 10^{-27}
τ | 1 ms
p_i^{max} | 2 W [33]
f_i^{max} | 1.5 GHz [42]
d_{i,m} | [10, 400] m
B_{i,m} | [2, 12] MHz
E_{i,max} | 20 mJ
p_m | 4 W
ψ_i | 0.0001
F_{max} | 10 GHz
c_i | 5900 cycles/bit
σ | −4
T_{i,max} | 2 s
λ_i | [1, 10] × 2^{10} bit
N | 10
M | 4
p_{i,id} | 0.002 W
