Intelligent End-Edge Computation Offloading Based on Lyapunov-Guided Deep Reinforcement Learning

Feng, Xue; Xu, Chi; Jin, Xi; Xia, Changqing; Jiang, Jing

doi:10.3390/app142311160


by Xue Feng ^1,2,3, Chi Xu ^2,3,*, Xi Jin ^2,3, Changqing Xia ^2,3 and Jing Jiang

¹ College of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China

² State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China

³ Key Laboratory of Networked Control Systems, Chinese Academy of Sciences, Shenyang 110016, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(23), 11160; https://doi.org/10.3390/app142311160

Submission received: 24 September 2024 / Revised: 22 November 2024 / Accepted: 27 November 2024 / Published: 29 November 2024

(This article belongs to the Special Issue Real-Time Systems and Industrial Internet of Things)


Abstract

To address the end-edge computation offloading challenge in multi-terminal, multi-server environments, this paper proposes an intelligent computation offloading algorithm based on Lyapunov optimization and deep reinforcement learning. We formulate a network computation rate maximization problem subject to constraints on offloading time, CPU frequency, energy consumption, transmission power, and data queue stability. Because the problem is a mixed-integer nonlinear program, we transform it into a deterministic problem based on Lyapunov optimization theory, and then model it as a Markov decision process. We then employ a deep reinforcement learning algorithm, asynchronous advantage actor-critic (A3C), and propose a Lyapunov-guided A3C algorithm named LyA3C to approximate the optimal computation offloading policy. Experiments show that the LyA3C algorithm converges stably and improves the long-term network computation rate by 2.8% and 5.7% compared with the A2C-based and TD3-based algorithms, respectively.

Keywords:

edge computing; computation offloading; Lyapunov optimization; deep reinforcement learning

1. Introduction

In the context of the fourth industrial revolution represented by “Industry 4.0”, communication technology and manufacturing technology are deeply integrated, enhancing the networking and intelligence of industrial production processes [1]. Thus, it is necessary to establish an interconnected, intelligent and stable industrial wireless network [2]. Industrial wireless technology, an innovation of this century, has become a research hot spot after the field bus. It changes how information is transferred in existing control systems, and offers robust anti-interference capabilities, low power consumption, high reliability, and numerous other technical advantages [3]. In the future, industrial control technology will impose stringent requirements on reliability, latency, and data rate, further underscoring the significance of these advancements.

The industrial production process involves a large number of industrial tasks and the rapid increase of CPU-bound applications and delay-sensitive applications has added new challenges to the development of industrial wireless networks. Confronted with this deluge of industrial tasks, the conventional cloud computing model encounters limitations in addressing data processing and computational demands. Mobile edge computing (MEC) has emerged to enhance data processing efficiency.

By strategically deploying edge servers at the network’s periphery, the MEC system brings the computing power of the cloud center closer to the industrial terminals, meeting their growing demand for computation. This arrangement shortens the transmission time of industrial tasks and mitigates the latency of cloud computing. Nonetheless, when large-scale industrial tasks migrate to these edge servers concurrently, network congestion occurs, increasing task processing delay and reducing computational throughput. Consequently, computation offloading in industrial wireless networks faces two pivotal challenges: first, devising offloading strategies that are calibrated to performance metrics; second, dynamically allocating industrial tasks according to actual demand to ensure efficient resource utilization. Addressing these challenges requires improving resource utilization and optimizing performance metrics such as energy consumption, time delay, and computation rate.

When tackling dynamic challenges in the computation offloading strategies, ensuring system stability is very important. The Lyapunov optimization theory, originating from control theory, emerges as a robust approach to validate system stability and has been ingeniously adapted to construct queuing models for dynamic systems. Incorporating queue size into the objective function resolution, it fosters a relatively stable system state, thereby achieving the desired optimization objectives. Solving dynamic problems such as mixed-integer nonlinear programming (MINLP) requires high computation complexity, especially in large networks. The Lyapunov optimization theory excels in dynamically adapting to the ever-changing network conditions at every instant of system evolution, negating the need for manual intervention in adjusting control variables [4]. It is adaptable to real-time control of dynamically changing systems. Also, it ensures relatively low computation complexity, which is more straightforward than the direct solution of other algorithms [5]. Therefore, the application of Lyapunov optimization can transform the MINLP problem into continuous deterministic sub-problems while providing theoretical guarantee for the long-term stability of the system.

Traditional model-driven computation offloading and resource allocation strategies rely on exhaustive system information to construct accurate system models. Nevertheless, the inherent dynamism and randomness of industrial wireless networks pose significant challenges in gathering comprehensive system information necessary for accurate modeling [6]. This becomes particularly problematic in large-scale industrial settings, where a vast amount of intricately coupled system data results in an unwieldy state space, significantly impeding the efficiency of reinforcement learning. To overcome this bottleneck, data-driven deep reinforcement learning (DRL), as an innovative scheme, opens up new avenues for the computation offloading problems. DRL skillfully combines deep learning and reinforcement learning, using deep neural networks (DNN) to directly map from the environment state to the optimal action in order to maximize the long-term reward. This process is done automatically through continuous interaction with the environment. This integration not only alleviates the computational complexity of the problem but also endows the approach with the capability to autonomously learn from past experiences, bypassing the need for manually labeled training data, thus greatly facilitating the feasibility of its real-time online application. This is significant for efficiently generating computation offloading strategies in dynamic environments.

However, existing works have given limited consideration to the long-term network computation rate problem under long-term stability of the system. To this end, this paper proposes an intelligent computation offloading algorithm based on Lyapunov optimization and DRL for the end-edge task offloading problem under long-term stability constraints to maximize the long-term network computation rate. In the case of rapidly changing channel conditions and dynamic task arrivals, our algorithm can maximize the long-term network computation rate and guarantee long-term stability of the system.

The main contributions of this paper are summarized as follows.

Firstly, we construct an edge computing-enhanced industrial wireless network, where each industrial terminal can choose an industrial base station for binary computation offloading. In this scenario, we formulate a network computation rate maximization problem while balancing constraints including offloading time, CPU frequency, energy consumption, transmit power, and data queue stability.
Secondly, to maintain system stability, we define two dynamic queues, which are task queue and virtual energy queue. Based on these two dynamic queues, we minimize the upper bound of the drift-penalty function by using Lyapunov optimization, and transform the long-term MINLP problem into deterministic sub-problems. As the Lyapunov optimization problem is still non-convex, we further solve the problem with DRL.
Thirdly, we employ asynchronous advantage actor-critic (A3C) and propose a Lyapunov-guided A3C algorithm named LyA3C to solve the problem. Experimental results show that the LyA3C algorithm converges stably and effectively improves the long-term network computation rate compared with the A2C-based and TD3-based algorithms.

The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 describes the system model. Then, Section 4 presents the proposed LyA3C algorithm. Next, Section 5 evaluates LyA3C through extensive experiments. Finally, Section 6 concludes the whole work.

2. Related Work

For industrial tasks, different algorithms were proposed to balance the computation offloading and resource allocation in different scenarios.

Most existing works take delay or energy consumption minimization as the optimization objective. For example, Ref. [7] employed the double and dueling architectures on the basis of the deep Q-network, and proposed a D3QN-based multi-priority computation offloading scheme to minimize the overall task delay. Ref. [8] used Lyapunov optimization to decompose the multi-layer multi-timescale resource allocation problem into three sub-problems, and employed a deep actor-critic algorithm to minimize the total queuing delay of all devices. Ref. [9] proposed a multi-agent deep reinforcement learning (MADRL)-based scheduling algorithm, in which an actor-critic (AC) framework with estimation and target networks is designed for policy and value iterations to minimize delay. Similarly, Ref. [10] proposed a multi-agent double actor-critic algorithm to reduce the task processing delay while improving the blockchain transaction throughput. Ref. [11] designed a joint communication and computation resource allocation mechanism based on Q-learning to minimize the total task delay cost. On the other hand, Ref. [12] incorporated a data fusion system in the architecture and designed a joint computation offloading and resource allocation scheme to minimize the overall queuing delay of the system. Ref. [13] designed an algorithm that decomposes the problem into task offloading and channel allocation sub-problems, and proposed a low-complexity heuristic algorithm to solve the sub-problems efficiently for total weighted task processing latency minimization. Ref. [14] developed a novel online SBS peer offloading framework by leveraging the Lyapunov technique, in order to maximize the long-term system performance while keeping the energy consumption of SBSs below individual long-term constraints. Ref. [15] aimed to optimize the average energy consumption by using the non-orthogonal multiple access (NOMA) technique to improve the spectral efficiency and access, with the problem solved using successive convex approximation. Ref. [16] combined the NOMA technique and frequency division multiple access and used a long short-term memory network to optimize the objective. Ref. [17] used a multi-agent deep deterministic policy gradient algorithm in air-ground integrated computing networks to minimize the average energy consumption by jointly optimizing the computation task allocation and wireless resource allocation.

Furthermore, many works combine time delay and energy consumption as the optimization objective. For example, Ref. [18] added a greedy algorithm to the multi-agent deep deterministic policy gradient algorithm to minimize the weighted sum of delay and energy consumption. Ref. [19] established a dual queue model containing data and computation queues, and used the proximal policy optimization algorithm to minimize the weighted sum of average time delay and system energy consumption. Ref. [20] proposed an online joint offloading and resource allocation framework under the long-term MEC energy constraint, aimed at guaranteeing the users’ QoE. Ref. [21] aimed to minimize the expected task cost, which includes the task execution time delay and energy consumption, and applied the time division multiple access technique to optimize the objective function.

In addition, some works use other variables as the optimization objectives. For example, Ref. [22] considered the computation rate and energy consumption in computation offloading, and proposed a deep deterministic policy gradient-based multiple continuous variable decision model to make the optimal offloading decision in edge computing. Ref. [23] proposed a computation offloading algorithm based on deep Q-learning and used asynchronous federated deep Q-learning to offload tasks. Ref. [24] proposed a multi-agent soft-actor-critic-discrete algorithm for task offloading and resource allocation to maximize throughput while minimizing power consumption on the remote side. Ref. [25] proposed a low-complexity online computation offloading and trajectory scheduling algorithm to minimize the system energy efficiency by using Lyapunov optimization methods, where the system energy efficiency is defined as the ratio of the system’s total long-term energy consumption. Ref. [26] took the same variables as the optimization objective and used convex decomposition methods.

3. System Model

3.1. Network Model

In this paper, we propose an end-edge industrial wireless network system with edge computation capability, which consists of an edge layer and an end layer. As depicted in Figure 1, the industrial wireless network is composed of M industrial base stations and N industrial terminals to support industrial production and manufacturing in factories. We define

\mathcal{M} = \{1, 2, \dots, M\}

as the set of M industrial base stations and

\mathcal{N} = \{1, 2, \dots, N\}

as the set of N industrial terminals.

In this scenario, the industrial base station is equipped with an edge computing server, which is designed to supply computation resources for multiple industrial terminals. It also facilitates the scheduling of industrial terminals within its service area. The end layer comprises numerous industrial terminals with sensing, computing, communication and control capabilities, each of which generates an indivisible task that necessitates processing. Initially, each industrial terminal endeavors to process the task data using its own resources. However, when the task’s computation requirements, in terms of cycles, are substantial, the industrial terminal may find itself unable to meet the designated deadline due to resource limitations. In such case, the industrial terminal must offload the task to a proximate industrial base station to ensure timely completion.

In t-th time slot, the channel gain between m-th industrial base station and n-th industrial terminal is denoted as

h_{n, m}

, where

h_{n, m}

remains constant during the time slot, but it varies independently from time slot to time slot.

3.2. Task Model

For n-th industrial terminal, its computation task is denoted as

D_{n}

. The data arriving at the data queue of n-th industrial terminal is denoted as

A_{n}

and it is assumed that

E [{(A_{n})}^{2}] = η_{n} < \infty

,

n = 1, \dots, N

.

When

d_{n, m} = 1

, n-th industrial terminal offloads all task

D_{n}

to m-th industrial base station. When

d_{n, m} = 0

, n-th industrial terminal does not offload any task to m-th industrial base station. According to the execution location of the computation task and the size of the offloading data, the offloading decision is binary with two cases: end computing and edge computing.

Correspondingly, each industrial terminal has a task buffer, the queue length of the buffer in t-th time slot is denoted as

Q_{n} (t)

. Thus, the queue length can be updated as

Q_{n} (t + 1) = max {Q_{n} (t) - D_{n}, 0} + A_{n},

(1)
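As a minimal sketch of Equation (1), the following Python snippet iterates the task-queue update for a single industrial terminal. The function name and the per-slot arrival and processing values are illustrative assumptions, not values from the experiments.

```python
def update_task_queue(q, processed, arrived):
    """One step of Eq. (1): Q(t+1) = max(Q(t) - D, 0) + A."""
    return max(q - processed, 0.0) + arrived


# Hypothetical per-slot values, all in the same data unit (e.g., Mbit).
arrivals = [3.0, 1.0, 4.0, 0.5]   # A_n in each slot
processed = [2.0, 2.5, 3.0, 3.0]  # D_n in each slot

q = 0.0
history = [q]
for a, d in zip(arrivals, processed):
    q = update_task_queue(q, d, a)
    history.append(q)

print(history)
```

The `max` keeps the backlog nonnegative: in a slot where the terminal could process more data than is queued, the queue simply empties before the new arrivals are added.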

3.3. End Computing Model

When n-th industrial terminal processes data locally, the local CPU frequency

f_{n}

is bounded by an upper limit value

f_{n}^{max}

, namely

f_{n} \leq f_{n}^{max}

. Thus, the amount of data processed locally in t-th time slot is calculated as

D_{n}^{end} = f_{n} τ_{n} T / ϕ,

(2)

where

τ_{n} \in [0, 1]

is the computation offloading time ratio and T is the length of a frame. Obviously,

τ_{n} T

is denoted as the amount of time allocated to the industrial terminal for computation offloading, and there should be

\sum_{n = 1}^{N} τ_{n} \leq 1

.

Furthermore, the energy consumption in t-th time slot is calculated as

E_{n}^{end} = κ {(f_{n})}^{3} τ_{n} T, \forall d_{n, m} = 0,

(3)

where

ϕ > 0

represents the number of computation cycles required to process one bit of data,

κ > 0

represents the computation energy efficiency parameter.
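To make the trade-off in Equations (2) and (3) concrete, the sketch below evaluates both for one terminal. All parameter values are hypothetical placeholders chosen only to give round numbers. The point it illustrates is that the processed data grows linearly with the CPU frequency while the energy grows with its cube.

```python
def local_bits(f, tau, T, phi):
    """Eq. (2): bits processed locally in one frame."""
    return f * tau * T / phi


def local_energy(f, tau, T, kappa):
    """Eq. (3): energy spent on local computing in one frame."""
    return kappa * f ** 3 * tau * T


# Hypothetical parameters (not taken from the paper's experiments).
f = 1.0e9        # CPU frequency, cycles per second
tau = 0.5        # time ratio allocated to this terminal
T = 1.0          # frame length, seconds
phi = 1000.0     # cycles needed per bit
kappa = 1.0e-27  # computation energy efficiency parameter

bits = local_bits(f, tau, T, phi)
energy = local_energy(f, tau, T, kappa)

# Doubling the frequency doubles the data but costs eight times the energy.
bits_ratio = local_bits(2 * f, tau, T, phi) / bits
energy_ratio = local_energy(2 * f, tau, T, kappa) / energy
print(bits, energy, bits_ratio, energy_ratio)
```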

3.4. Edge Computing Model

When n-th industrial terminal decides to offload data to the edge layer, we define the transmit power as

P_{n}

with an upper limit

P_{n}^{max}

, namely

P_{n} \leq P_{n}^{max}

. Note that n-th industrial terminal evaluates the computation resources of each candidate industrial base station, and only chooses one of the multiple industrial base stations for computation offloading at one time.

The energy consumption for computation offloading is calculated as

e_{n} = P_{n} τ_{n} T

, where

P_{n}

is the transmit power. Here, we neglect the delay of edge computing and result downloading. Thus, according to Shannon’s theorem, the amount of data offloaded to the edge layer is given by

\begin{aligned} D_{n, m}^{edge} &= W \log_{2} (1 + \frac{P_{n} h_{n, m}}{N_{0}}) τ_{n} T \\ &= W \log_{2} (1 + \frac{e_{n} h_{n, m}}{τ_{n} T N_{0}}) τ_{n} T, \forall d_{n, m} = 1, \end{aligned}

(4)

where W is bandwidth,

h_{n, m}

is the channel gain between m-th industrial base station and n-th industrial terminal, and

N_{0}

denotes the noise power.

In this way, the total computation task by end computing or edge computing is calculated as

D_{n, m} = (1 - d_{n, m}) D_{n}^{end} + d_{n, m} D_{n, m}^{edge}

, namely

D_{n, m} = \frac{(1 - d_{n, m}) f_{n} τ_{n} T}{ϕ} + d_{n, m} W \log_{2} (1 + \frac{e_{n} h_{n, m}}{τ_{n} T N_{0}}) τ_{n} T

(5)

Correspondingly, the network computation rate given by

z_{n, m} = D_{n, m} / T

is calculated as

z_{n, m} = \frac{(1 - d_{n, m}) f_{n} τ_{n}}{ϕ} + d_{n, m} W {log}_{2} (1 + \frac{e_{n} h_{n, m}}{τ_{n} T N_{0}}) τ_{n},

(6)

As the industrial base stations are powered by constant energy, we ignore the computing energy consumption and only consider the offloading energy consumption. Thus, the energy consumption for edge computing is calculated as

E_{n}^{edge} = e_{n}

. Furthermore, the total energy consumption is calculated as

e_{n, m} = (1 - d_{n, m}) E_{n}^{end} + d_{n, m} E_{n}^{edge}

, namely

e_{n, m} = (1 - d_{n, m}) κ {(f_{n})}^{3} τ_{n} T + d_{n, m} e_{n},

(7)

where

κ

denotes the computing energy efficiency.
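The edge computing model can be sketched in a few lines of Python. The snippet below checks numerically that the power form and the energy form of Equation (4) agree when the offloading energy is the transmit power multiplied by the allocated time, and then evaluates the rate in Equation (6) and the energy in Equation (7) for both values of the binary decision. Function names and all numeric values are assumptions made for illustration.

```python
import math


def edge_bits_from_power(P, h, tau, T, W, N0):
    """First line of Eq. (4): offloaded bits in terms of transmit power."""
    return W * math.log2(1.0 + P * h / N0) * tau * T


def edge_bits_from_energy(e, h, tau, T, W, N0):
    """Second line of Eq. (4): the same quantity in terms of energy e = P * tau * T."""
    if tau <= 0.0:
        return 0.0  # no time allocated, nothing can be sent
    return W * math.log2(1.0 + e * h / (tau * T * N0)) * tau * T


def computation_rate(d, f, e, h, tau, T, W, N0, phi):
    """Eq. (6): d = 0 selects end computing, d = 1 selects edge computing."""
    local_rate = f * tau / phi
    edge_rate = edge_bits_from_energy(e, h, tau, T, W, N0) / T
    return (1 - d) * local_rate + d * edge_rate


def energy_cost(d, f, e, tau, T, kappa):
    """Eq. (7): local computing energy or offloading energy."""
    return (1 - d) * kappa * f ** 3 * tau * T + d * e


# Hypothetical parameters (not taken from the paper's experiments).
W, N0, h = 1.0e6, 1.0e-9, 1.0e-6
P, tau, T = 0.1, 0.5, 1.0
f, phi, kappa = 1.0e9, 1000.0, 1.0e-27
e = P * tau * T

bits_a = edge_bits_from_power(P, h, tau, T, W, N0)
bits_b = edge_bits_from_energy(e, h, tau, T, W, N0)
rate_local = computation_rate(0, f, e, h, tau, T, W, N0, phi)
rate_edge = computation_rate(1, f, e, h, tau, T, W, N0, phi)
print(bits_a, bits_b, rate_local, rate_edge)
```

With these particular numbers offloading gives the higher rate, but a weaker channel gain or a faster local CPU would reverse the comparison, which is why the decision has to be made per time slot.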

4. End-Edge Computation Offloading Based on Lyapunov-Guided DRL

In this section, we first explicitly formulate the network computation rate maximization problem, then decouple the problem by Lyapunov optimization and finally propose the intelligent computation offloading algorithm based on Lyapunov-guided DRL.

4.1. Problem Formulation

In order to achieve end-edge computation offloading for complex industrial tasks, we formulate the network computation rate maximization problem, assuming that each industrial terminal generates only one indivisible real-time task

S_{n}

. The set of tasks is represented as

S = {S_{1}, S_{2}, \dots, S_{N}}

, where each task is denoted as

S_{n} = {d, τ, f, e}

. In detail,

d_{n} = \{d_{n, 1}, d_{n, 2}, \dots, d_{n, M}\}

,

τ = {τ_{1}, τ_{2}, \dots, τ_{N}}

,

f = {f_{1}, f_{2}, \dots, f_{N}}

,

e = {e_{1}, e_{2}, \dots, e_{N}}

.

Then, we formulate the network computation rate maximization problem

P_{1}

as

\begin{aligned} \max_{d, τ, f, e} \quad & \lim_{K \to \infty} \frac{1}{K} \sum_{t = 1}^{K} \sum_{n = 1}^{N} \sum_{m = 1}^{M} β_{n} z_{n, m} \\ \text{s.t.} \quad & C_{1} : \sum_{n = 1}^{N} τ_{n} \leq 1, \\ & C_{2} : d_{n, m} \in \{0, 1\}, \\ & C_{3} : f_{n} \leq f_{n}^{max}, \forall n, \\ & C_{4} : P_{n} \leq P_{n}^{max}, \forall n, \\ & C_{5} : e_{n} \leq P_{n}^{max} τ_{n}, \forall n, \\ & C_{6} : \sum_{m = 1}^{M} (\frac{(1 - d_{n, m}) f_{n} τ_{n} T}{ϕ} + d_{n, m} W \log_{2} (1 + \frac{e_{n} h_{n, m}}{τ_{n} T N_{0}}) τ_{n} T) \leq Q_{n} (t), \forall n, \\ & C_{7} : \lim_{K \to \infty} \frac{1}{K} \sum_{t = 1}^{K} E [Q_{n} (t)] < \infty, \forall n, \\ & C_{8} : \lim_{K \to \infty} \frac{1}{K} \sum_{t = 1}^{K} E [(1 - d_{n, m}) κ {(f_{n})}^{3} τ_{n} + d_{n, m} e_{n}] < ξ_{n}, \forall n, \end{aligned}

(8)

where

β_{n}

is the weight for n-th industrial terminal,

ξ_{n}

is the power threshold.

C_{1}

is the computation offloading time ratio constraint, which means the ratio sum of computation offloading time should not exceed one.

C_{2}

is the binary computation offloading decision constraint.

C_{3}

is local CPU frequency constraint, which means there is a maximum value

f_{n}^{max}

for the local CPU frequency.

C_{4}

is the transmit power constraint, which means there is a maximum value

P_{n}^{max}

for the transmit power.

C_{5}

is the energy consumption constraint used for data offloading, which means there is a maximum value

P_{n}^{max} τ_{n}

for the energy consumption.

C_{6}

is the data causality constraint.

C_{7}

is the data queue stability constraint, which requires the data queue to be strongly stable, i.e., its time-averaged expected length is finite.

C_{8}

is the average energy consumption constraint, which means there is a maximum value

ξ_{n}

for the average energy consumption.

4.2. Problem Transformation by Lyapunov Optimization

As the problem

P_{1}

is a MINLP problem, solving it directly is complicated. We therefore introduce Lyapunov optimization theory to model the dynamic queues, which decomposes the MINLP problem into deterministic sub-problems while ensuring the stability of the system.

In order to satisfy the average power consumption constraint

C_{8}

in

P_{1}

, we introduce N × M virtual energy queues

{Y_{n, m} (t)}_{N \times M}

for industrial terminals. Let

Y_{n, m} (1) = 0

, the queue length can be updated as

Y_{n, m} (t + 1) = max (Y_{n, m} (t) + e_{n, m} - ξ_{n}, 0),

(9)

When the virtual energy queue is stabilized, the average energy consumption satisfies

e_{n, m} \leq ξ_{n}

.
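The behavior of the virtual energy queue in Equation (9) can be illustrated with a short Python sketch for a single terminal and base station pair. The threshold and per-slot energy values are hypothetical; the backlog rises in slots where consumption exceeds the threshold and drains otherwise.

```python
def update_energy_queue(y, e, xi):
    """One step of Eq. (9): Y(t+1) = max(Y(t) + e - xi, 0)."""
    return max(y + e - xi, 0.0)


xi = 1.0                                # hypothetical threshold
energies = [1.5, 1.5, 0.5, 0.25, 2.0]   # hypothetical per-slot consumption

y = 0.0  # Y(1) = 0
backlog = []
for e in energies:
    y = update_energy_queue(y, e, xi)
    backlog.append(y)

print(backlog)
```

A persistently growing backlog would indicate that the average consumption exceeds the threshold, so keeping this queue stable is a way of enforcing the long-term energy constraint without looking into the future.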

After establishing dynamic queues, we define

F (t) = {Q (t), Y (t)}

as the total queue backlog, then we employ quadratic Lyapunov function

L (F (t)) = \frac{1}{2} \sum_{n = 1}^{N} Q_{n} {(t)}^{2} + \frac{1}{2} \sum_{n = 1}^{N} \sum_{m = 1}^{M} Y_{n, m} {(t)}^{2},

(10)

to represent the stability of the queues.

Correspondingly, the Lyapunov drift function of the total queue backlog is defined as

Δ L (F (t)) = E [L (F (t + 1)) - L (F (t)) | F (t)],

(11)

which demonstrates the improvement from t-th time slot to (

t + 1

)-th time slot. That is to say, the Lyapunov drift can reflect the system dynamics between two continuous states. From the perspective of system stability, the smaller the Lyapunov drift, the more stable the system is.

However, the Lyapunov drift still depends on the system information of the next time slot. Thus, we further derive an upper bound on the Lyapunov drift.

According to [27], for nonnegative real numbers A, B, C, and D satisfying

A \leq max (B - C, 0) + D

, there should be

A^{2} \leq B^{2} + C^{2} + D^{2} - 2 B (C - D)

, so we have

\begin{aligned} Q_{n} {(t + 1)}^{2} &= {(\max \{Q_{n} (t) - D_{n}, 0\} + A_{n})}^{2} \\ &\leq Q_{n} {(t)}^{2} + {(D_{n})}^{2} + {(A_{n})}^{2} + 2 Q_{n} (t) (A_{n} - D_{n}), \end{aligned}

(12)
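The inequality used to obtain Equation (12) can be sanity-checked numerically. The sketch below samples nonnegative B, C, and D at random, sets A to its largest admissible value max(B - C, 0) + D, and confirms the bound on every sample. This is only an empirical check under an assumed sampling range, not a proof.

```python
import random


def bound_holds(b, c, d, tol=1e-9):
    """Check A^2 <= B^2 + C^2 + D^2 - 2B(C - D) for A = max(B - C, 0) + D."""
    a = max(b - c, 0.0) + d
    lhs = a * a
    rhs = b * b + c * c + d * d - 2.0 * b * (c - d)
    return lhs <= rhs + tol  # small tolerance for floating-point rounding


random.seed(0)
samples = [
    (random.uniform(0, 10), random.uniform(0, 10), random.uniform(0, 10))
    for _ in range(10000)
]
all_ok = all(bound_holds(b, c, d) for b, c, d in samples)
print(all_ok)
```

Substituting B = Q_n(t), C = D_n, and D = A_n turns the right-hand side into the bound in Equation (12).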

In this way, we can derive the upper bound of the Lyapunov drift function for data queue as

\begin{matrix} Δ L (Q_{n} (t)) & = E [\frac{1}{2} \sum_{n = 1}^{N} Q_{n} {(t + 1)}^{2} - \frac{1}{2} \sum_{n = 1}^{N} Q_{n} {(t)}^{2} |Q_{n} (t)] \\ \leq E [\sum_{n = 1}^{N} (\frac{1}{2} ({(D_{n})}^{2} + {(A_{n})}^{2}) + Q_{n} (t) (A_{n} - D_{n})) |Q_{n} (t)] \\ = B_{1} + E [\sum_{n = 1}^{N} (Q_{n} (t) (A_{n} - D_{n})) |Q_{n} (t)], \end{matrix}

(13)

where

B_{1} = \sum_{n = 1}^{N} \frac{{(D_{n})}^{2} + {(A_{n})}^{2}}{2}

.

At the same time, we can have the upper bound of the Lyapunov drift function for virtual energy queue as

Δ L (Y_{n, m} (t)) \leq B_{2} + E [\sum_{n = 1}^{N} \sum_{m = 1}^{M} Y_{n, m} (t) (e_{n, m} - ξ_{n}) |Y_{n, m} (t)],

(14)

where

B_{2} = \sum_{n = 1}^{N} \sum_{m = 1}^{M} \frac{{(e_{n, m})}^{2} + {(ξ_{n})}^{2}}{2}

.

Through the above process, the upper bound of the Lyapunov drift function for total queue backlog is given by

Δ L (F (t)) \leq B + \sum_{n = 1}^{N} Q_{n} (t) E [(A_{n} - D_{n}) |F (t)] + \sum_{n = 1}^{N} \sum_{m = 1}^{M} Y_{n, m} (t) E [(e_{n, m} - ξ_{n}) |F (t)],

(15)

where

B = B_{1} + B_{2}

.

At the same time, it is obvious that the upper bound of the Lyapunov drift function is no longer decided by the system information of the next time slot. Then, the Lyapunov drift-plus-penalty approach is used to maximize the network computation rate while satisfying queue stability, that is to say minimizing the drift-plus-penalty expression within each time slot, we have

Δ (F (t)) ≜ Δ L (F (t)) - V \sum_{n = 1}^{N} \sum_{m = 1}^{M} E [β_{n} z_{n, m} |F (t)],

(16)

where

V > 0

is penalty weight that controls the significance of system data and virtual energy queue backlog.

Then the upper bound of the Lyapunov drift-plus-penalty is given by

\begin{aligned} Δ (F (t)) \leq {} & B + \sum_{n = 1}^{N} Q_{n} (t) E [(A_{n} - D_{n}) |F (t)] \\ & + \sum_{n = 1}^{N} \sum_{m = 1}^{M} Y_{n, m} (t) E [(e_{n, m} - ξ_{n}) |F (t)] - V \sum_{n = 1}^{N} \sum_{m = 1}^{M} E [β_{n} z_{n, m} |F (t)], \end{aligned}

(17)

By removing the constant terms from observation at the beginning of t-th time slot, the algorithm decides the actions by maximizing the following

\sum_{n = 1}^{N} \sum_{m = 1}^{M} ((Q_{n} (t) + V β_{n}) z_{n, m} - Y_{n, m} (t) e_{n, m}),

(18)
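To show how the queue backlogs weight the per-slot decision, the following sketch evaluates the expression in Equation (18) for a toy network with two terminals and two base stations. The function name and every number are illustrative assumptions; in particular, the rates and energies are given directly rather than computed from the channel model.

```python
def slot_objective(Q, Y, z, e, beta, V):
    """Eq. (18): sum over n, m of (Q_n + V * beta_n) * z_nm - Y_nm * e_nm."""
    total = 0.0
    for n in range(len(Q)):
        for m in range(len(Y[n])):
            total += (Q[n] + V * beta[n]) * z[n][m] - Y[n][m] * e[n][m]
    return total


# Hypothetical state and candidate action for N = 2 terminals, M = 2 stations.
Q = [2.0, 0.0]                  # data queue backlogs
Y = [[0.0, 0.0], [4.0, 0.0]]    # virtual energy queue backlogs
beta = [1.0, 1.0]               # terminal weights
V = 10.0                        # penalty weight
z = [[1.0, 0.0], [0.5, 0.0]]    # computation rates of the candidate action
e = [[0.5, 0.0], [1.0, 0.0]]    # energy consumption of the candidate action

value = slot_objective(Q, Y, z, e, beta, V)
print(value)
```

In this example the second terminal has a large energy backlog, so its contribution is cut from 5 to 1 by the energy term. A larger data backlog or a larger V has the opposite effect and favors actions with a higher computation rate.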

Then the problem

P_{1}

can be decoupled into deterministic sub-problems

P_{2}

, we formulate the problem

P_{2}

as

\begin{aligned} \max_{d, τ, f, e} \quad & \sum_{n = 1}^{N} \sum_{m = 1}^{M} ((Q_{n} (t) + V β_{n}) z_{n, m} - Y_{n, m} (t) e_{n, m}) \\ \text{s.t.} \quad & C_{1} : \sum_{n = 1}^{N} τ_{n} \leq 1, \\ & C_{2} : d_{n, m} \in \{0, 1\}, \\ & C_{3} : f_{n} \leq f_{n}^{max}, \forall n, \\ & C_{4} : P_{n} \leq P_{n}^{max}, \forall n, \\ & C_{5} : e_{n} \leq P_{n}^{max} τ_{n}, \forall n, \\ & C_{6} : \sum_{m = 1}^{M} (\frac{(1 - d_{n, m}) f_{n} τ_{n} T}{ϕ} + d_{n, m} W \log_{2} (1 + \frac{e_{n} h_{n, m}}{τ_{n} T N_{0}}) τ_{n} T) \leq Q_{n} (t), \forall n, \end{aligned}

(19)

At this time,

P_{1}

has been decoupled into deterministic sub-problems within each time slot, the next step is to apply DRL algorithm to solve

P_{2}

in each time slot.

4.3. MDP Modeling

However, the transformed problem

P_{2}

is still difficult to solve as the variables are still coupled with each other. Thus, we first transform it into an MDP and then employ DRL to approximate the optimal solution.

For MDP modeling, there are state space, action space and reward space for DRL.

State space: In t-th time slot, the state $s (t)$ consisting of the channel gain and the system queue is defined as

$s (t) = {h (t), Q (t), Y (t)},$

(20)

where $h (t) = {h_{n, m} (t)}_{N \times M}$ is the channel gain among industrial base stations and industrial terminals, $Q (t) = {Q_{n} (t)}_{N}$ is the set of data queue length, and $Y (t) = {Y_{n, m} (t)}_{N \times M}$ is the set of virtual energy queue length.
Action space: In t-th time slot, the action $a (t)$ consists of an optimal offloading action, the computation offloading time ratio, the local CPU frequency and the total energy consumption, defined as

$a (t) = {d (t), τ (t), f (t), e (t)},$

(21)

where $d (t) = {d_{n, m} (t)}_{N \times M}$ is the set of optimal offloading action, and $τ (t) = {τ_{n} (t)}_{N}$ is the set of computation offloading time ratio, $f (t) = {f_{n} (t)}_{N}$ is the set of local CPU frequency, $e (t) = {e_{n, m} (t)}_{N \times M}$ is the set of energy consumption.
Reward space: The reward $r (t)$ denotes the reward generated when taking action $a (t)$ in the current state $s (t)$ . According to the objective function, the reward $r_{n, m} (t)$ for n-th industrial terminal and m-th industrial base station in t-th time slot is defined as

$r_{n, m} (t) = (Q_{n} (t) + V β_{n}) z_{n, m} - Y_{n, m} (t) e_{n, m} (t) .$

(22)

Based on this, the sum of incentives received by all industrial terminals is calculated as

$r (t) = \sum_{n = 1}^{N} \sum_{m = 1}^{M} r_{n, m} (t),$

(23)

Obviously, the larger the reward, the larger the network computation rate of the industrial terminal. Furthermore, the cumulative reward is defined as

$R (t) = \sum_{t = 0}^{K} γ^{t} r (t),$

(24)

where $γ \in [0, 1]$ is the discount factor, which determines how strongly future rewards are weighted relative to the immediate reward.
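As a small worked example of Equation (24), the sketch below computes the discounted cumulative reward for a short reward sequence. The reward values and the discount factor are hypothetical and chosen so that the result can be verified by hand.

```python
def discounted_return(rewards, gamma):
    """Eq. (24): sum over t of gamma**t * r(t), with t starting at 0."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rewards = [1.0, 2.0, 4.0]  # hypothetical r(0), r(1), r(2)
gamma = 0.5

R = discounted_return(rewards, gamma)  # 1*1 + 0.5*2 + 0.25*4
print(R)
```

With a discount factor of 0 only the first reward counts, while a value of 1 reduces the expression to a plain sum.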

Through the above process, we model the transformed problem into an MDP. As such, we can maximize the network computation rate by maximizing the long-term accumulative reward while satisfying all constraints, during which an effective policy can be obtained.

4.4. Lyapunov-Guided A3C Algorithm

With the formulated MDP model, we further propose the LyA3C algorithm based on the A3C algorithm. The structure of our proposed algorithm is depicted in Figure 2. There are a global network and several worker agents, where the global network and the worker agents share the same actor-critic structure.

Taking one of the worker agents as an example, it includes two modules: the actor module and the critic module. The two modules accept the state and output the state value as well as the corresponding policy. Specifically,

a (t) = π (s (t) |θ)

denotes the policy learned from the current state, where

π (s (t) |θ)

is the explored offloading policy. The network parameter of actor network is denoted as

θ

. The gradient of the expected cumulative discounted reward is

\nabla_{θ} E_{π} [\sum_{t = 0}^{\infty} γ^{t} r (t)] = E_{π} [\nabla_{θ} \log π (s (t) |θ) A (s, a)],

(25)

where

A (s, a)

is the advantage function.

A (s, a)

measures the advantage, in state

s (t)

, of selecting action

a (t)

under the policy

π

over the expected value of that state. Under the current policy, the value of

A (s, a)

is obtained from the value

V (s (t))

and is calculated as

A (s, a) = r (t) + γ V (s (t + 1)) - V (s (t)),

(26)

where

γ

is discount factor.

The parameter

θ

is updated based on

θ = θ + α \sum_{t} \nabla_{θ} log π (s (t) |θ) A (s, a),

(27)

where

α

is the learning rate of the actor network.

In contrast, the parameter of the critic network

φ

is updated as

φ = φ - χ \sum_{t} \nabla_{φ} {(r (t) + γ V (s (t + 1)) - V (s (t)))}^{2},

(28)

where

χ

is the learning rate of the critic network.
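The one-step advantage in Equation (26) and the critic update can be illustrated without a neural network. The sketch below assumes a deliberately simple critic, V(s) = w * s with a scalar state, and performs gradient descent on the squared temporal-difference error while treating the bootstrapped target as a constant. The linear critic, the state values, and the learning rate are assumptions made for illustration and are not the network used in the paper.

```python
def advantage(r, v_s, v_next, gamma):
    """Eq. (26): one-step advantage estimate r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s


def critic_step(w, s, r, s_next, gamma, lr):
    """One descent step on the squared TD error for the toy critic V(s) = w * s."""
    td_error = advantage(r, w * s, w * s_next, gamma)
    grad = -2.0 * td_error * s  # derivative of td_error**2 w.r.t. w, target held fixed
    return w - lr * grad


# Hypothetical transition: state 1.0, reward 1.0, next state 0.0 (so V(s') = 0).
w, s, r, s_next = 0.0, 1.0, 1.0, 0.0
gamma, lr = 0.9, 0.1

errors = []
for _ in range(3):
    errors.append(abs(advantage(r, w * s, w * s_next, gamma)))
    w = critic_step(w, s, r, s_next, gamma, lr)

print(w, errors)
```

Each step moves the value estimate toward the observed target, so the magnitude of the advantage shrinks. In the actor update of Equation (27), the same advantage scales the log-probability gradient, so actions that did better than the critic expected become more likely.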

Through the above process, each worker agent will send the accumulated updates to the global network, then the global network asynchronously updates

θ

and

φ

.

To sum up, the overall flowchart of the LyA3C algorithm is shown in Figure 3.

The algorithm is summarized as Algorithm 1.

Algorithm 1 The LyA3C Algorithm

1: Initialize the number of industrial terminals N, the number of industrial base stations M, and the Lyapunov drift-plus-penalty parameter V;
2: Initialize the A3C network with actor $π (s)$ and critic $V (s)$;
3: repeat
4:   Collect a set of trajectories { $s (t)$, $a (t)$, $r (t)$, $s (t + 1)$ };
5:   for each time slot $t = 1$ to K do
6:     Actor step: sample action $a (t) \sim π (s (t))$;
7:     Critic step: compute value estimate $V (s (t))$;
8:     Compute the advantage function using Equation (26);
9:     Each worker agent updates $θ$ and $φ$ using Equations (27) and (28);
10:    The global network updates $θ$ and $φ$;
11:  end for
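The interplay between the worker updates and the global update in steps 9 and 10 can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions: a real A3C implementation runs the workers in parallel threads or processes, and the quadratic objectives below are hypothetical stand-ins for the actor and critic losses. In each round all workers copy the global parameters and then push their gradients one after another, so later pushes are computed on parameters that are already stale.

```python
def simulate_async_updates(global_params, grad_fns, lr, rounds):
    """Simulate workers that pull shared parameters and push gradients to them."""
    for _ in range(rounds):
        # Pull: each worker takes a snapshot of the current global parameters.
        snapshots = [list(global_params) for _ in grad_fns]
        # Push: workers apply their gradients in turn. Each gradient was computed
        # on the snapshot, which may no longer match the global parameters.
        for grad_fn, local in zip(grad_fns, snapshots):
            grad = grad_fn(local)
            for i, g in enumerate(grad):
                global_params[i] -= lr * g
    return global_params


def make_quadratic_grad(target):
    """Gradient of the toy loss (p - target)**2 for every parameter p (hypothetical)."""
    return lambda params: [2.0 * (p - target) for p in params]


# Three hypothetical workers whose local objectives pull toward 2, 3, and 4.
grad_fns = [make_quadratic_grad(t) for t in (2.0, 3.0, 4.0)]
params = simulate_async_updates([0.0], grad_fns, lr=0.05, rounds=50)
print(params)
```

Despite the stale gradients, the shared parameter settles near 3, the minimizer of the summed toy objectives, which mirrors how the global network aggregates the experience of several workers.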

4.5. Complexity Analysis

The complexity analysis of the proposed LyA3C algorithm is based on the structure of the neural networks used for the critic network and the actor network. The critic network contains the input layer $K_{i}$, the hidden layer $K_{h}$, and the output layer $K_{o}$. The complexity of the critic network is calculated as $O (K_{i} K_{h} + K_{i} K_{o})$. Similarly, the actor network also contains the input layer $K_{i}$, the hidden layer $K_{h}$, and the output layer $K_{o}$. The complexity of the actor network is calculated as $O (K_{i} K_{h} + K_{i})$. For each agent, the overall complexity is the sum of the complexity of the critic network and the actor network: $O (2 K_{i} K_{h} + K_{i} K_{o} + K_{i}) = O (K_{i} (2 K_{h} + K_{o} + 1))$.
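As a quick arithmetic check, the expressions above can be evaluated for concrete layer sizes. The sketch below reads $K_{i}$, $K_{h}$, and $K_{o}$ as layer widths (an assumption) and uses the terms exactly as stated in this section; the sizes themselves are hypothetical and are not taken from the experiments.

```python
def critic_cost(k_i, k_h, k_o):
    """Operation count of the critic as stated in this section: K_i*K_h + K_i*K_o."""
    return k_i * k_h + k_i * k_o


def actor_cost(k_i, k_h):
    """Operation count of the actor as stated in this section: K_i*K_h + K_i."""
    return k_i * k_h + k_i


def agent_cost(k_i, k_h, k_o):
    """Per-agent total: sum of the critic and actor terms."""
    return critic_cost(k_i, k_h, k_o) + actor_cost(k_i, k_h)


# Hypothetical layer widths, chosen only for illustration.
k_i, k_h, k_o = 20, 128, 1
total = agent_cost(k_i, k_h, k_o)
print(total, k_i * (2 * k_h + k_o + 1))
```

Both printed numbers agree, confirming the factorization $K_{i} (2 K_{h} + K_{o} + 1)$; the count is dominated by the $2 K_{i} K_{h}$ term whenever the hidden layer is much wider than the output layer.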

5. Experimental Results and Analysis

5.1. Experimental Setup

In this section, we evaluate the performance of the proposed algorithm. All experiments are run on a platform equipped with an Intel Core i7-11700 2.5 GHz CPU and PyTorch. We choose two benchmark DRL algorithms for comparison: an A2C-based algorithm and a TD3-based algorithm.

During the experiments, it is assumed that the channel gain $h_{n, m}$ follows the path loss model $\bar{h}_{n, m} = A_{d} {(\frac{3 \times 10^{8}}{4 π f_{c} d_{n, m}})}^{d_{e}}$, where $d_{n, m}$ denotes the distance in meters between the n-th industrial terminal and the m-th industrial base station. The noise power is $N_{0} = W υ_{0}$. The task arrivals of all industrial terminals obey an exponential distribution with equal average rates and a finite second moment $E [{(A_{n})}^{2}] = η_{n} < \infty, n = 1, \dots, N$. The number of industrial base stations is set to 3 and the number of industrial terminals is set to 10. All industrial terminals are identical, with the parameter values listed in Table 1.
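For orientation, the path loss model and the noise power can be evaluated numerically with the values listed in Table 1. The sketch below treats the antenna gain of 3 as a linear factor and uses a distance of 100 m; both are assumptions made only for illustration, and the function names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 3e8  # m/s, the constant appearing in the path loss model


def average_channel_gain(distance_m, antenna_gain=3.0, carrier_hz=915e6,
                         path_loss_exp=3.0):
    """Average channel gain A_d * (3e8 / (4 * pi * f_c * d)) ** d_e."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    ratio = SPEED_OF_LIGHT / (4.0 * math.pi * carrier_hz * distance_m)
    return antenna_gain * ratio ** path_loss_exp


def noise_power_watts(bandwidth_hz=2e6, psd_dbm_per_hz=-174.0):
    """Noise power N_0 = W * v_0, with v_0 converted from dBm/Hz to W/Hz."""
    psd_watts_per_hz = 10.0 ** ((psd_dbm_per_hz - 30.0) / 10.0)
    return bandwidth_hz * psd_watts_per_hz


gain = average_channel_gain(100.0)   # hypothetical 100 m link
noise = noise_power_watts()
print(gain, noise)
```

With these inputs the gain is on the order of $10^{-11}$ and the noise power on the order of $10^{-14}$ W, and doubling the distance reduces the gain by a factor of $2^{d_{e}} = 8$.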

5.2. Experiments and Analysis

Figure 4 compares the network computation rates of the algorithms under identical model conditions. The network computation rates of all algorithms oscillate within a bounded range, indicating that each algorithm achieves a relatively stable rate over time. Specifically, the TD3-based and A2C-based algorithms record lower long-term network computation rates. Although the curves of the LyA3C and A2C-based algorithms intersect briefly, the LyA3C algorithm ultimately attains the highest network computation rate. This advantage is attributed to the multiple parallel agents deployed by the LyA3C algorithm, which enhance the network computation rate during operation. As time progresses, the fluctuation range narrows and the rate converges steadily toward the optimal network computation rate.

In Figure 5, Figure 6 and Figure 7, we evaluate the performance of the three algorithms in terms of network computation rate, average data queue length, and average energy consumption under different data arrival rates. The reported values are averages over multiple experiments in which we fixed the other parameter settings and gradually increased the data arrival rate from 0 to 5 Mbps. LyA3C performs the best in all three experiments.

Figure 5 shows that the network computation rate exhibits an upward trend as the data arrival rate $η_{n}$ increases. This is because when the data arrival rate is low, the queue is stable in the long term, and the network computation rate is directly proportional to $η_{n}$. However, when the data arrival rate is very high, conflicts among resources increase, causing the network computation rate to stop increasing or even decrease.

Figure 6 shows that the average data queue length tends to rise with the increase in $η_{n}$. This is because a large amount of data is constantly entering the queue. Although an increase in the network computation rate can accelerate the data leaving the queue, the amount of data leaving the queue is still less than the amount of data arriving. Consequently, the average data queue length grows until it reaches maximum capacity.
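The queue behavior described above can be reproduced with a toy simulation. The sketch assumes the generic single-queue recursion $Q (t + 1) = \max (Q (t) - D (t), 0) + A (t)$ with exponentially distributed arrivals and a constant per-slot service amount; the paper's exact queue update, its capacity limit, and all numbers below are not reproduced here and should be read as hypothetical.

```python
import random


def average_backlog(arrival_mean, service_per_slot, slots=2000, seed=0):
    """Time-averaged backlog of one data queue with exponential arrivals."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    backlog = 0.0
    total = 0.0
    for _ in range(slots):
        arrivals = rng.expovariate(1.0 / arrival_mean)
        # Serve up to service_per_slot units, then admit the new arrivals.
        backlog = max(backlog - service_per_slot, 0.0) + arrivals
        total += backlog
    return total / slots


# Hypothetical loads: mean arrivals below and above the service amount.
light = average_backlog(arrival_mean=1.0, service_per_slot=2.0)
heavy = average_backlog(arrival_mean=3.0, service_per_slot=2.0)
print(light, heavy)
```

When the mean arrival stays below the service amount, the backlog remains short; once arrivals exceed what can be served, the backlog grows roughly linearly over time, which is the regime in which a finite queue would fill to capacity.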

In Figure 7, the average energy consumption is positively correlated with the data arrival rate. Hence the average energy consumption gradually increases as $η_{n}$ grows, leveling off when $η_{n}$ is relatively large.

Figure 8 presents the trend of network computation rates over time under various energy consumption threshold constraints $ξ_{n}$ for all industrial terminals. The larger the value of $ξ_{n}$, the higher the network computation rate. The primary reason is that as $ξ_{n}$ increases, the energy available for computation also increases, which in turn enhances the network computation rate and ensures the long-term stability of the system. Specifically, when $ξ_{n} = 2$, the network computation rate gradually increases with time and ultimately converges to a higher value. When $ξ_{n} = 1.5$ and $ξ_{n} = 1$, the network computation rate also increases, but it eventually falls short of the value observed when $ξ_{n} = 2$.

Figure 9 evaluates the changes in task queue length of the three algorithms with different energy thresholds. The figure shows that the task queue of the three algorithms decreases with the increase of the energy consumption threshold. This is because when the energy consumption threshold increases, the energy available to the industrial terminal for computation offloading increases, more tasks are processed on time, and the length of the task queue decreases. When the energy consumption threshold reaches a specific value, the industrial terminal is no longer affected by energy consumption, tasks can be freely scheduled, and the task queue length converges to a specific value. When the energy consumption threshold is small, the LyA3C algorithm is affected by the energy consumption threshold, and the length of the task queue is relatively long. As the energy consumption threshold increases, the number of tasks in the queue continues to decrease, reaching the optimum among the three algorithms, and eventually reaching stability.

6. Conclusions

In this paper, we proposed an intelligent computation offloading algorithm based on Lyapunov optimization and DRL. With full consideration of the offloading time, data causality, local CPU frequency, energy consumption, transmit power, and data queue stability, the long-term network computation rate maximization problem was formulated. To solve this problem, we designed the data queue and virtual energy queue backlog, derived the upper bound of the Lyapunov drift based on the Lyapunov optimization theory, and decoupled the complex MINLP problem into deterministic sub-problems. Then, we reformulated the transformed problem as an MDP and approximated the optimal solution using the proposed LyA3C algorithm. Experimental results showed that, compared with the A2C-based and TD3-based benchmark algorithms, the proposed LyA3C algorithm converged stably and improved the long-term network computation rate by 2.8% and 5.7%, respectively, while satisfying various constraints.

In the future, in addition to the binary offloading adopted in this paper, we will consider dividing a computation task into multiple independent sub-tasks for multi-destination computation offloading. Moreover, whereas this paper assumes that all industrial base stations are homogeneous and provide the same computation resource, i.e., CPU, we will consider the case in which different industrial base stations provide different kinds of computation resources, including CPU, GPU, and NPU.

Author Contributions

Conceptualization, C.X. (Chi Xu), J.J. and X.F.; Formal analysis, C.X. (Chi Xu) and X.F.; Methodology, C.X. (Chi Xu) and X.F.; Project administration, C.X. (Chi Xu), C.X. (Changqing Xia) and X.J.; Resources, C.X. (Chi Xu), C.X. (Changqing Xia) and X.J.; Validation, X.F.; Writing – original draft preparation, X.F.; Writing – review and editing, C.X. (Chi Xu) and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 92267108, Grant 62173322, Grant 62133014, Grant 61821005 and Grant 92267205; in part by the Science and Technology Program of Liaoning Province under Grant 2023JH3/10200004, Grant 2022JH25/10100005, and Grant 2023JH3/10200006; in part by the Youth Innovation Promotion Association CAS under Grant Y2021062, and Grant 2020207; in part by the Independent Subject of the State Key Laboratory of Robotics under Grant 2024-Z12.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. System model.

Figure 2. Algorithm structure.

Figure 3. Flowchart of the LyA3C algorithm.

Figure 4. Network computation rate for different algorithms.

Figure 5. Relationship between data arrival rate and network computation rate for different algorithms.

Figure 6. Relationship between data arrival rate and average data queue length for different algorithms.

Figure 7. Relationship between data arrival rate and average energy consumption for different algorithms.

Figure 8. Effect of different energy consumption thresholds on network computation rate.

Figure 9. Relationship between energy consumption thresholds and data queue length for different algorithms.

Table 1. Simulation Parameters.

Parameter	Symbol	Value
Antenna gain	$A_{d}$	3
Carrier frequency	$f_{c}$	915 MHz
Path loss exponent	$d_{e}$	3
Noise power spectral density	$υ_{0}$	−174 dBm/Hz
Bandwidth	$W$	2 MHz
Maximum local CPU frequency	$f_{n}^{max}$	0.3 GHz
Maximum transmit power	$P_{n}^{max}$	0.1 W
Communication overhead	$υ_{u}$	1.1
Computation energy efficiency	$κ$	$10^{- 26}$
Computation cycles per bit	$ϕ$	100
Memory capacity	$q$	1024
Energy consumption threshold	$ξ_{n}$	0.04 W
Lyapunov control parameter	$V$	20
Data arrival rate	$λ_{n}$	2.7 Mbps

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
