A Learning Game-Based Approach to Task-Dependent Edge Resource Allocation



Introduction
With the rapid advancement of hardware and software technologies, there is an increasing demand for computationally intensive real-time applications, such as facial recognition, virtual reality, and augmented reality [1][2][3]. These applications often require lower response times to meet user experience expectations. However, due to the limitations of device hardware and available resources, the devices cannot complete task computations in the required time [4][5][6]. Cloud computing is one of the candidate technologies to solve this problem, but the distance between the cloud and the devices is considerable. Transferring a large number of tasks to the cloud imposes a significant communication burden on the entire network. Edge computing (EC) deploys resources closer to devices, thereby enabling devices to offload tasks to a nearby edge server (ES) with less overhead, thus further reducing the time used for computational tasks [7].
Although edge servers (ESs) can provide the necessary resources for devices, the behavior of ESs and the user in practice is usually driven by interests [8,9]. In order to offset operational costs, ESs charge a certain fee for processing tasks, while the user pays for the resources it uses in order to obtain a satisfactory quality of experience (QoE). On the other hand, dependent tasks differ from tasks that can be arbitrarily divided (such as atomic tasks). For a dependent task, the scheduling order of its sub-tasks and the amount of resources allocated to each sub-task are important factors that affect its execution time [10,11]. Therefore, motivating ESs to engage in task computation and allocating resources and tasks effectively are key issues in edge computing (EC).
Game theory is a mathematical apparatus that studies the decision-making process of players in a competitive environment. It explores the interactions between participants, as well as their selection of strategies and expected outcomes [12][13][14]. Drawing on these strengths, numerous studies in recent years have applied game theory to edge resource allocation strategies. For example, Roostaei et al. [15] employed a Stackelberg game for the dynamic allocation and pricing of communication and computational resources in networks, in which they proposed a joint optimal pricing and resource allocation algorithm based on game theory. Chen et al. [16] investigated multiple resource allocation and pricing issues by modeling the problem as a Stackelberg game. Kumar et al. [17] introduced a game model for edge resource allocation with the aim of maximizing the utilization of computational resources. However, these studies all rest on the strong assumption that the game participants share their information: the user must disclose its resource preference parameters, and the ESs must reveal their cost parameters. Yet game participants are rational, and they are typically reluctant to disclose such private information.
This paper presents a two-stage resource allocation method designed for task-dependent scenarios. In the first stage, an incentive mechanism is proposed and modeled as a multivariate Stackelberg game. We then analyze the uniqueness of the Stackelberg equilibrium (SE) in an information sharing scenario and design an iterative optimization algorithm that can approximate the SE solution. The research is then extended to non-information-sharing scenarios, where the game problem is formulated as a partially observable Markov decision process (POMDP). A reinforcement learning algorithm based on a learning game is then proposed, which can learn from the participants' historical decisions while preserving the privacy of the game participants. The second stage aims to allocate the resources purchased by users under the incentive mechanism in a rational manner. A greedy-based deep reinforcement learning algorithm is designed to minimize task execution time. Specifically, edge server resources are allocated using a greedy method, and a sequence-to-sequence (S2S) neural network-based reinforcement learning algorithm is employed to obtain optimal task allocation decisions. The S2S neural network is a deep learning model implemented using multiple layers of a recurrent neural network. It is capable of transforming an input sequence into a corresponding output sequence. These two approaches are then combined to minimize task execution time. The main contributions of this paper are as follows.
• We propose a two-stage resource allocation method in the context of dependent tasks.
• In the first stage, we model the problem of incentivizing users to request resources from edge servers as a multivariate Stackelberg game. We analyze the uniqueness of the SE under the scenario of information sharing. Furthermore, we investigate the incentive problem in the absence of information sharing, and we transform it into a partially observable Markov decision process for multiple agents. To solve the SE in this situation, we design a learning-based game-theoretic reinforcement learning algorithm.
• In the second stage, to allocate resources effectively, we design a greedy-based deep reinforcement learning algorithm to minimize the task execution time.
• Through experimental simulation, it is demonstrated that the reinforcement learning algorithm proposed in this paper, which is based on learning games, can achieve the SE in scenarios without information disclosure, and that it outperforms the conventional A2C algorithm. The reinforcement learning algorithm, grounded in the principle of greediness, can significantly reduce the execution time of tasks.
The rest of the paper is structured as follows. Section 2 describes related works. Section 3 presents our system model. Section 4 discusses the design of incentives in information sharing scenarios. Section 5 presents our game-theoretic learning algorithm in the context of non-information sharing. Section 6 details how greedy methods are used in the design of reinforcement learning algorithms. Section 7 evaluates the performance of the proposed algorithm through simulations. Finally, in Section 8, we provide a comprehensive summary of the article.

Related Work
Existing research on the problem of task offloading and resource allocation falls into two main categories. The first assumes that ESs can provide resources at no cost [18][19][20][21][22][23]. The other assumes that ESs provide resources under some form of incentive mechanism [24][25][26][27][28][29][30][31][32][33]. Zhang et al. [18] performed a joint optimization of channel allocation and dependent task offloading, and designed an algorithm based on genetic algorithms and deep deterministic policy gradients. Liang et al. [19] studied the problem of optimal offloading and optimal allocation of computational resources under dependent task constraints and proposed a heuristic algorithm to solve the problem. Xiao et al. [20] optimized the task offloading strategy, communication resources, and computing resources under the constraints of task processing delay and device energy consumption, and they went on to analyze the optimal solution from the perspectives of slow fading channels and fast fading channels. Jiang et al. [21] proposed an online framework for task offloading and resource allocation issues in edge computing environments. Chen et al. [22] transformed the optimal resource allocation problem into an integer linear programming problem, and they proposed a distributed algorithm based on Markov approximation to achieve an approximately optimal solution within polynomial-time complexity. Chen et al. [23] described the caching-assisted computation offloading process, which is characterized as a problem of maximizing utility while considering the quality of experience (QoE). In addition, the problem was decomposed into two sub-problems that were solved separately.
The aforementioned literature implicitly assumes that ESs can provide resources freely. However, in reality, ESs are driven by profit. Therefore, it is necessary to consider how to incentivize ESs. Based on whether the edge servers have sufficient resources, the study of incentive mechanisms can be divided into two aspects: First, when ESs lack adequate resources, the energy consumption of ESs acts as a counter incentive for users [24,25]. Second, when the ESs have sufficient resources, the energy consumption of ESs becomes an incentive for users [26][27][28][29]. Avgeris et al. [24] studied the offloading problem involving multiple users and multiple edge nodes, where the energy consumption of the ESs was considered a constraining factor in finding the optimal solution. Chen et al. [25] investigated the task offloading and collaborative computing problem in mobile edge computing (MEC) networks, and they proposed a two-level incentive mechanism based on bargaining games. Additionally, to address the issue of edge node overload, the energy consumption of edge nodes was treated as a disincentive in the second stage of problem analysis. The aforementioned paper considered the possibility of ES overload, but its research problem and scenario differ from the context of this paper.
In the second case, the energy consumption of ESs serves as a positive incentive. Liu et al. [26] studied resource allocation and pricing problems in an EC system, then formulated them as mixed-integer linear programming and proved the problem to be NP-hard. To solve this problem, auction-based mechanisms and linear programming-based approximation mechanisms were proposed and developed. Tao et al. [27] studied the server resource pricing and task offloading problem by establishing a Stackelberg game model, and used a differential algorithm to solve for the optimal solution. Seo et al. [28] aimed to increase the utilization rate of ESs' computing resources, and they formulated the optimization problem as a Stackelberg game. Supervised learning was used to obtain equilibrium strategies. However, the Stackelberg games used in [27,28] were based on a single variable and did not consider the dependencies between tasks. Kang et al. [29] introduced an auction mechanism to encourage ESs to offer services for tasks with dependencies, and designed an algorithm based on multiple rounds of truthful combinatorial reverse auctions to solve the problem of maximizing social welfare. However, the multi-round auction process in this algorithm may lead to high waiting latency.
The algorithmic designs in the aforementioned literature are all centralized. However, in reality, much of the information from users and ESs is difficult to control globally. In recent years, some studies have also explored the design of incentive mechanisms in distributed systems. Bahreini et al. [30] proposed a learning-based distributed resource coordination framework that transforms the computation offloading and resource allocation problems into dual timescale problems and then solves them using game theory and distributed reinforcement learning algorithms. Liu et al. [31] described the computational offloading mechanism with resource allocation in EC networks as a stochastic game, and they designed a Q-learning algorithm to achieve the NE. Li et al. [32] jointly optimized offloading decisions and resource pricing, and designed a learning game method to obtain optimal decisions. Although these methods are distributed, the aforementioned studies did not consider the dependencies of tasks. Song et al. [33] proposed a multi-objective offloading optimization algorithm based on reinforcement learning, which was aimed at minimizing execution time, energy consumption, and the cost of dependent task offloading. However, that work did not examine the cost aspect in depth. This paper investigates the problem of incentive mechanisms in dependent task scenarios and proposes two distributed reinforcement learning algorithms to find optimal solutions.

Preliminary Technology
Reinforcement learning (RL) is an approach used to tackle problems involving uncertainty and decision making. In RL, an agent generates corresponding actions by observing the state of the environment. After executing the action, the agent receives feedback from the environment in the form of reward signals. Subsequently, the agent continually adjusts its strategy based on this feedback so as to maximize cumulative rewards.
The Markov decision process (MDP) is a formal model of the interaction between an agent and the environment in RL. An MDP assumes that the agent can fully observe the state of the environment, and that the current system state can be described by a state variable. The transitions of the system states satisfy the Markov property, meaning that future states depend only on the current state and are independent of past states. By defining the available actions, the transition probabilities between states, and the reward function associated with each state-action pair, the optimal decision policy can be found for a given MDP. However, in real-world scenarios, we often encounter decision problems with incomplete observations, where the complete state of the system cannot be directly observed. To address this situation, the Markov decision process is extended to a partially observable Markov decision process (POMDP). In a POMDP, the agent can only infer the system's state from partially observed information, and it uses the inferred state to interact with the environment and obtain maximum rewards.

System Model
As shown in Figure 1, we investigate an EC system composed of a single user and multiple ESs. In this system, the set of ESs is defined as $\mathcal{M} = \{1, 2, \ldots, m, \ldots, M\}$, where each ES is equipped with an access point (AP). The user can select a designated AP to upload its dependent task to the corresponding ES. Moreover, these ESs are interconnected via wired connections. At a specific time slot, the user generates a dependent task. Given the limited resources and energy of the user's device, it is infeasible to complete the task execution within the desired time and energy consumption ranges. Consequently, the user offloads this dependent task to a proxy node. This proxy node then assigns the dependent task to other ESs that are incentivized to participate in task computation. Notably, this proxy node could be the ES closest to the user.
The working architecture of the system is shown in Figure 2, in which the user and the ESs construct an incentive mechanism through the Stackelberg game. Within this mechanism, the user takes the role of leader and proposes an incentive strategy, which is designed to encourage the ESs to participate in task computation by offering a reward. The proxy node publicizes the user's incentive strategy, and, if an ES (follower) responds to the user's incentive strategy, it informs the proxy node of the amount of resources it can contribute. The proxy node then informs the user of the ES's response strategy. Next, the user offloads the dependent task to the proxy node, which, by assuming a greedy strategy, calculates that each ES needs to compute $i \in [C/N, C]$ sub-tasks, where $C$ is the number of sub-tasks in the dependent task and $N$ is the number of ESs encouraged to participate in task computation. The proxy node iterates through the values of $i$ to distribute resources. Following this, based on each resource allocation strategy, a task distribution plan is obtained using reinforcement learning based on S2S neural networks. Upon completion of the task, the proxy node returns the results to the user. If the user verifies the results as correct, the reward is given to the proxy node, which then distributes the reward according to each ES's ratio of resource contribution. This process can be repeated if extended to a multi-user scenario.
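The final step of the workflow, in which the proxy node splits the user's reward according to each ES's share of the contributed resources, can be sketched in a few lines. This is a minimal illustration only: the function name `split_reward` and the dictionary layout are hypothetical and do not come from the paper.

```python
def split_reward(total_reward, contributions):
    """Split a reward among ESs in proportion to the resources they contributed.

    total_reward:  reward paid by the user to the proxy node
    contributions: {es_id: amount of resource contributed} (hypothetical layout)
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("at least one ES must contribute resources")
    return {es: total_reward * amount / total
            for es, amount in contributions.items()}


# Example: ES "es1" contributed twice as much as "es2".
shares = split_reward(90.0, {"es1": 2.0, "es2": 1.0})
```

Each ES receives its contribution ratio times the total reward, so the shares always sum to the reward paid by the user.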

Local Computation
Although our task offloading strategy offloads the dependent task entirely to the ESs, the time the dependent task takes to execute locally still serves as a baseline for comparison and therefore needs to be modeled. For the task model, we follow most of the existing literature [18][19][20][21] and model a dependent task as a directed acyclic graph (DAG) $G = \{V, E\}$, where $V = \{v_n \mid n = 1, 2, \ldots, N\}$ denotes the set of sub-tasks of the current dependent task. Each sub-task $v_n$ is represented by a two-tuple, i.e., $v_n = \{d_n, r_n\}$, where $d_n$ denotes the data size of the sub-task and $r_n$ denotes the size of its computation result. $E = \{e(v_i, v_n) \mid i, n \in \{1, 2, \ldots, N\}\}$ denotes the dependency relationships between the sub-tasks: when $v_i$ depends on $v_n$, $v_i$ can start its computation only after $v_n$ finishes and the result of the computation is transmitted to $v_i$.
Let $f_l$ denote the computing capacity of the local device and $\eta_u$ the computing capacity (CPU cycles) required to process one bit of data. The number of CPU cycles required to complete the sub-task locally is then $c_{n,loc} = d_n \eta_u$. Therefore, the time $T^n_{loc}$ required to execute $v_n$ locally can be written as follows:

$$T^n_{loc} = \frac{c_{n,loc}}{f_l}.$$

Then, the local completion time of $v_n$ is the following:

$$FT^n_{loc} = \max\Big\{AT^n_{loc},\ \max_{v_i \in pre(v_n)} FT^i_{loc}\Big\} + T^n_{loc},$$

where $AT^n_{loc} = \max\{AT^{n-1}_{loc}, FT^{n-1}_{loc}\}$ is the earliest available time of the local processor and $pre(v_n)$ is the set of direct predecessor tasks of $v_n$.
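The local baseline can be illustrated with a short sketch that walks the DAG in topological order on a single local processor. The recurrence used here (a sub-task starts once the processor is free and all direct predecessors have finished) is an assumption based on the definitions of $AT^n_{loc}$ and $pre(v_n)$ above; the function and argument names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def local_finish_times(data_bits, preds, eta_u, f_l):
    """Finish time of every sub-task when the whole DAG runs on the local CPU.

    data_bits: {task: d_n, input size in bits}
    preds:     {task: set of direct predecessors}, i.e. pre(v_n)
    eta_u:     CPU cycles needed per bit
    f_l:       local computing capacity in cycles per second
    """
    finish = {}
    cpu_free = 0.0  # earliest time the single local processor is available
    for task in TopologicalSorter(preds).static_order():  # predecessors first
        exec_time = data_bits[task] * eta_u / f_l  # T_loc = c_loc / f_l
        ready = max((finish[p] for p in preds.get(task, ())), default=0.0)
        finish[task] = max(cpu_free, ready) + exec_time
        cpu_free = finish[task]
    return finish


# Toy DAG: b and c both depend on a.
sizes = {"a": 100, "b": 200, "c": 100}
deps = {"a": set(), "b": {"a"}, "c": {"a"}}
times = local_finish_times(sizes, deps, eta_u=10, f_l=1000)
```

Because there is only one local processor, the makespan equals the sum of the execution times regardless of which valid topological order is chosen.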

Edge Computation
When a dependent task is offloaded to a proxy node, the proxy node uploads the sub-tasks of the dependent task to other ESs to achieve collaborative execution and to fully utilize the resources purchased by the user. Consider, as an example, a sub-task $v_n$ of a dependent task held by ES $j$ whose target server is ES $m$. The execution of this sub-task can be divided into three stages: the transmission phase, the execution phase, and the result feedback phase. Denoting the completion time of sending as $FT^n_{up}$ and the transmission time as $T^n_{up}$, we have where is a direct predecessor of $v_n$ and $v_n$ when they are not executing on ES $m$, $FT^n_{down}$ is the result backhaul time of the parent task of $v_n$, and $r_j$ is the size of the communication resources allocated for $v_n$.
In the execution phase, we let $FT^n_m$, $T^n_m$, $f^n_m$, and $\eta_m$ represent the completion time of $v_n$ on ES $m$, the computation time, the computational resources allocated by ES $m$ for $v_n$, and the computing capacity required by ES $m$ to process one bit of data, respectively. Then, the computation completion time of the task is written as follows: where is a direct predecessor of $v_n$ that executes on ES $m$, and $c_{n,m} = d_n \eta_m$ is the number of CPU cycles required by ES $m$ to process $v_n$.
We let $T^n_{down}$ represent the return time of the result after $v_n$ is computed on ES $m$. Then, the result return completion time $FT^n_{down}$ can be written as follows: where $AT^n_{down} = \max\{AT^{n-1}_{down}, FT^{n-1}_{down}\}$ is the earliest available time for the downlink and $r^n_m$ is the communication resource allocated to $v_n$ by ES $m$ for returning the result.
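The three phases can be sketched for a single sub-task as follows. This is an illustration under stated assumptions, not the paper's exact equations: each phase is assumed to take size divided by rate, and each phase is assumed to start once the previous phase has finished and the corresponding resource (uplink, ES processor, downlink) is free. All names are hypothetical.

```python
def edge_finish_times(d_n, result_bits, eta_m, f_nm, r_up, r_down,
                      ready=0.0, up_free=0.0, cpu_free=0.0, down_free=0.0):
    """Return (FT_up, FT_m, FT_down) for one sub-task offloaded to an ES.

    d_n, result_bits: input size and result size in bits
    eta_m:            CPU cycles per bit on the target ES
    f_nm:             computational resource allocated to the sub-task (cycles/s)
    r_up, r_down:     communication rates for upload and result return (bits/s)
    ready:            time at which all predecessor results are available
    *_free:           earliest available time of each resource (assumed inputs)
    """
    t_up = d_n / r_up                        # transmission phase
    ft_up = max(up_free, ready) + t_up
    t_exec = d_n * eta_m / f_nm              # execution phase, c_{n,m} / f
    ft_exec = max(cpu_free, ft_up) + t_exec
    t_down = result_bits / r_down            # result feedback phase
    ft_down = max(down_free, ft_exec) + t_down
    return ft_up, ft_exec, ft_down


phases = edge_finish_times(d_n=1000, result_bits=100, eta_m=10,
                           f_nm=5000, r_up=500, r_down=100)
```

Passing non-zero `*_free` values models contention from sub-tasks scheduled earlier on the same link or server.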

Incentives under Information Sharing Conditions
In order to incentivize the participation of ESs in task computation, this section establishes an incentive mechanism based on the Stackelberg game, wherein the user is the leader and the ESs are the followers.

Participant Utility Functions
This subsection first discusses the utility function of the ESs (the followers). We let $f_m$ denote the computational resources contributed by ES $m$ and $R_1$ the funds spent by the user on computing the current dependent task. Given that ESs incur additional energy consumption when processing external tasks, this extra energy consumption is treated as the inconvenience caused by handling these tasks. Therefore, the utility of each ES is the reward gained from contributing resources minus its internal inconvenience [34]. Consequently, the utility of ES $m$ from selling computational resources is as follows: where $\alpha_m$ is the unit cost of computing energy consumption, $D$ is the average data size of each sub-task, and $\kappa_m$ is the effective switching capacitance.
For communication resource $r_m$, this paper uses communication speed as a measure. If the user purchases communication resources with amount $R_2$, then the utility of ES $m$ from selling communication resources is as follows: where $\beta_m$ is the unit transmission rate cost. From this, the total utility of ES $m$ can be obtained as follows:
When designing the utility function for the user, two factors need to be considered: price satisfaction and resource acquisition satisfaction. These two aspects conflict with each other, and the satisfaction gained from resource acquisition follows a law of diminishing returns. This principle can be represented by a continuously differentiable, concave, strictly increasing function [35]. Therefore, the user's utility can be modeled as follows: where $\delta$ is the compromise factor.
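The follower utilities referred to as Equations (9) to (11) can be written out in a form that is consistent with the rest of the analysis. The block below is a reconstruction under assumptions, not a quotation: it assumes that each reward is shared in proportion to the contributed resource (matching the reward distribution rule in the system model), that the computing cost is quadratic in $f_m$ (matching $\gamma_m = \alpha_m D \eta_m \kappa_m$ and the factor 2 that appears after differentiation), and that the communication cost is linear in $r_m$ (matching the utility $R_2 - \bar{r}_k \beta_k$ of a single participating ES in the proof of Theorem 1).

```latex
\begin{align}
u_{1,m} &= \frac{f_m}{\sum_{i=1}^{M} f_i}\, R_1 - \alpha_m D \eta_m \kappa_m f_m^{2}, \\
u_{2,m} &= \frac{r_m}{\sum_{i=1}^{M} r_i}\, R_2 - \beta_m r_m, \\
u_m     &= u_{1,m} + u_{2,m}.
\end{align}
```

Under this form, the first term of each utility is the ES's share of the user's payment and the second term is the inconvenience of supplying the resource.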

Problem Formulation
The goal of this paper is to incentivize ES participation in task computations under the premise of maximizing user utility, as well as to rationally allocate the resources purchased by users under the incentive mechanism to minimize the task execution time. Therefore, the objectives of this research can be defined as two sub-optimization problems.
Given user strategy $R = \{R_1, R_2\}$, each ES competes for the rewards offered by the user. As a result, a non-cooperative game is formed among these ESs, with the goal of reaching a state that all ESs find satisfactory. Once this state is achieved, the user adjusts their strategy to maximize their utility. Thus, the optimization objective at this stage can be stated as follows: P1: max where C1 indicates that the funds used for purchasing resources must not exceed the maximum funds available to the user, and C2 represents the satisfactory state from the perspective of the ESs.
Under the designed incentive mechanism, the user purchases the resources needed to process the current dependent task. Thus, the objective of the second sub-problem is to effectively allocate the purchased computing resources and obtain the optimal task allocation strategy to minimize the task completion time. Therefore, the optimization objective at this stage can be written as $\min(T(G))$, s.t. (13), where $T(G) = \max_{v_n \in end(G)} \{FT^n_j, FT^n_{down}\}$ and $end(G)$ is the set of exit tasks of the dependent task $G$.

Stackelberg Equilibrium Analysis under Information Sharing Conditions
In this paper, we first analyze the existence and uniqueness of the SE under information sharing conditions, in which each participant in the game can obtain all the information of the other participants, such as the preference factor $\delta$ reflecting the user's preference for communication and computing resources, and the inconvenience factors $\alpha_m$, $\beta_m$ reflecting the cost to ES $m$ of receiving and processing tasks. The objective of the Stackelberg game between the user and the ESs is to find a unique SE that maximizes user utility. In this equilibrium, neither the user nor the ESs have the motivation to unilaterally change their strategy. The SE is defined in this paper as follows: .., $Z^*_m$} that are the respective Stackelberg equilibria for the leader and the followers if the following conditions are satisfied:
Next, the Nash equilibrium (NE) of the non-cooperative game between the followers is analyzed. We first analyze the followers' best response for the computational resources. Differentiating Equation (9) with respect to $f_m$ gives the following: Clearly, utility function $u_{1,m}$ is concave with respect to $f_m$; as such, there exists a maximum value for $u_{1,m}$. This implies that the non-cooperative game concerning the allocation of computational resources has an NE. Therefore, by setting $\partial u_{1,m} / \partial f_m = 0$, we can obtain the following: By summing both sides of Equation (17), we obtain where $\gamma_m = \alpha_m D \eta_m \kappa_m$, and, since the difference between any two $\gamma_m$ is extremely small, the right side of Equation (18) can be approximated as $2\gamma \sum_{m=1}^{M} f_m$. As such, Equation (18) is further computed as follows: Thus, it can be derived that By substituting Equation (20) back into Equation (17), we obtain the closed-form solution for the optimal response of $f_m$: Given the user's strategy $R_1$, each ES always has its own optimal response $f_m$. Due to the concavity of the ES utility function $u_{1,m}$, the optimal response is unique. Therefore, the NE of the non-cooperative game between the ESs regarding computational resources is also unique.
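The closed-form best response can be checked numerically with a small sketch. It rests on assumptions that should be read as such: the computational utility is taken to be $u_{1,m} = R_1 f_m / \sum_i f_i - \gamma_m f_m^2$ (a proportional reward share minus a quadratic cost), and the common value $\gamma$ used in the approximation is taken to be the mean of the $\gamma_m$. The function names are hypothetical.

```python
import math


def computational_best_responses(R1, gammas):
    """Approximate NE contributions f_m for reward R1 and cost weights gamma_m."""
    M = len(gammas)
    if M < 2:
        raise ValueError("the game needs at least two ESs")
    gamma_bar = sum(gammas) / M  # assumption: common gamma is the mean
    # Summed first-order conditions: R1 * (M - 1) / F = 2 * gamma_bar * F.
    total_f = math.sqrt(R1 * (M - 1) / (2 * gamma_bar))
    # Solve R1 * (F - f_m) / F**2 = 2 * gamma_m * f_m for f_m with F = total_f.
    return [R1 * total_f / (R1 + 2 * g * total_f ** 2) for g in gammas]


def marginal_utility(R1, gamma_m, f_m, total_f):
    """Derivative of the assumed u_{1,m} with respect to f_m."""
    return R1 * (total_f - f_m) / total_f ** 2 - 2 * gamma_m * f_m


f = computational_best_responses(8.0, [1.0, 1.0])
residual = marginal_utility(8.0, 1.0, f[0], sum(f))
```

With identical cost weights the approximation is exact, so every ES contributes an equal share and the marginal utility vanishes at the computed point.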
For the communication resource utility in Equation (10), the derivative with respect to $r_m$ can be obtained as follows: Since Equation (23) is a constant and less than zero, utility function $u_{2,m}$ is a concave function; thus, there is an NE in the non-cooperative game between the followers over communication resources.
From reference [36], if the game of communication resources satisfies Theorem 1, the uniqueness of the non-cooperative game's NE can be demonstrated.
Theorem 1. Given strategy $R_2$ and the set $\bar{S} = \{m \in \mathcal{M} \mid \bar{r}_m > 0\}$ of participating followers, let $\bar{r} = (\bar{r}_1, \bar{r}_2, \ldots, \bar{r}_n)$ be a Nash equilibrium (NE) strategy. If the following four conditions are satisfied, then the NE is unique: (1) $|\bar{S}| \geq 2$, . Suppose the unit rate transmission cost of the edge server satisfies and let $h$ be the largest integer in $[2, n]$ such that $\beta_h < \frac{\sum_{j=1}^{h} \beta_j}{h-1}$, then $\bar{S} = \{1, 2, \ldots, h\}$.
Clearly, the non-cooperative game between parties over communication resources satisfies these four conditions.
Proof of Theorem 1. Assuming $|\bar{S}| = 0$, no ES participates in the game and the game is not established. Therefore, it can be inferred that $|\bar{S}| \geq 1$. Now, suppose $|\bar{S}| = 1$; this implies that there is a single $k \in \mathcal{M}$ with $\bar{r}_k > 0$. According to Equation (10), the utility of ES $k$ is $R_2 - \bar{r}_k \beta_k$, which means that ES $k$ can unilaterally modify its strategy (by reducing $\bar{r}_k$) to increase its utility, thereby contradicting the NE. Thus, Condition 1 is proved. Next, to prove Condition 2, we sum Equation (22) and obtain the following: By substituting Equation (24) back into Equation (22), as well as by setting the result to zero while considering $\bar{r}_j = 0$ for any $j \in \mathcal{M} \setminus \bar{S}$, we obtain the following: The proof of Condition 2 is thus complete. For Condition 3, given $|\bar{S}|$, for any $i \in \bar{S}$ we can deduce that $\bar{r}_i > 0$. According to Equation (25), $\bar{r}_i > 0$ implies $\frac{(|\mathcal{M}|-1)\beta_m}{\sum_{m=1}^{M} \beta_m} < 1$, and hence we can obtain the following: This tells us Assuming that there is $\beta_q < \max_{j \in \bar{S}}\{\beta_j\}$ but $q \notin \bar{S}$, then, according to Equation (25), we have $\bar{r}_q = 0$. Hence, Equation (22) can be rewritten as follows: This implies that ESs can unilaterally modify their strategy to increase their utility, which contradicts the NE definition. Thus, Condition 3 is proven. Next, we prove Condition 4. From Conditions 1 and 3, we have $\bar{S} = \{1, 2, \ldots, q\}$, where $q \in [2, n]$. According to Inequality (26), we can conclude that $q \leq h$. This implies that , which means that when $r = \bar{r}$, the derivative of the utility function of ES $q+1$ with respect to $r_{q+1}$ is $\frac{\sum_{j=1}^{q+1} \beta_j}{q} - \beta_{q+1} > 0$. This suggests that $q = h$, and thus Condition 4 is proven. Therefore, the NE of the non-cooperative communication resource game between the followers is unique.
Theorem 2. In the proposed multivariate Stackelberg game, there is a unique Stackelberg equilibrium between the user and the edge servers.
Proof of Theorem 2. By substituting Equations ( 20) and (24) into Equation ( 12), we obtain the following: where . The Hessian matrix of Equation ( 29) is obtained as follows: Since the eigenvalues of this matrix are all less than zero, the matrix is negative definite, implying that the original function is concave and hence possesses a maximum value.
As the best response strategy of the followers is unique, the value that maximizes $U$ is also unique. Therefore, the equilibrium of this Stackelberg game is unique.
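The structure described in Theorem 1 for the communication resource game suggests a direct way to compute that equilibrium: sort the ESs by unit cost, admit them while the threshold test on $\beta$ holds, and give each admitted ES a share that shrinks with its cost. The sketch below assumes the linear-cost utility $u_{2,m} = R_2 r_m / \sum_i r_i - \beta_m r_m$ and the closed forms that follow from its first-order conditions; these formulas are a reconstruction and the function name is hypothetical.

```python
def communication_equilibrium(R2, betas):
    """NE rates r_m of the communication game for payment R2 and unit costs betas."""
    n = len(betas)
    if n < 2:
        raise ValueError("the game needs at least two ESs")
    order = sorted(range(n), key=lambda m: betas[m])  # cheapest ES first
    # Grow the participating set: ES at sorted position h joins only if
    # beta_h < (sum of the first h+1 sorted costs) / h.
    h = 2
    while h < n and betas[order[h]] < sum(betas[order[j]] for j in range(h + 1)) / h:
        h += 1
    beta_sum = sum(betas[order[j]] for j in range(h))
    total_r = R2 * (h - 1) / beta_sum                 # aggregate rate at the NE
    rates = [0.0] * n
    for j in range(h):
        m = order[j]
        rates[m] = total_r * (1 - (h - 1) * betas[m] / beta_sum)
    return rates


rates = communication_equilibrium(9.0, [1.0, 2.0, 3.0])
```

In this toy instance the third ES is exactly at the threshold, so it stays out and the two cheaper ESs supply rates of about 2 and 1.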
According to the analysis above, this article develops a centralized algorithm that is capable of calculating the SE under information sharing conditions, the details of which are presented in Algorithm 1. This algorithm achieves an approximate equilibrium for the multivariable Stackelberg game. The approximation precision depends on $\varepsilon$. When $\varepsilon$ is smaller, the result is more accurate but the algorithm converges more slowly. When $\varepsilon$ is larger, the accuracy is lower but the algorithm converges more quickly.

Algorithm 1
Input: $D$, $\alpha$, $\beta$, $\kappa$, $I$, $\varepsilon$, $R$, $x_i$
Output: optimal strategy $R^{[k]}$, $x$
   initialization
4: calculate the followers' utilities $u_{1,m}$, $u_{2,m}$ by Equations (9) and (10)
5: save the strategy that maximizes $u_{1,m}$ and $u_{2,m}$ as $x$
8: save the strategy that maximizes $U$ as $R^{[k]}$
9: end while
10: end while
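The trade-off governed by $\varepsilon$ can be illustrated with a simple grid search over the leader's payment. This is a sketch of the general idea only and not a transcription of Algorithm 1: the step size `eps` plays the role of $\varepsilon$, and `toy_utility` is a hypothetical concave user utility (a logarithmic satisfaction term minus the payment) chosen purely for the example, with the followers' aggregate response assumed to be $\sqrt{R_1 (M-1) / (2\gamma)}$.

```python
import math


def grid_search_leader(user_utility, r_max, eps):
    """Return (best payment, best utility) over the grid 0, eps, 2*eps, ..., r_max."""
    steps = int(r_max / eps)
    best_r, best_u = 0.0, user_utility(0.0)
    for k in range(1, steps + 1):
        r = k * eps
        u = user_utility(r)
        if u > best_u:
            best_r, best_u = r, u
    return best_r, best_u


def toy_utility(r1, delta=0.8, gamma=1.0, M=2):
    """Hypothetical user utility: satisfaction from resources minus payment."""
    total_f = math.sqrt(r1 * (M - 1) / (2 * gamma))  # assumed follower response
    return delta * math.log(1 + total_f) - r1


coarse_r, coarse_u = grid_search_leader(toy_utility, r_max=1.0, eps=0.1)
fine_r, fine_u = grid_search_leader(toy_utility, r_max=1.0, eps=0.01)
```

The finer grid evaluates ten times as many candidate payments and lands closer to the maximizer, which mirrors the accuracy versus convergence speed trade-off described for $\varepsilon$.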

Study of the Incentives under Non-Information-Sharing Conditions
In the previous section, we focused on analyzing the NE under the conditions of complete information transparency. However, each participant in the game is rational and may be unwilling to disclose their private parameters. This situation precludes the use of centralized algorithms to solve the NE. Deep reinforcement learning (DRL) achieves a balance between exploration and exploitation, thereby learning from the accumulated experience through environmental exploration to maximize rewards. Inspired by reference [37], we designed a reinforcement learning algorithm based on the learning game to solve the SE without sharing information. In the sections that follow, we first provide an overview of the entire framework for the incentive mechanism based on DRL. Subsequently, the NE solving problem is expressed as a DRL learning task.

Overview
To achieve the NE while preserving privacy, each participant becomes an agent in the DRL framework, as shown in Figure 3. In the figure, the user is viewed as Agent 0 and ES $m$ is viewed as Agent $m$. The user cannot issue new incentives until the ESs make a new decision, because the learning state of each agent is updated according to each participant's decision. The agent training process is divided into $T$ iterations. In each iteration $t$, the user interacts with the environment as the leader of the game and determines action $a^t_0$ from the acquired state $S^t_0$; ES $m$ interacts with the environment as a follower of the game and determines action $a^t_m$ from the acquired state $S^t_m$. The rewards $R^t_0$ and $R^t_m$ of all participants are then calculated from their collected strategies, and the state of the next moment is created based on the collected strategies. Meanwhile, the agent saves the historical data of all actions in a buffer queue of size $L$. After $D$ time slots, the experience in the buffer queue is replayed to compute the rewards, which are then utilized to update the network parameters. Since in each iteration of training the agent's state is generated from the action decisions and no private information is acquired, the NE can be found without privacy leakage.

Design Details
In this paper, the interaction between the game participants is formulated as a multiintelligence POMDP.The details of its state space, action space, and reward function are as follows.
State space: For the current iteration at time t, the state of each participant in the game is composed of its previous experiences and the experiences of all the other participants in the most recent training sets L. Specifically, S t 0 = {ω t−L , ω t−L+1 , ..., ω t−1 } is expressed as the state of the user, and Action space: According to the game decision variables, the user's action at iteration time t is defined as a t 0 = v t 0 = {R 1 , R 2 }, and the action of ES m at time t is defined as And, in order to increase the learning efficiency, the action values of each agent are restricted to the range of 0-1 using the Min-Max normalization method.
Reward function: The agent reward functions are designed from the utility functions and constraints. Since the maximum amount that the user can offer is $R_{\max}$, a penalty factor $\mu$ is added to the reward function whenever the total price exceeds $R_{\max}$. The reward functions of the user and of ES $m$ at iteration $t$ are set on this basis.

Optimization of Learning Objectives and Strategies
In this paper, an actor network $\pi_\theta$ and a critic network $v_\sigma$ are designed for each agent. The agent learns to approximate the policy function with $\pi_\theta$ and the value function with $v_\sigma$, where $\theta$ and $\sigma$ represent the parameters of the networks. For agent $m$, the state value function is expressed as $V(S_m^t; \pi_{\theta_m})$ and the action value function as $Q(S_m^t, a_m^t; \pi_{\theta_m})$, and the learning objective of agent $m$ is denoted $L_m$. The training process in this study uses the proximal policy optimization (PPO) algorithm introduced in reference [38]. Specifically, the policy gradient and the policy gradient clipping term are defined as follows:
$$\nabla_{\theta_m} L_m = \mathbb{E}\left[\nabla_{\theta_m} \log \pi_{\theta_m}(S_m^t, a_m^t)\, A^{\pi_\theta}(S_m^t, a_m^t)\right],$$
$$L_m^{\mathrm{CLIP}} = \mathbb{E}\left[\min\left(P_m^t\, A^{\pi_\theta}(S_m^t, a_m^t),\; F(P_m^t)\, A^{\pi_\theta}(S_m^t, a_m^t)\right)\right], \tag{34}$$
where $P_m^t = \pi_{\theta_m}(a_m^t \mid S_m^t) / \pi_{\theta_m^{\mathrm{old}}}(a_m^t \mid S_m^t)$ is the probability ratio between the new and old policies, which is used to control the magnitude of the gradient update of the policy, and $A^{\pi_\theta}(S_m^t, a_m^t) = Q(S_m^t, a_m^t; \pi_{\theta_m}) - V(S_m^t; \pi_{\theta_m})$ represents the advantage function, which adjusts the direction of the policy gradient update.
The clipping function in Equation (34) is defined as follows:
$$F(P) = \begin{cases} 1 + \varepsilon, & P > 1 + \varepsilon, \\ P, & 1 - \varepsilon \le P \le 1 + \varepsilon, \\ 1 - \varepsilon, & P < 1 - \varepsilon, \end{cases}$$
where $\varepsilon$ is an adjustable parameter that prevents policy updates from becoming excessively large, which could lead to unstable training. Following this, the actor and critic networks are updated using stochastic gradient ascent and gradient descent, respectively. As the training process progresses, the agent incrementally learns the optimal policy. Upon convergence of the training process, the agent determines its policy based on the output of the actor network.
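To make the clipping step concrete, the following sketch evaluates the standard PPO clipped surrogate on a small batch. It is a plain-Python illustration under the usual PPO definition, not the authors' TensorFlow code; the function names and the default $\varepsilon = 0.2$ are assumptions.

```python
def clip_ratio(ratio, eps):
    """Clip the new/old policy probability ratio to [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))


def ppo_clipped_objective(new_probs, old_probs, advantages, eps=0.2):
    """Mean of min(ratio * A, clip(ratio) * A) over a batch of samples."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        total += min(ratio * adv, clip_ratio(ratio, eps) * adv)
    return total / len(advantages)
```

Taking the minimum of the clipped and unclipped terms means that a sample with a positive advantage contributes no extra objective value once its ratio exceeds $1 + \varepsilon$, and a sample with a negative advantage is bounded in the same way once its ratio falls below $1 - \varepsilon$. This is what limits the size of each policy update.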

Task Allocation for DRL Based on Greedy Thinking
Motivated by the user, edge servers participate in task computations. To efficiently allocate the resources purchased by the user under the incentive mechanism, we design a deep reinforcement learning algorithm inspired by the work presented in reference [39]. This algorithm first uses a greedy approach to distribute resources, and it then applies sequence-to-sequence (S2S) neural network reinforcement learning to assign dependent tasks with the aim of minimizing task execution time. The specifics of S2S neural network reinforcement learning are detailed below.

Overview
During the task assignment and scheduling process, each edge server (ES) deploys an S2S neural network, the details of which are shown in Figure 4. The vector embedding of the DAG is denoted as $V = [v_1, v_2, \ldots, v_n]$, and the function of the encoder network is represented as $f_{enc}$. By feeding the embedding vectors into the encoder, the hidden state of encoding step $i$ is represented as
$$e_i = f_{enc}(v_i, e_{i-1}; \theta_{enc}),$$
where $\theta_{enc}$ denotes the parameters of the encoder network. Upon completion of the encoding, the hidden-state representation of the original sequence is obtained. Let $f_{dec}$ be the function of the decoder network; the output $d_j$ of decoding step $j$ is then calculated as follows:
$$d_j = f_{dec}(d_{j-1}, a_{j-1}, c_j; \theta_{dec}),$$
where $a_{j-1}$ is the predicted value at the previous step, $\theta_{dec}$ denotes the parameters of the decoder network, and $c_j$ is the context vector of the attention mechanism. According to reference [40], $c_j$ is defined as follows:
$$c_j = \sum_{i=1}^{n} \alpha_j(i)\, e_i,$$
where $\alpha_j$ is defined as
$$\alpha_j(i) = \frac{\exp\left(v^{\top} \tanh\left(W_a [e_i; e_j]\right)\right)}{\sum_{k=1}^{n} \exp\left(v^{\top} \tanh\left(W_a [e_k; e_j]\right)\right)},$$
where $[e_i; e_j]$ and $[e_k; e_j]$ are concatenations of row vectors, and $W_a$ and $v$ are learnable parameters. In addition, two fully connected layers are added to the output $d_j$ of the decoder: one serves as the output distribution of the action network $\pi(a_j \mid s_j)$, and the other serves as the output state value of the value network $v(s_j)$. When the training of the S2S neural network is completed, the output of the network is the desired task assignment decision.
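The attention step can be illustrated with a small additive-attention sketch in plain Python. It assumes the common form in which each encoder state is concatenated with a query vector, scored through a tanh layer, and normalized with a softmax; the function name, argument layout, and the use of plain lists instead of tensors are choices made for this example only.

```python
import math


def attention_context(encoder_states, query, W_a, v):
    """Additive attention: returns the weights alpha and the context vector c."""
    scores = []
    for e in encoder_states:
        concat = list(e) + list(query)  # concatenation [e_i ; query]
        hidden = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W_a]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    alphas = [x / norm for x in exps]
    dim = len(encoder_states[0])
    context = [sum(a * e[d] for a, e in zip(alphas, encoder_states)) for d in range(dim)]
    return alphas, context
```

Here `W_a` has one row per hidden unit and one column per element of the concatenated vector, and `v` has one entry per hidden unit. The weights always sum to one, so the context vector is a convex combination of the encoder hidden states.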

Design Details
To solve the task offloading problem using DRL, this paper models the task assignment problem as a Markov decision process (MDP), which is described as follows.
State space: When scheduling task $v_n$, the current state of the system depends on the scheduling results of the tasks preceding $v_n$. The state space is therefore defined as a combination of the directed acyclic graph (DAG) information and a partial offloading plan, that is, $S = \{G, A_{1:i}\}$, where $G$ is the embedding vector of the DAG, which includes the index number of task $v_n$, the estimated transmission and execution costs of the task, and the task numbers of its direct predecessors and direct successors. In this paper, the number of direct predecessors and the number of direct successors are each capped at six, and $A_{1:i}$ represents the sequence of task assignment decisions from $v_1$ to $v_i$.
Action space: After the user hands the dependent task over to edge server (ES) $j$, the edge servers cooperate to execute it based on the resources purchased by the user. Therefore, the action space can be defined as $A = \{1, 2, \ldots, j, \ldots, M\}$.
Reward function: Since the goal of this paper is to minimize the task completion time, the learned reward is defined as the increment of time saved. The reward for offloading sub-task $v_n$ is therefore defined from two quantities: $T_j$, the average execution time of the sub-task on ES $j$, and the actual time spent executing the task under the decisions $A_{1:n-1}$.
Since the purchased resources are limited, a penalty factor is set for situations where the usage exceeds the purchased resources. Taking the dependent task on ES $j$ as an example, a counting function $C(j)$ is defined under the final task assignment decision $A_{1:N}$; it gives the number of times that action $j$ appears in the decision set $A_{1:N}$ and can be written as $C(j) = |\{x \in A_{1:N} : x = j\}|$. The reward $R$ for executing the current dependent task then includes a penalty term weighted by the penalty factor $\varphi$.
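The counting function and the penalty can be sketched as below. The counting step follows the definition of $C(j)$ directly; the penalty form is an assumption made for illustration only (a linear penalty on the number of sub-tasks assigned beyond a hypothetical per-ES capacity derived from the purchased resources), since the exact expression is not reproduced here.

```python
from collections import Counter


def count_assignments(decisions):
    """C(j): number of times each ES index j appears in the decision sequence A_{1:N}."""
    return Counter(decisions)


def penalized_task_reward(base_reward, decisions, capacity, phi):
    """Subtract phi for every sub-task assigned beyond an ES's capacity (assumed form)."""
    counts = count_assignments(decisions)
    overflow = sum(max(0, counts[j] - cap) for j, cap in capacity.items())
    return base_reward - phi * overflow
```

For example, with decisions `[1, 2, 2, 3, 2]` and a capacity of two sub-tasks per ES, only ES 2 is over capacity, by one sub-task, so a single penalty of $\varphi$ is applied.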

Optimization of Learning Objectives and Strategies
Let the training objective be $L$. The goal is to find an optimal policy that maximizes the cumulative reward, where $\theta$ denotes the parameters of the S2S neural network, $N$ represents the number of sub-tasks, and $R$ represents the reward function with the penalty factor. The network in this section is also trained using the PPO method. During the training process, the agent uses a discount factor $\gamma$, and the S2S neural network is updated with the discounted cumulative rewards every $T$ iterations. As the training process progresses, the network gradually converges. Subsequently, based on the predictions of the S2S neural network, the ESs can obtain the optimal offloading decision and resource allocation scheme that minimizes the execution time of dependent tasks.
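The discounted cumulative reward used in the update can be computed with a single backward pass over the per-step rewards. The sketch below shows the standard recursion $G_n = r_n + \gamma G_{n+1}$; the function name is an assumption, and how these returns enter the PPO loss is left to the training code.

```python
def discounted_returns(rewards, gamma):
    """Return G_n = r_n + gamma * G_{n+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):  # iterate from the last step backwards
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns
```

With rewards `[1.0, 1.0, 1.0]` and $\gamma = 0.5$, the returns are `[1.75, 1.5, 1.0]`: earlier sub-task decisions are credited with the discounted effect of later ones.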

Simulation Results
In this paper, we evaluate the performance of the algorithms through numerical simulations implemented in a Python 3.7 environment using TensorFlow. The efficacy of the two algorithms was verified with two, three, and four ESs. Each non-proxy ES has the same computational capacity, but their communication capabilities differ. The dependent tasks were generated using the DAG generator provided in reference [39]. The specific parameters are listed in Table 1.

Performance Analysis of the Incentive Mechanism Algorithm Based on Learning Games
This section presents the convergence analysis of the reinforcement learning algorithm based on learning games, using two edge servers as an example. The proposed algorithm is also compared with the A2C algorithm proposed in reference [41]. A2C estimates the quality of agent actions using an advantage function and updates the network parameters using policy gradients to learn strategies that yield higher rewards. Furthermore, we introduce a greedy algorithm and a random method as baselines against which to evaluate the performance of the proposed algorithm.
The convergence plots of the user and edge server utilities are shown in Figure 5a-c. The graphs show that the proposed algorithm fully converges at approximately 600 rounds, whereas the A2C algorithm converges at around 800 rounds, an improvement in convergence speed of approximately 22.4 percent for the proposed algorithm. In terms of stability, when the proposed algorithm converges, the utilities of the participants are very close to the theoretical maximum social welfare (SE), whereas the A2C algorithm still exhibits a significant gap from the theoretical SE at convergence. The greedy and random methods, by contrast, oscillate without converging. These results highlight the advantages of the proposed algorithm in terms of utility convergence and stability.
Based on the analysis in Figure 5, it can be concluded that the greedy algorithm and the random method are unable to solve for the maximum social welfare (SE). Therefore, to keep the result graphs readable, the subsequent strategy and convergence plots do not include the results of these two methods. Figure 6a-c shows the convergence graphs of the user price strategy, the ES computing resource strategy, and the ES communication resource strategy, respectively. For the user, the price strategy is the reward offered to incentivize ESs to participate in task computation, and it corresponds to both computing resources and communication resources. With the algorithm designed in this paper, the participants' strategies gradually approach the theoretical Nash equilibrium as the number of iterations increases and finally converge to a position very close to equilibrium. Although the A2C algorithm also converges, its policy after convergence remains at some distance from the theoretical Nash equilibrium strategy. Therefore, in terms of both convergence speed and convergence accuracy, the method designed in this paper is generally superior to the traditional A2C method.

Analysis of the Effectiveness of the Greedy-Based DRL Algorithm
In this subsection, we validate the effectiveness of the greedy-based DRL algorithm using two different dependency structures with identical task sizes and quantities. One dependency structure primarily consists of two parallel structures, while the other primarily consists of three parallel structures, and the depth of the dependent tasks in the two-parallel structure is greater than that in the three-parallel structure.
Figure 7a shows the usage of computing resources under different numbers of edge servers. As shown in the figure, when the number of ESs is fixed, the amount of computing resources consumed remains the same even though the dependency structure varies. This is because, regardless of the dependency structure, the number of sub-tasks processed by the proxy ES in the optimal scenario is the same, the number of sub-tasks distributed further is the same, and the computing capability of each non-proxy ES is the same.
Figure 7b shows the usage of communication resources in different situations. When the number of edge servers is two, the proxy ES distributes five sub-tasks externally regardless of the main parallel structure of the dependent task, so the communication resources used are the same. However, as the number of edge servers increases, the number of tasks assigned to each edge server varies across dependent task structures, which results in different usages of communication resources. Furthermore, the changes in total resources shown in Figure 7a,b indicate that the more edge servers are motivated to participate in the computation, the fewer resources each edge server contributes on average.
Figure 8 presents the impact of the dependency structure and the number of ESs on the execution time of dependent tasks. When dependent tasks are executed entirely locally, the dependency structure has no impact on the task, and the local execution time therefore remains constant. For tasks primarily based on two parallel dependency structures, the execution time with three ESs is slightly less than with two ESs, because the waiting time of sub-tasks in the dependent tasks is greater than the communication time between edge servers. When the number of ESs is four, the computing resources of each non-proxy ES are the same as with three ESs, so the optimal offloading decision remains the same regardless of the parallel structure. However, as each edge server's average resource contribution decreases and fewer resources are allocated to each task, the execution time increases. Overall, the execution time of the dependent task under the incentive mechanism is still less than the total local execution time.

Conclusions
This paper investigated an incentive mechanism based on a Stackelberg game with multiple decision variables. In this mechanism, the user acts as the leader and proposes incentive strategies, while the ESs act as followers and provide available resources in response to the user's incentives. We then analyzed the uniqueness of the SE under information sharing conditions. Considering that participants are unwilling to disclose their private parameters, we proposed a reinforcement learning method based on learning games to solve for the Nash equilibrium without information sharing. After obtaining the optimal decision, we used a greedy approach to allocate the resources provided by the ESs, and we employed a reinforcement learning method based on S2S neural networks to obtain the task allocation decision that minimizes task execution time. Finally, the effectiveness of the model was demonstrated through simulations. Future work will consider more efficient resource allocation methods and will aim to further optimize the task allocation process with the goal of maximizing the utility of task execution.

Figure 7. Resource usage graph. (a) Computing resource usage graph and (b) communication resource usage graph.