Adaptive Real-Time Ofﬂoading Decision-Making for Mobile Edges: Deep Reinforcement Learning Framework and Simulation Results

: This paper proposes a novel dynamic ofﬂoading decision method which is inspired by deep reinforcement learning (DRL). In order to realize real-time communications in mobile edge computing systems, an efﬁcient task ofﬂoading algorithm is required. When the decision of actions (ofﬂoading enabled, i.e., computing in clouds or ofﬂoading disabled, i.e., computing in local edges) is made by the proposed DRL-based dynamic algorithm in each unit time, it is required to consider real-time/seamless data transmission and energy-efﬁciency in mobile edge devices. Therefore, our proposed dynamic ofﬂoading decision algorithm is designed for the joint optimization of delay and energy-efﬁcient communications based on DRL framework. According to the performance evaluation via data-intensive simulations, this paper veriﬁes that the proposed dynamic algorithm achieves desired performance.


Introduction
According to the fact that the 5G era has been realized based on networking technology innovation, many new computing and communications paradigms are introduced such as millimeter-wave communications, hyper-dense networks, and device-to-device proximal networking [1][2][3]. Among them, mobile edge computing (MEC) is one of the major technologies for realizing data/computing distribution in order to improve cellular network performance [4]. Based on the benefits of MEC technologies such as data rate improvement and quality enhancement, mobile cellular users can enjoy high quality, seamless, and real-time communication networking services.
Together with the MEC technologies, many networked components are also of interest such as cloud servers and connected devices/vehicles. As the number of connected devices/vehicles increases, the amount of data transmitted to the MEC is also rapidly increasing. This obviously introduces serious network limitations such as data processing performance limits, storage capacity limits, and the battery use of terminal devices. Under the consideration of these limitations and problems, the use of cloud computing server is more efficient to deal with big data (gathered from connected vehicles/devices via MEC edges) than the local computing on terminals. On the other hand, for any cases where the data cannot be handled in the cloud servers due to delay requirements and security reasons, MEC devices should be able to handle or process the data from the connected vehicles/devices. Therefore, we can observe the trade-off between cloud computing servers and MEC servers (local edge computing servers). In summary, cloud servers have more power and centralized computing benefits comparing to MEC servers, whereas, it may have limitations when we have delay requirements and security requirements/regulations. Figure 1 shows the combined architecture of local edge computing and cloud computing. As the transmission from connected vehicles/devices (i.e., data upload and download) delay goes down, it is much more suitable for real-time processing. It enables real-time data processing and transmission using high-quality data in real-time broadcasting or video streaming. Therefore, it is essential to design an algorithm which can handle the data transmission scheduling (i.e., offloading decision-making, i.e., centralized vs. local computing). Note that the data from connected vehicles/devices will be handled in cloud computing servers when offloading decision is made, whereas the MEC servers will handle the data when the offloading is not preferred. In order to handle this problem, many research contributions nowadays have been considering deep learning and machine learning based methodologies. Among them, this paper considers reinforcement learning based approaches because this given problem is for stochastic sequential offloading decision-making. Among a lot of deep reinforcement learning (DRL) methodologies such as Q-learning, Markov decision process (MDP) [5], deep Q-network (DQN), and deep deterministic policy gradient (DDPG) [6,7], this paper designs a sequential offloading decision-making algorithm based on DQN. The reason why this paper considers DQN is that it is the function approximation of Q-learning using deep neural network (DNN) in order to take care of large-scale problem setting. The proposed sequential offloading decision-making algorithm inspired by DQN aims at the maximization the cumulative summation of the rewards for sequential offloading actions (centralized vs. local computing). Therefore, it is obvious that the reinforcement learning is a way to proceed optimal learning even in a dynamic environment under uncertainty. The reinforcement learning is applied to a variety of environments such as artificial intelligence robotics development, automatic systems, and self-driving technology development. This paper utilizes this DQN technology for offloading decision-making (i.e., processing the transmitted data at centralized computing, i.e., cloud or local computing, i.e., MEC) for real-time seamless high quality data communications under uncertainty conditions, e.g., channel dynamics. The reason why DQN is selected among the given Q-learning, MDP, DQN, and DDPG is as follows. The Q-learning and MDP are not related to DNN based function approximation; and their solutions can be computed using dynamic programming. Therefore, the computational complexity of Q-learning and MDP solution computation can be very high when the search space becomes large. The DDPG based formulation is for continuous action domains. In this paper, our formulation is for offloading decision-making whether the offloading should be conducted or not. Thus, our action space is discrete (0 or 1), i.e., continuous domain DDPG is not suitable; and finally DQN is the best solution among given options.
In mobile edge computing research, several algorithms were proposed in order to optimize their own objectives [8][9][10][11][12]. However, none of them are considering offloading decision for determining whether it has to conduct offloading or not. The proposed algorithm in this paper is the improvement of our previous contributions in [13]. Comparing to our previous work in [13], the proposed algorithm in this paper includes more detailed mathematical formulations and descriptions. Furthermore, various data-intensive performance evaluation results are included in this paper.
These kinds of DRL-based algorithms including our proposed DQN-based method can be widely used and are practical due to the function approximation nature of DNN. With DQN, the states and actions are used as input and output values in DNN training. Once the training is done, we can simply infer our approximated optimal actions by putting current states into the trained DNN. Thus, the computation itself during the inference is light-weight (even though training itself is not) and thus using a DQN-based method is realistic in real-time decision-making systems, as also discussed in [14].
The rest of this paper is constituted as follows. Section 2 introduces preliminaries, i.e., deep learning foundations and related work, i.e., reinforcement learning algorithms for mobile edge computing. Section 3 proposes our system model. Section 4 describes the details of our algorithm with explanation about DQN. In Section 5, we describe the details of simulations and results. Finally, Section 6 concludes this paper and presents future research directions.

Deep Learning and Deep Reinforcement Learning Foundations
Nowadays, there are huge interests on deep learning algorithms and applications by industry and academia. The deep learning is essentially for conducting artificial intelligence (AI) tasks using neural network framework. The neural network is for enabling nonlinear computation (so called activation function) in terms of its own AI tasks. If the neural network has more and more hidden layers, the nonlinearity increases. Previously, having many hidden layers was not possible due to a gradient vanishing problem (i.e., the problem where the parameters in hidden layers are not trained and converge to zero). However, the problem was solved by utilizing other types of activation functions, e.g., rectified linear unit (ReLU). Finally, the neural network starts to have many hidden layers; thus, the nonlinearity increases a lot. Eventually, the performance is dramatically increased, even better than human beings in some areas. This neural network is called deep learning because it has deep many hidden layers. Once the neural network is trained with many data, the parameters within the neural network are trained. Thus, we can move to the inference procedure which takes inputs, then the input will be calculated with the trained parameters, and then the outputs can be derived.
Among many deep learning frameworks, if the inputs and outputs of neural network are states and their corresponding actions, respectively, it means we can train the neural network with states and their corresponding actions. Once the training is done, the neural network is now capable of deriving approximated actions depending on input states. Semantically, it means that the neural network can sequentially derive approximated optimal actions in current input states. Thus, this neural network is capable of sequential decision-making for time-series states inputs. This is deep learning based reinforcement learning, and thus it is called deep reinforcement learning, which is mainly discussed in this paper.

Related Work: Reinforcement Learning for Mobile Edge Computing
There are remarkable research results that are based on reinforcement learning techniques for sequential stochastic decision-making in various computing research domains. For the application of deep reinforcement learning to mobile edge computing, the research contributions in [8][9][10][11] had been discussed about the optimization for their own objective functions. Even if they considered many criteria for the offloading, there are not contributions that aim at the optimal sequential decision-making for offloading decisions, i.e., whether it has to conduct offloading (i.e., centralized computing) or not (i.e., local edge computing). For the design and implementation of intelligent and efficient systems, Feng et al. proposed a hybrid intelligent control mechanism which combines high-level time PetriNet and reinforcement learning frameworks [15]. The control system of the model is mathematically designed by hybrid time PetriNet and additional methods are utilized based on Q-learning, which is a special type of deep reinforcement learning for minimizing transitions delay time. Even though the paper is successful for delay minimization, the other important factors, e.g., energy consumption optimization and offloading, are not discussed in this paper. Furthermore, Bhagat et al. designed a self-sufficient intelligent agent that can be learned based on the information collected from the agent's environment by combining DRL algorithms and soft bio-inspired structures [16]. They also presented several examples in various real-world scenarios. Furthermore, the paper described various research results that conduct DRL methods in soft robotics research domains. Even though the algorithm in this paper improves the performance by utilizing bio-inspired methodologies, it is not discussing about communication-related performance improvements, e.g., delay and energy optimization. In autonomous driving research, Zhang et al. improved the instability seen by Q-learning which is using a double deep Q-learning network as an automatic stochastic sequential decision-making approach for vehicle controls [17]. This paper is novel and conducts efficient tasks in vehicle controls; the algorithm is not scalable for conventional mobile edge computing environment because the parameter setting and formulations are all based on application-specific vehicle controls. Xu et al. designed a sequential intelligent decision-making scheme for highway autonomous driving [18]. This decision-making policy is learned via Q-learning, which is one of the well-known reinforcement learning algorithms based on the series of simulated driving scenarios data. In addition, there are a number of control-related studies for safe autonomous driving in highway environments [19,20]. Shin et al. proposed a novel inverse-reinforcement and imitation learning algorithm, which is one of major reinforcement learning algorithms nowadays, for autonomous driving control under the benefits of augmented random search [21,22]. In addition, there exists RL applications to game playing. Silver et al. designed and implemented AlphaGo-Zero, which evaluates the location decision using adaptive tree search techniques and determines the movement using trained deep learning based on reinforcement learning [23]. The AlphaGo Zero conducted the learning procedures through its own self without a guide or knowledge of human data and game rules, and it improved the performance of the tree search techniques. Finally, the AlphaGo Zero has achieved more than 100 wins against AlphaGo, which won against world-champion human Go players. The algorithms in [18][19][20][21][22][23] conduct engineering for their specific applications well such as autonomous driving and game AI. Therefore, they are all independent of the communications and networks specific key consideration factors such as energy and delay optimization. Therefore, none of our related research literature papers are directly associated with ours. If we directly utilize their algorithms to our mobile edge computing systems without any modifications, the algorithm definitely presents poor performances due to the lack of mathematical considerations in terms of energy and delay optimization.

Reference System Model and Formulation
In this scenario, terminal devices (i.e., CCTV camera, drones equipped camera, and vehicle black-box) are wirelessly connected to an edge server which is located nearby as shown in Figure 2 . When the terminal device collects video continuously/seamlessly, there exists the limits of battery and storage capacity depending on the size of devices. Based on this limitation, it is not possible to store or compute all of the data on the device itself in any time. Furthermore, even if all data can be processed, a delay which is caused by the processing limit of the device occurs. This can make serious problems in environments where real-time communications is essentially necessary, e.g., mission-critical communications. For this reason, it is necessary to use virtual servers that have better performance than terminal devices. Moreover, using the edge server can be a way to overcome these problems related to the limitation of the terminal devices. However, using edge computing technique (i.e., data offloading) is always not the best solution. We have to consider the delay time of wireless data transmission between the terminal device and the edge server. For this reason, we propose a novel offloading decision algorithm that minimizes the combination of energy consumption and delay at each terminal device. The algorithm we proposed in this paper is based on DQN, which is one of reinforcement learning algorithms that maximizes cumulative summation of rewards. Using this DQN-based algorithm, each terminal device can decide whether it will do offload or not. For this purpose, the corresponding actions and rewards values in DQN-based formulation are as follows: where TD capacity, StandBy Q capacity, and ES capacity stand for the capacity of terminal device, stand-by Q capacity, and the capacity of edge server, respectively. Here, the standByQ means the waiting buffer of each terminal device. The data/information that is decided to be offloaded is stored the standByQ. Next, each action means whether it needs to be offloaded or not. Here, the value 1 means offloading happens at time step t, whereas the value 0 means processing the data locally at time step t (i.e., offloading does not happen). The details of the proposed algorithm are explained in Section 4.

Offloading Decision: Trade-Off between Energy Consumption and Delay
In mobile edge computing (MEC) scenarios, each terminal device has limitations in terms of storage capacity, computing performance, and battery lifetime. Each terminal can offload its data (or task) to its associated nearby edge server and also does not need to operated by itself. The MEC can (i) save the energy of each device, (ii) prevent data loss due to insufficient storage capacity, and (iii) reduce delay due to the lack of computing performance. To make good use of the advantages, it is necessary to decide what to offload and when to offload. The execution delay and energy consumption of each terminal device can be distinguished into two parts, i.e., local computing and edge computing as explained in (1) and (2). In edge computing architectures, the values include the transmission to edge server, processing on the edge server, and downloading from the edge server.
Delay o f f load = U ploadingdelay + ESprocessingdelay + Downloadingdelay, where Delay o f f load and Delay local are the delays when the offloading happens or not, respectively.
Energy o f f load = U ploadingenergy + ESprocessingenergy + Downloadingenergy, where Energy o f f load and Energy local are the energy consumption when the offloading happens or not (based on offloading decision-making by DQN), respectively. As discussed in the literature, offloading decisions are made depending on the following three criteria, i.e., (i) minimization of execution delay, (ii) minimization of energy consumption while satisfying tolerance delay, and (iii) trade-off between energy consumption and execution delay depending on the environment of the applications of offloading in each terminal device [24].
When the goal is the minimization of total execution delay regardless of energy consumption, each device will conduct the internal offloading policy and it decides whether offloading should happen or not, i.e., local computing or offloading computing [25]. If the goal is the minimization of the energy consumption of each device, it uses the smallest energy resources even if the delay increases. This paper considers both execution delay and energy consumption of each terminal. Furthermore, for real-time communications, more weights are allocated to the minimization of total delays than energy consumption. In this postulated environment, each terminal determines whether it has to offload or not, according to DQN-based offloading decision-making algorithm which is based on reinforcement learning instead of general offloading determination [26,27].

Proposed DQN-Based Offloading Decision Algorithm
In this section, a DQN-based novel offloading compression decision (OCD) algorithm is proposed. The brief introduction to the algorithm is explained in Figure 3.

Deep Reinforcement Learning
In theory, the reinforcement learning is a method where an agent learns the set of actions which purposes maximizing cumulative summation of rewards. The reward is defined as a return value that the agent can get. The agent receives the reward and information about the changed state at time step t + 1 from the environment. Using this reward and observation, the agent learns a policy π, and we say the policy as optimal policy π * when the reward is always optimal through the policy [28,29]. The reward value can be calculated using Q-function.Q(s, a) is a reward of current state and it is a summation of return value r andQ(s , a ) which is a maximum reward value expected to be received in the future [30]. The corresponding mathematical formulation is as follows: where (4) is an equation that contains a learning rate α and a discounting rate γ to (3).
Here, DQN is a way to learn the Q-function using a deep neural network. Through layers, the output informs all or the possible actions and the corresponding reward of each action [31]. The application of deep neural network overcomes the size of memory that has been the limitation of Q-table in Q-learning; and an efficient function output is possible for large amounts of data. The expressions (5) show that it minimizes a difference between the optimal value and real value (i.e., predicted value). The optimal value is a target value that the agent aims to get. The equation form is similar to the cost f unction of linear regression problem: However, since both the target and predicted values use the same network and similar learning data, DQN has a problem with correlation between samples and non-stationary target. The change of target value when performing network update in Q pre is the cause of the non-stationary target problem. Furthermore, because of that problem, it is impossible to make the predicted value similar to the target value, and the agent will learn policies in a different direction than the target. To solve this problem, we can use DQN. It can store data collected by agent in reply buffer and use only data that is extracted from the buffer. In addition, in DQN, the pretend network and the target network are separated. Therefore, the updates of the pretend network θ cannot make an impact to the target network. According to these characteristics of DQN, the agent can get appropriate policies [32][33][34].

DQN-Based Offloading and Compression Decision
For the DQN-based offloading decision algorithm, the states, actions, and rewards are defined as mentioned in Section 3.1. This proposed algorithm is formally described in Algorithm 1. In this subsection, we explain the details of the proposed algorithm. Each state is the observation that the agent gets after taking an action from the environment. We set the capacity of TD, StandBy Q, and ES as state, and it represents the capacities' changes over time. Likewise, the agent will get a reward (R combi ) in each step. The reward we set is the combination of three aspects of the reward, i.e., the reward of execution delay, the reward of energy consumption, and the reward of terminal device's capacity. The agent uses the state information and reward to make a better action decision: where R combi stands for the reward function definition which is the combination of R delay (reward under the consideration of execution delays), R energy (reward under the consideration of energy consumption), and R capacity (reward under the consideration of capacity) where the three reward components are defined as follows: where D dqn and E dqn are When offloading computing is decided in each step based on DQN, D local and E local stand for the execution delay and energy consumption when the tasks are processed in the local at time step t. Similarly, D o f f load and E o f f load are the values when the tasks are processed in the edge server. Finally, the total execution delay and energy consumption for all the tasks are represented as (10) and (11). Thus, semantically, the (10) and (11) are the linear combinations of execution delay and energy consumption depending on offloading decisions.
Based on the definitions of (10) and (11), the rewards R delay in (7) and R energy in (8) are defined as the difference between the value when the tasks are performed locally and the value by the decision algorithm based on DQN. As a result, the learning is carried out to minimize D dqn and E dqn while it maximizes R delay and R energy . Furthermore, the reward R capacity in (9) is a value to represent an overflow the terminal's capacity. Note that R capacity is meaningful only under certain conditions according to Algorithm 2. If local computing (i.e., no offloading) is decided by the algorithm and the capacity of terminal device is not sufficient to process task at time step t, the negative reward,R capacity , is applied. for t = 1, T do 7: With probability selects a random action a t

8:
Otherwise select a t = argmax a Q (φ (s t ) , a; θ) 9: Execute action a T in emulator and observe reward r e t , r d t and image x t+1 10: Set r t = α * r e t + (1 − α) * r d t + r c t with weight parameter α 11: Set s t+1 = s t ,a t ,x t+1 and process φ t+1 = φ (s t+1 )

13:
Sample random minibatch of transitions φ j , a j , r j , φ j+' from D 14: if Episode terminates at step j + 1 then 15: y j = r j 16: else 17: y j = r j + γmax a Q φ j+1 , a ;θ 18: end if 19: Compute a gradient descent step on y j − Q φ j , a j ; θ 2 with respect to the network parameters θ 20: Every C steps resetQ=Q 21: end for 22: end for Algorithm 2 Threshold-based compression decision if task is decided to offload then 2: if StandByQcap >= tasktodo then compression is not necessary, store the data in the Q 4: else compress and offload the compressed task to edge server 6: end if else if task is decided to process locally then 8: if TDcap >= tasktodo then perform the task at terminal device without any negative reward 10: else perform the task at terminal device 12: a negative reward R capacity (r c t ) is added end if 14: end if

Performance Evaluation
This section presents the simulation setting and the performance evaluation results of the proposed DRL-based dynamic offloading decision-making algorithm. We simulate the values of execution delay and energy consumption for the offloading decision tasks computed by the proposed algorithm in each time step. We calculate the values for the processing based on (10) and (11), and we compare the values with the cases of fully local computing and fully offloading.

Simulation Setting
We assume our simulation setting parameters as follows. Each terminal device has the maximal capacity of 200 (normalized); and also has an internal queue (i.e., standBy Q) which has the maximal capacity of 500. In addition, the capacity of edge server is 10,000. This assumption is reasonable because edge servers are generally powerful comparing to user terminal devices, i.e., order of 100 as used in [35]. The capacities of terminal device queue, stanBy Q, and the edge server are set between maximum capacities at the start of the episode. The episode is terminated when the agent performs 30,000 tasks. In addition, the task which the agent should do is selected between 50 and 350 continuously (i.e., uniform distribution) and, if there are any tasks left at the time step, the remaining tasks are accumulated in each time step. We also set the parameters to calculate the execution delay and energy consumption of the terminal device and edge server, as presented in Table 1. The values of each parameter are based on the 100 task executions and they are the representations of ratios for the local computing values when the execution time and energy consumption are set to 1. We also assume that the terminal device can process 100 tasks at one-time step and also the edge server can process 1250 tasks at the same time. In our DQN-based learning architecture, the input size of the model is set to 4. The input size stands for the number of dimension of observation (TD capacity, Q capacity, ES server capacity, and remaining tasks). The output size is 2, which stands for the number of dimensions of actions (offloading or not).
The assumed values can be updated depending on the development of hardware and computing systems. Despite these parameter updates, our scheme can be superior to the other static schemes. Furthermore, we can conduct more data-intensive simulations with various settings or real-world implementation as one of future research directions.

Simulation Results and Discussions
In this paper, the values of R delay and R energy are no more than 1. In addition, we set the ratio of results by local computing and the OCD algorithm as an indicator of the reward. Furthermore, we set R capacity as a negative reward in order to have 20% of R combi in (6). According to this setting, the maximum value of R combi is 0.8. We can see the rewards per time step in Figure 4. The red graph presents the case where y = 0. In the reward of delay and the reward of energy graph, D dqn and E dqn have the same values with D local and E local when the proposed OCD algorithm decides to take an action for non-offloading, i.e., local computing. As a result, the reward values are to be 0, when the OCD algorithm decides offloading D dqn ; and E dqn has a value other than zero. If the values of delay and energy consumption are less than D local and E local , the rewards are above the y = 0. On the other hand, they are below y = 0 if D dqn and E dqn are greater than the values by local computing. In Figure 4, the first graph is the combination of the below two graphs. It is as shown in Figure 5, and it stands for the reward that the agent receives at every time step. In Figure 5, the red line presents the accumulated reward. Although the OCD algorithm always makes optimal offloading decisions, the reward is not always positive. As a result, the total value of the reward can decrease when the negative reward is accumulated. In particular, there is a constant return of negative rewards from 130 to 150 intervals.
We compare the result of 30,000 tasks performed by OCD algorithm with the results performed by fully local computing (i.e., no offloading) and edge computing. The executed simulation gets results as shown in Figure 6 and Table 2. If there are tasks that are not processed at time step t due to capacity limit, the tasks are accumulated as the tasks to be handled at the next time step. Thus, in the execution delay and energy consumption side, local computing continues to increase the values. Furthermore, from the time step 30, the time to execute tasks is the longest and it reduces TD's energy as maximum. The full offloading has the biggest values at the first time, and it has the smallest growth rate until all the tasks are executed. The reason why the initial values of full offloading is a longer transmission time. The proposed OCD algorithm (DQN-based) has significantly better performance than the local computing by TD. In terms of execution delay, the value of the proposed OCD algorithm (DQN-based) is the lowest until time step 110. However, as the time step goes by, and the task size increases, the results of the proposed OCD algorithm (DQN-based) are better than the results of full offloading. In the energy consumption side, the proposed DQN-based algorithm always has the smallest values. These results show that the proposed OCD algorithm (DQN-based) is the most suitable in real-time wireless communication applications.     Lastly, we conduct the simulations in order to present the novelty of DQN-based algorithm than the other methods. One of the offloading decision algorithms is designed based on well-known greedy algorithms, as discussed in [36]. We remind readers that our proposed BRL-based offloading decisions will be made based on the DQN-based framework under the consideration of our rewards (i.e., combination of delay, energy, and capacity). On the other hand, the greedy algorithm can be used for offloading decision-making based on uploading data sizes (i.e., related to delays), i.e., this algorithm firstly sorts its own data to transmit at first. Then, it transmits the data sequentially.
As shown in Figure 7, this greedy-based algorithm introduces more average delays than our proposed DQN-based algorithm. Note that the average delay is defined as follows when N number of packets are simulated for the average delay calculation: where t Departure i and t Arrival i are the times when the i-th packet is left from the queue and when the i-th packet is arrived at the queue, respectively. As the average data size for offloading increases, the delays increase. In the DQN-based algorithm, optimal control will be made which considers delay, energy, and capacity, during offloading decisions. Thus, the delays are smaller than the greedy-based decision-making. As shown on the right-hand side of Figure 7, our proposed DQN-based decision-making algorithm outperforms the greedy; and the gaps become larger when the average data sizes increase.

Concluding Remarks and Future Work
This paper proposes a novel deep Q-network based offloading decision-making method in mobile edge computing systems. For utilizing real-time and high-quality communications in mobile edge networks, efficient offloading algorithms and task scheduling in MEC servers are essentially required. In this paper, therefore, a new dynamic offloading algorithm is proposed, and the algorithm improves its performance based on DQN. For performance evaluation, the results of our simulation can be changed due to environment settings or reward decision parameters. According to the simulation-based evaluation, we verify that the proposed DQN-based algorithm achieves desired performance.
As future research directions, various factors can be considered in order to conduct more realistic performance evaluation (e.g., channel allocation, channel quality, and mobility of devices) and real-world implementation (e.g., with Raspberry Pi). In addition, more detailed discussions about energy consumption and energy-efficiency are desired. Therefore, we will include this in our future investigation. Moreover, the computational overhead is definitely considerable for our future work because considering the overhead is meaningful in mobile edge computing research. Lastly, the proposed algorithm in this paper runs in a centralized manner. It would be much scalable and practical if the computation can be done in a distributed manner. Therefore, our next work should be dedicated to the design and implementation of fully distributed multi-agent reinforcement learning algorithm which does not require centralized computation.
Author Contributions: S.P., J.K., and S.C. were the main researchers who initiated and organized research reported in the paper, and all authors including S.P., D.K., and J.K. were responsible for analyzing the simulation results and writing the paper. Y.K.L. provided more realistic network setting and its related most suitable algorithm modification. All authors have read and agreed to the published version of the manuscript.