A Dynamic Service Placement based on Deep Reinforcement Learning in Mobile Edge Computing

Abstract: Mobile edge computing is an emerging paradigm that supplies computation, storage,


1. Introduction
The evolution of the Internet of Things (IoT) promotes the development of our society, which requires highly scalable infrastructure to adaptively provide proper services for diverse applications [1]. As a promising framework, mobile edge computing (MEC) supports the exponential growth of emerging technologies, such as online interactive games, augmented reality, and real-time monitoring, by pushing computation, storage, and networking resources to the base stations. However, users demand a higher quality of service (QoS), which requires more resources, and it is nontrivial to maintain service performance under the erratic activities of end users and limited capacities. In this paper, we study the service placement problem by minimizing the total delay of multiple users under a long-term cost constraint.

Figure 1 illustrates the dynamic service placement problem and the unique challenges it poses. (i) Since there are no restrictions on the locations of services, deciding where to place them so as to better utilize the computing, communication, and storage resources of the edge servers is nontrivial. For example, suppose that edge server $m_2$ in area 2 has a much higher computing capacity than the others but less storage. When the movement trajectories of users overlap with the areas near $m_2$, the services corresponding to these users are expected to be placed on a server that is close to them and has better performance. However, the available storage capacity of $m_2$ cannot satisfy the requirements of all users. Therefore, it is important to decide how the system handles services that attempt to migrate to a server with high computing capacity but limited storage. (ii) The services serve users one-to-one, and the activities of users are erratic. It is nontrivial to find an efficient strategy that adapts to these erratic movements while minimizing the total delay under the cost constraint. As shown in Figure 1, suppose that users in areas 1, 3, and 4 are on the move at time slot $t$. A simple way to maintain performance is to migrate services so that they follow the users, which yields lower latency. However, frequent service migration brings additional traffic load to the backhaul network and higher operational costs. Therefore, it is challenging to place services so that they adapt dynamically with low latency under a limited cost.

In this paper, we introduce a novel dynamic placement framework based on deep reinforcement learning (DSP-DRL) to optimize the QoS for users under the constraints on physical resources and operational costs. Our contributions can be summarized as follows:

• We investigate the service placement problem in mobile edge computing with multiple users, and we propose to minimize the total delay of users subject to the limits on physical resources and cost.

• We propose a decentralized dynamic placement framework based on deep reinforcement learning (DSP-DRL) that introduces a migration conflict resolution mechanism into the learning process to maintain the service performance for users. We formulate the service placement under migration conflict as a mixed integer linear programming (MILP) problem. We then propose a migration conflict resolution mechanism that avoids invalid states and approximates the policy in the decision module according to the migration feasibility factor.

• Extensive evaluations demonstrate that the proposed dynamic service placement framework outperforms the baselines in terms of efficiency and overall latency.

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 describes the model and formulates the problem. Section 4 presents the dynamic service placement framework based on deep reinforcement learning. Section 5 reports the experiments. Finally, Section 6 concludes the paper.

2. Related work

The concept of mobile edge computing was introduced to extend the cloud paradigm, enabling a new breed of services and applications. It provides a service environment closer to both users and IoT devices by deploying several mobile edge servers. Service placement is a well-investigated problem in mobile edge computing, which advocates providing services at the users' side [2]. Various works have studied different aspects of this problem. A subset of existing work aims at improving utility and reducing operational cost. Ning et al. [3] propose a dynamic storage-stable service placement strategy that uses the Lyapunov optimization method to maximize the system utility while striking a balance between overhead and stability. Pasteris et al. [16] focus on service placement in a heterogeneous mobile edge computing system, and they propose a deterministic approximation algorithm to maximize the total revenue. Chen et al. [22] propose an efficient decentralized algorithm that exploits graph coloring on the small cell network to perform collaborative service placement and optimize the utility of operators. Yu et al. [6] investigate the collaborative service placement problem in mobile edge computing by proposing an efficient decentralized algorithm based on matching theory. They try to minimize the traffic load to achieve high utilization of computing and radio resources. Gu et al. [13] focus on the layer-aware service placement and request scheduling problem, and they design an iterative greedy algorithm by formulating it as an optimization problem with approximate submodularity. In addition, quite a few works optimize the quality of service (QoS). Xu et al. [18] propose a trust-oriented IoT service placement method for smart cities in edge computing, and they try to optimize the execution performance with privacy preservation. Maia et al. [20] formulate the load distribution and placement problem as an integer nonlinear program, and they use a genetic algorithm to minimize the potential violations and improve the QoS. Fu et al. [23] propose a runtime system that effectively deploys user-facing services in the cloud-edge continuum and ensures the QoS by jointly considering communication, contention, and load conditions. However, these works ignore the coupling between service performance and operational cost that is caused by users' erratic movements.

In response to the challenge of user mobility across multiple timescales, some works rely on service migration in mobile edge computing. A few works assume that user mobility follows a Markovian process and apply the Markov Decision Process (MDP) technique [4]. Wang et al. [15] formulate the minimum-cost service migration problem as an MDP and propose a new algorithm for computing the optimal solution that is significantly faster than traditional methods based on standard value or policy iteration. Gao et al. [10] jointly optimize network selection and service placement to improve the QoS by considering switching and communication delay, and they propose an iteration-based algorithm that dynamically places and migrates the services according to the mobility of users. Tao et al. [4] study the mobile edge service performance optimization problem by applying Lyapunov optimization. They design an approximation algorithm based on Markov approximation under a long-term cost budget constraint. Since mobile users move without a priori knowledge, some researchers introduce deep reinforcement learning. Rui et al. [17] propose a novel service migration method based on state adaptation and deep reinforcement learning to overcome network failures, and they use satisfiability modulo theory to solve for the candidate space of migration policies. Liu et al. [24,26] design a reinforcement learning-based framework that uses a deep Q-network for a single-user service migration system and chooses the optimal migration strategy in edge computing. Yuan et al. [12] study the service migration and mobility optimization problem by proposing a two-branch convolution-based deep Q-network to maximize the composite utility. However, these works make decisions by calculating the Q-value of the state and action, which is imprecise because the trajectories of mobile users are uncertain and dynamic over a timescale. Pan et al. [11] develop a novel hierarchical reinforcement pricing method that captures both spatial and temporal dependencies based on the deep deterministic policy gradient (DDPG) [25]. Wei et al. [21] consider a more practice-relevant scenario in which multiple mobile users generally have a small size and can be easily moved around and distributed at different edge servers for processing, and they propose a reinforcement learning-based algorithm that leverages the learning capability of DDPG. However, these works do not take into account resource limitation and migration conflict when multiple users have similar activities.

In this paper, we study the service placement problem under the continuous provisioning scenario in mobile edge computing. Our objective is to minimize the total delay under the physical resource constraints while maintaining service performance under the erratic activities of multiple users.

In this paper, we study the service placement problem in mobile edge computing while jointly considering the QoS of users and the cost of service operators. Our objective is to minimize the total delay of users and maintain performance without violating the constraints on physical resources and operational cost. In this section, we describe the system model and the QoS model, and then formulate the problem.

Consider a substrate of MEC nodes $M = \{m_j\}$ supported by the network operator. Each MEC node is attached to a base station with limited computing and storage capacities, where $R^c_{m_j}$ denotes the computing capacity of $m_j$ and $R^s_{m_j}$ denotes its storage capacity. We use a set $U = \{u_i\}$ to denote the mobile users served by the MEC nodes. The users that subscribe to services from the MEC operators are distributed over the coverage regions of the base stations. To better capture the users' mobility, the system is assumed to operate in a slotted structure, and its timeline is discretized into time slots $t \in \mathcal{T} = \{0, 1, 2, ..., T\}$ [4]. At each time slot, each mobile user sends a service request to an accessible MEC node. We use $V = \{v_h\}$ to denote the set of services supported by the operators. We assume that the services are deployed on virtual machines and that each user is served by only one service on the MEC. To simplify the description, we use colored squares to represent the placed services. Each MEC node has a service range, as shown in Figure 1. We suppose that the capacities of the MEC nodes are heterogeneous and that their service ranges differ; light orange circles of different sizes represent the coverage range of each MEC node.

Let $x^j_{ih}(t) = 1$ denote that user $u_i$ uses service $v_h$ placed on edge server $m_j$ at time slot $t$; otherwise $x^j_{ih}(t) = 0$. For each MEC node, we use $V_{m_j}$ to denote the set of services placed on edge server $m_j$, where $V_{m_j} = \{v_h \mid m_j \leftarrow v_h\}$. We suppose that each service serves only one user at a time, and we use $U(V_{m_j})$ to denote the set of users served by the services in $V_{m_j}$. For convenience, the main notations used throughout this paper are summarized in Table 1.
Table 1. Main notations.

Notation          Description
$V_{m_j}$         Set of services placed on edge server $m_j$.
$U(V_{m_j})$      Set of users served by the services in set $V_{m_j}$.
$x^j_{ih}(t)$     A boolean variable that indicates $v_h$ serving $u_i$ on edge server $m_j$ at time slot $t$.
$A_i(t)$          The amount of computing resource required by the service request of $u_i$ at time slot $t$.
$D^u_{u_i}(t)$    Updating delay of $u_i$ during the dynamic migration.
$b_{u_i,m_j}$     Channel bandwidth of the link between $u_i$ and $m_j$.
$R^c_{m_j}$       The computing capacity of $m_j$.

We use $D^c_{u_i}(t)$ to denote the computing delay of user $u_i$ at time slot $t$. Let $A_i(t)$ denote the amount of computing resource required by the service request of user $u_i$ at time slot $t$. In this paper, we consider that the users on an MEC server share its computing resource evenly [4]. Here, the computing resources are measured by the number of CPU cycles.
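The even-sharing assumption can be made concrete with a small Python sketch. It assumes, as one plausible reading of the model, that the computing delay is the required number of CPU cycles divided by the user's equal share of the server capacity; the function and parameter names are hypothetical and not taken from the paper.

```python
def computing_delay(required_cycles, server_capacity, num_users_on_server):
    """Computing delay of one user under even sharing (illustrative assumption).

    required_cycles: CPU cycles required by the user's request, A_i(t).
    server_capacity: computing capacity of the server in cycles per second, R^c_{m_j}.
    num_users_on_server: number of users served on that server, |U(V_{m_j})|.
    """
    if num_users_on_server <= 0:
        raise ValueError("the server must serve at least one user")
    # Each user receives an equal share of the server's capacity.
    share = server_capacity / num_users_on_server
    return required_cycles / share


# Example: a 2e9-cycle request on a 20 GHz server shared by 4 users.
delay_seconds = computing_delay(2e9, 20e9, 4)
```

Under this reading, adding users to a server increases every co-located user's computing delay linearly, which is one reason placement on an already crowded server can be unattractive.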

The communication delay occurs when the service is not placed in the user's area, and it is determined by the data transmission and the network propagation. The network propagation is determined by the distance $p_{u_i,m_j}(t)$, measured for example in hops [5], between user $u_i$ and the service $v_i$ placed on edge node $m_j$. Let $t_{u_i,m_j}$ denote the maximum transmission rate, which depends on the channel bandwidth $b_{u_i,m_j}$ of the physical link, the transmission power $\tau$ of the local mobile device of $u_i$, the channel gain $g(u_i, m_j)$ between $u_i$ and MEC node $m_j$, where $g(u_i, m_j) = 127 + 30 \cdot \log p_{u_i,m_j}(t)$ [9], and the noise power $N$. The data transmission is determined by the bandwidth $b_{u_i,m_j}$ of the physical link and the data size $d_{u_i}(t)$ of the request as it passes through the network devices between the connected MEC node and the node that provides the service. The communication delay $D^l_{u_i}(t)$ therefore combines these two components.
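As an illustration of how these quantities might fit together, the sketch below uses the Shannon-Hartley capacity formula for the maximum transmission rate. Several points are assumptions rather than statements from the paper: the expression $127 + 30 \cdot \log p$ is treated as a path loss in dB with a base-10 logarithm and the distance in kilometres, the propagation term is modelled as a hop count times a hypothetical per-hop latency, and all function names are illustrative.

```python
import math


def transmission_rate(bandwidth_hz, tx_power_w, distance_km, noise_w):
    """Shannon-Hartley rate in bits per second (assumed form of the rate model)."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    # Assumption: 127 + 30*log10(d) is a path loss in dB, with d in kilometres.
    path_loss_db = 127.0 + 30.0 * math.log10(distance_km)
    linear_gain = 10.0 ** (-path_loss_db / 10.0)
    snr = tx_power_w * linear_gain / noise_w
    # log1p keeps precision when the signal-to-noise ratio is very small.
    return bandwidth_hz * math.log1p(snr) / math.log(2.0)


def communication_delay(data_bits, rate_bps, hops=0, per_hop_delay_s=0.0):
    """Transmission time plus a hop-based propagation term (hypothetical split)."""
    return data_bits / rate_bps + hops * per_hop_delay_s


# The rate falls as the user moves away from the serving node.
near = transmission_rate(0.2e9, 0.5, 0.1, 2e-3)
far = transmission_rate(0.2e9, 0.5, 1.0, 2e-3)
```

The sketch only shows the qualitative behaviour the text relies on: a larger distance lowers the achievable rate, so leaving a service far from a moving user raises the communication delay.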

Due to the mobility of users, it is inefficient to keep the locations of services unchanged all the time, since this increases the communication delay of users. Thus, we optimize the user experience by dynamically migrating the services. We define a boolean variable $\alpha(v_i)$ to denote whether the service $v_i$, which is serving user $u_i$, is in the migration or toggling state. $\Upsilon(v_i)$ is the updating delay of service $v_i$, which includes the transmission of service profiles, the rebooting of software resources, and so on [10]. The updating delay $D^u_{u_i}(t)$ of user $u_i$ is incurred only while $v_i$ is in this state and is then determined by $\Upsilon(v_i)$.
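Given the two quantities just introduced, a natural way to write the updating delay is as the product of the indicator and the per-service updating time. The expression below is a sketch under that assumption, not necessarily the paper's exact equation:

```latex
D^{u}_{u_i}(t) = \alpha(v_i)\,\Upsilon(v_i), \qquad \alpha(v_i) \in \{0, 1\}
```

With this form, $D^{u}_{u_i}(t) = 0$ whenever the service is not being migrated, which agrees with the later remark that the updating delay vanishes when no migration takes place.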

In this paper, we achieve dynamic service placement by minimizing the total delay of multiple mobile users under the physical resource and cost constraints. We suppose that the cost of the dynamic service placement process is produced by the migration of services across edge servers. To satisfy the QoS requirements of users under erratic movement, the services should be migrated dynamically to follow the users' mobility; however, unrestricted migration makes the resulting cost for the operators excessive. Let $\rho$ denote the unit cost of migrating $v_i$, so that the cost accumulates with each migration, and let $\Gamma$ denote the upper bound on the total cost that the operators can afford.

Our objective, given in Equation 6, is to minimize the total delay of the users in set $U$ over a continuous time period. Equations 7 to 9 are the constraints. Equation 7 states the cost constraint: the total cost of the provided services cannot exceed the threshold $\Gamma$.

In this section, we detail our decentralized dynamic service placement framework, which is based on deep reinforcement learning and aims for lower delay under the constraints on physical resources and costs. The framework contains two sets of networks, the main networks and the target networks, each consisting of an actor and a critic. In the main networks, the actor outputs the real-time actions that are executed in the environment, while the critic estimates the action value that is used to update the actor. Both critics output a value, but their inputs differ: the target critic evaluates the action produced by the target actor together with the observed state, whereas the main critic evaluates the action that the actor actually took at that time, as stored in the replay memory. Figure 2 shows the overall architecture of the DSP-DRL framework.

Since the decision-making during dynamic service placement is a stochastic optimization, our framework is built on the deep deterministic policy gradient (DDPG) algorithm [19]. The objective of the agent is to realize dynamic service placement for multiple mobile users while minimizing the total delay. We first summarize the state and action spaces, the reward function, and the state transition policy used in our reinforcement learning framework. To describe the environment of edge servers and mobile users to the agent concisely and correctly, the state space includes the knowledge of the services placed on the edge servers and the status of the users served by these services. To that end, the state is designed as follows.

Definition 1 (State). The state $s_t$ describes the environment of the edge network and is a vector $s_t = [r_t, \hat{u}_t]$. Here, $r_t = (r_1(t), r_2(t), ..., r_j(t), ..., r_m(t))$ is the vector of remaining storage, where $r_j(t)$ denotes the remaining storage on edge server $m_j$, and $\hat{u}_t = (\hat{u}_1(t), \hat{u}_2(t), ..., \hat{u}_i(t), ..., \hat{u}_n(t))$ is the vector of positions on the users' trajectories, where $\hat{u}_i(t)$ denotes the position of $u_i$ at time slot $t$.

We consider that the services on each edge server make decentralized decisions according to the trajectories of mobile users by training the agent. The action $a_t$ encodes the service placement decisions, that is, the migration destinations, at time slot $t$.

Since our problem is an online learning process, the value of the reward does not directly determine the final total delay of the mobile users in each time slot; however, it drives their behavior toward better performance. To realize a decentralized dynamic service placement strategy, we minimize the total delay while completing the processed tasks for the mobile users within a limited migration cost. Thus, the reward function is defined as follows.

Definition 3 (Reward). The reward $r$ is measured by the average delay feedback of the mobile users, compared with the delay obtained when no migration takes place.

Here, $|U|$ denotes the total number of users, $z_{u_i}(t)$ denotes the total delay of user $u_i$ under the decisions made by the deep neural network at time slot $t$, and the reference value is the total delay incurred when the service stays on its original edge server without migration.

For each service, the decisions are made from its own observation of the environment during each episode. However, there is no prior knowledge of the mobile edge computing system, which means that the data sizes and trajectories are unknown to each server. Thus, the process is online and model-free. To maintain the service performance, the migration of services is tightly coupled with the users' activities. Since the users move erratically and autonomously, similar or overlapping trajectories can cause conflicts between multiple services during the learning process. Figure 3 illustrates this problem. Suppose that the activities of users $u_1$, $u_4$, and $u_8$ are all around area 4 (written in red text) at the same time slot. In this case, the chosen services have a high probability of migrating to the same edge server $m_4$ in the learning framework. However, the remaining storage can only accommodate one service, which creates a migration conflict.

We first formulate the service placement under migration conflict as a mixed integer linear programming problem. As shown in the objective function in Equation 6, we aim to minimize the total delay of users over a time-varying period. We use $z_{u_i}(t)$ to denote the total delay of user $u_i$ at time slot $t$, where $z_{u_i}(t) = D^c_{u_i}(t) + D^l_{u_i}(t) + D^u_{u_i}(t)$. When service $v_i$, which is serving $u_i$, migrates successfully, the communication delay $D^l_{u_i}(t)$ decreases; otherwise, $D^u_{u_i}(t) = 0$. Here, $\alpha(v_i)$ is a boolean variable that represents whether service $v_i$ migrates successfully. For the communication delay, the migration result produces two different values. We combine these two cases into a single expression for the communication delay, and the total delay of user $u_i$ is transformed accordingly.

When an edge server has a migration conflict, multiple services choose it as a destination while its storage capacity cannot accommodate all of them. The value of $Z^{m_j}_u(t)$ is divided into two parts: $Z'^{m_j}_u(t)$, which is determined by the services in conflict, and $Z''^{m_j}_u(t)$, which is produced by the placed services. Since the first part $Z'^{m_j}_u(t)$ is fixed, the optimization of $Z^{m_j}_u(t)$ reduces to the optimization of $Z''^{m_j}_u(t)$. Therefore, the problem of minimizing the total delay for the service placement under migration conflicts in time slot $t$ can be formulated as a mixed integer linear programming problem, which has been proven to be NP-hard [7].

We propose a new migration conflict resolution mechanism to avoid invalid states and approximate the policy in the decision module. The mechanism has two main stages: stage one finds the edge servers with conflicting services, and stage two makes migration decisions for the conflicting edge nodes. The details are shown in Algorithm 1. The input is the action $a_t$ at time slot $t$, and the output is the updated action of service placement decisions after conflict resolution for the edge servers. According to the action $a_t$, which is produced during the learning process in time slot $t$, we perform the pre-migration for each service $v_h \in V$. After that, we check the status of each edge server $m_j \in M$ by calculating the total number of services $N_{m_j}$ pre-migrated to $m_j$. We compare $N_{m_j}$ with the storage $R(m_j)$ of the destination server $m_j$. If $N_{m_j} > R(m_j)$, $m_j$ is a conflicting edge server. Otherwise, the migration to $m_j$ is successful. Based on that, we start to make migration decisions. For a conflicting edge server $m_j$, we first build the conflict set $C_{m_j} = \{v^{m_j}_h\}$, which is composed of all the services requesting to migrate to server $m_j$ at the same time. Then, we choose the service $v_h$ in set $C_{m_j}$ with the maximum $\zeta(v_h)$. Here, we introduce a novel definition: the migration feasibility factor.

Definition 4 (Migration feasibility factor). Let $\zeta(v_h)(t)$ denote the migration feasibility factor of service $v_h$ at time slot $t$, with a constant $\varpi > 0$.

We use $D^u_{u_h}(t)$ to denote the migration delay of the service $v_h$ that is serving user $u_h$ at time slot $t$, and $D^l_{u_h}(t)$ to denote the communication delay produced when service $v_h$ is not placed on an edge server within the user's area. These two parameters are negatively correlated: when the service migrates to, or is close to, the user's area, the communication delay $D^l_{u_h}(t)$ drops to zero or is reduced. We use the constant $\varpi > 0$ to adjust this relationship. Therefore, the migration feasibility factor considers the impact of both parameters on the users' delays. In line 9, we update the conflict set with $C_{m_j} = C_{m_j} / v_h$. For the remaining services in set $C_{m_j}$, we migrate each $v_{h'}$ to the nearest edge server that meets the storage resource requirements, as denoted in lines 10 to 12. In line 13, we record the current state of the service placement in $a'_t$ and update the action $a_t = a'_t$. If the storage resources are adequate, there is no conflict, which means that the services migrate to $m_j$ successfully, and we keep the original service placement decisions of action $a_t$ in line 15.

Algorithm 1 Migration conflict resolution
Input: The action $a_t$ at time slot $t$;
Output: The updated action $a_t$ of service placement decisions under the migration conflict edge servers;
1: for each service $v_h \in V$ do
2:   Pre-migration according to $a_t$;
3: end for
4: for each edge server $m_j \in M$ do
5:   Calculate the total number of services $N_{m_j}$ pre-migrated to $m_j$;
6:   if $N_{m_j} > R(m_j)$ then
7:     Construct conflict set $C_{m_j} = \{v^{m_j}_h\}$;
8:     Choose service $v_h$ in $C_{m_j}$ with maximum $\zeta(v_h)$;
9:     Update set $C_{m_j} = C_{m_j} / v_h$;
10:    for each service $v_{h'}$ in set $C_{m_j}$ do
11:      Migrate $v_{h'}$ to the nearest edge server that meets the storage resource;
12:    end for
13:    Record the current placement in $a'_t$ and update $a_t = a'_t$;
14:  else
15:    Keep the original service placement decisions of $a_t$;
16:  end if
17: end for

In this subsection, we propose a dynamic service placement strategy based on deep reinforcement learning. According to the characteristics of the decision-making process, our scheme is built on the deep deterministic policy gradient (DDPG) algorithm. The main idea is to use a deep reinforcement learning agent to perform the dynamic service placement for multiple mobile users so as to minimize the total delay.
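Before the training procedure is detailed, the conflict resolution of Algorithm 1 can be illustrated with a short Python sketch. It is an interpretation under stated assumptions rather than the authors' implementation: storage is counted in whole service slots, the migration feasibility factor is supplied as a precomputed score per service, distances are given in a lookup table, and a service with no feasible destination raises an error. All names are hypothetical.

```python
def resolve_conflicts(action, free_slots, zeta, dist):
    """Resolve migration conflicts in the spirit of Algorithm 1 (illustrative).

    action: dict mapping service -> destination server chosen by the agent.
    free_slots: dict mapping server -> number of services it can still host.
    zeta: dict mapping service -> migration feasibility score (higher wins).
    dist: dict mapping (service, server) -> distance used to pick the nearest server.
    """
    # Lines 1-3: pre-migrate every service to its chosen destination.
    pre_migrated = {}
    for service, server in action.items():
        pre_migrated.setdefault(server, []).append(service)

    updated = dict(action)
    load = {server: len(pre_migrated.get(server, [])) for server in free_slots}

    for server in free_slots:  # Line 4
        conflict_set = list(pre_migrated.get(server, []))
        if len(conflict_set) <= free_slots[server]:
            continue  # Line 15: no conflict, keep the original decisions.
        # Lines 7-9: the service with the largest score keeps the destination.
        winner = max(conflict_set, key=lambda s: zeta[s])
        conflict_set.remove(winner)
        load[server] = 1
        # Lines 10-12: redirect the rest to the nearest server with room.
        for service in conflict_set:
            candidates = [m for m in free_slots
                          if m != server and load[m] < free_slots[m]]
            if not candidates:
                raise ValueError("no server can host service " + str(service))
            target = min(candidates, key=lambda m: dist[(service, m)])
            updated[service] = target
            load[target] += 1
    return updated  # Line 13: the updated action.


action = {"v1": "m4", "v4": "m4", "v8": "m4", "v2": "m1"}
free_slots = {"m1": 2, "m3": 1, "m4": 1}
zeta = {"v1": 0.9, "v4": 0.5, "v8": 0.2, "v2": 0.1}
dist = {("v4", "m1"): 3.0, ("v4", "m3"): 1.0,
        ("v8", "m1"): 2.0, ("v8", "m3"): 1.5}
resolved = resolve_conflicts(action, free_slots, zeta, dist)
```

In this toy instance, three services target $m_4$, which can host only one; the service with the highest score stays, and the other two are redirected to the nearest servers that still have room.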

The specific steps are shown in Algorithm 2. We use the sets of edge nodes $M$, services $V$, and users $U$ as the input. The output is the dynamic service placement scheme $X$. In lines 1 to 3, we initialize the parameters of the reinforcement learning agent, which includes the main network, the target network, and the replay buffer. In line 4, we start to train the agent by running $\kappa$ episodes in our environment. After training for $\kappa$ episodes, each edge server gradually and independently learns to determine the placement strategy of its services (migrating or keeping the original position). We initialize the environmental parameters for edge servers and users and generate an initial state $s_1$ in line 5. The training process in one time period $T$ spans lines 6 to 15. For each time slot, we select an action $a_t = \mu(s_t|\theta^\mu) + \delta_t$ to determine the destination of migration by running the current policy network $\theta^\mu$ with exploration noise $\delta_t$. Since the movements of users are erratic and autonomous, we detect any migration conflicts and resolve them based on Algorithm 1 in line 8. For each user agent, we execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$ from the environment. Then, we store the transition tuple $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $B$ in line 10. In lines 11 to 14, the actor and critic networks of the user agent are updated according to a mini-batch of $I$ transitions sampled from $B$. In line 12, we update the critic network, which takes the state $s_t$ and action $a_t$ as input and outputs the action value [11]. Specifically, the critic approximates the action-value function $Q(s, a|\theta^Q)$ by minimizing the loss function $L$ in Equation 16. In line 13, we update the actor network, which represents the policy parameterized by $\theta^\mu$; it maximizes $J$ by stochastic gradient ascent along the sampled policy gradient $\nabla_{\theta^\mu} J$ in Equation 17. Finally, the target networks are updated in line 14.

Algorithm 2 Dynamic Service Placement based on DRL
Input: Sets of edge nodes $M$, services $V$, and users $U$;
Output: Dynamic service placement scheme $X$;
1: Randomly initialize the actor network $\mu(s|\theta^\mu)$ and critic network $Q(s, a|\theta^Q)$ with weights $\theta^\mu$ and $\theta^Q$;
2: Initialize the target networks with weights $\theta^{\mu'} \leftarrow \theta^\mu$ and $\theta^{Q'} \leftarrow \theta^Q$;
3: Initialize replay buffer $B$;
4: for episode from 1 to $\kappa$ do
5:   Initialize environmental parameters for edge servers and users, and generate an initial state $s_1$;
6:   for each time slot $t$ from 1 to $T$ do
7:     Select an action $a_t = \mu(s_t|\theta^\mu) + \delta_t$ to determine the destination of migration by running the current policy network $\theta^\mu$ and exploration noise $\delta_t$;
8:     Detect migration conflicts and resolve via Algorithm 1;
9:     Execute action $a_t$ of each user agent independently, and observe reward $r_t$ and new state $s_{t+1}$ from the environment;
10:    Store the transition tuple $(s_t, a_t, r_t, s_{t+1})$ into replay buffer $B$;
11:    Randomly sample a mini-batch of $I$ transitions $\{(s_w, a_w, r_w, s'_w)\}$ from replay buffer $B$;
12:    Update the critic network $Q(s, a|\theta^Q)$ by minimizing the loss function $L$ in Equation 16;
13:    Update the actor network $\mu(s|\theta^\mu)$ by using the sampled policy gradient $\nabla_{\theta^\mu} J$ in Equation 17;
14:    Update the target networks;
15:  end for
16: end for
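The update steps in lines 12 and 14 of Algorithm 2 can be sketched in plain Python using the standard DDPG formulation: a temporal-difference target built from the target networks, a mean-squared critic loss, and a soft update of the target weights. The discount factor `gamma` and the soft-update rate `rate` are assumptions introduced here for illustration, and the paper's own equations may differ in detail.

```python
def td_targets(rewards, next_q_values, gamma=0.99):
    """y_w = r_w + gamma * Q'(s'_w, mu'(s'_w)) for each sampled transition.

    next_q_values holds the target critic's value for each next state,
    evaluated at the target actor's action.
    """
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(q_values, targets):
    """Mean squared error between critic outputs and TD targets."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(targets)


def soft_update(target_params, main_params, rate=0.005):
    """Move each target weight a small step toward the main network's weight."""
    return [rate * p + (1.0 - rate) * tp
            for tp, p in zip(target_params, main_params)]


# One illustrative mini-batch of two transitions.
targets = td_targets([1.0, 0.5], [2.0, 1.0], gamma=0.5)
loss = critic_loss([1.0, 1.0], targets)
new_target_weights = soft_update([0.0, 0.0], [1.0, 2.0], rate=0.1)
```

The soft update keeps the target networks changing slowly, which is the usual reason DDPG maintains separate main and target networks.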

In this section, we conduct extensive simulations and experiments to study the dynamic service placement problem for multiple mobile users. We develop a prototype of our framework in Python, which consists of the construction of the edge network and the requests of multiple mobile users. After presenting the datasets and settings, we show the results from different perspectives.

Our prototype is built on a Precision T7910 workstation with an Intel Xeon E5-2620 CPU, an NVIDIA RTX5000 GPU, 128 GB of memory, and a 2 TB hard disk, running Linux and Python. We simulate our edge computing architecture based on the campus of Beijing University of Technology over an area of 500 m × 500 m, and we set up 10 mobile edge servers in the synthetic datasets, as shown in Figure 4. For each server, the computing capacity is set randomly between 20 GHz and 25 GHz. The storage of each server ranges from 8 GB to 16 GB, and the bandwidth between each pair of edge servers is 0.2 GHz. We set the transmission power to tr = 0.5 W and the noise power to $N = 2 \times 10^{-3}$ [9]. To analyze the total delay with different numbers of users, we divide the synthetic dataset into three groups of 20, 30, and 40 users. The data size of the uninterrupted requests sent by users over a continuous timescale is drawn at random from [0.1 GB, 0.5 GB]. The hyperparameter settings are listed in Table 2. In addition to the proposed placement algorithm, two baseline algorithms are used for comparison: dynamic service placement with no migration (DSP-NM) and dynamic service placement with all migration (DSP-AM).
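A synthetic environment with these parameter ranges could be generated along the following lines. This is a minimal sketch: the uniform sampling of positions and capacities, the dictionary layout, and the function name are assumptions made for illustration, whereas the paper places its servers according to the campus layout in Figure 4.

```python
import random


def make_environment(num_servers=10, num_users=20, area_m=500.0, seed=0):
    """Sample edge servers and users within the stated parameter ranges."""
    rng = random.Random(seed)
    servers = [
        {
            # Assumption: positions are uniform over the square area.
            "pos": (rng.uniform(0.0, area_m), rng.uniform(0.0, area_m)),
            "cpu_ghz": rng.uniform(20.0, 25.0),    # computing capacity
            "storage_gb": rng.uniform(8.0, 16.0),  # storage capacity
        }
        for _ in range(num_servers)
    ]
    users = [
        {
            "pos": (rng.uniform(0.0, area_m), rng.uniform(0.0, area_m)),
            "request_gb": rng.uniform(0.1, 0.5),   # data size of a request
        }
        for _ in range(num_users)
    ]
    return servers, users


# The three user groups used in the evaluation.
groups = {n: make_environment(num_users=n, seed=n) for n in (20, 30, 40)}
```

Fixing the seed per group keeps the generated layout reproducible, so the compared algorithms can be run on identical settings.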

We run the three algorithms on different groups, which are divided according to the number of users and the number of trajectories. For each group of users, we collect the results under the same settings.

We investigate the convergence for three groups of mobile users (of size 20, 30, and 40), where each user has 20 trajectories in a timescale. The results are shown in Figures 5 to 7. We use a black dotted line to describe the convergence trend of the delay as the number of iterations increases for each group of users. We have the following observations. (i) For the same group of users with the same trajectories, the delay of users guided by the DSP-DRL framework is far lower than that of the other two comparison algorithms. As shown in Figure 5, the red and yellow lines are the results of DSP-NM and DSP-AM, which are much higher than even the initial delay of DSP-DRL. The relationship between the results of DSP-NM and DSP-AM is influenced by the communication delay and the migration delay, which depend on the size of the users' data and the configuration files of the services. In our experiment, the users send data packets continuously at equal time intervals. Therefore, the communication delay increases sharply when the users move frequently, resulting in a very large delay under DSP-NM. (ii) The number of users influences the convergence. As shown in Figures 5 to 7, convergence slows down as the number of users increases. As shown in Figure 5, the total delay is close to convergence after 250 iterations. However, as shown in Figure 6 and Figure 7, the groups with 30 and 40 users approach convergence after 400 and 420 iterations, respectively. The reason is that more users means correspondingly more services, so the probability of migration conflict increases, which reduces the convergence speed. (iii) The total delay fluctuates within a relatively fixed range for each group of users. Since the edge servers are provisioned relatively densely, there are many overlapping coverage areas that provide multiple choices for users. The learning process of DSP-DRL produces many different placement results, and the total delay generated by these results fluctuates among several relatively fixed values during convergence. Therefore, the fluctuations differ across the three groups of users, which is related to the users' activity trajectories and the placement deviation of services.

Based on the convergence obtained with different groups of users, we assess the average delay among the three groups, as shown in Figure 8 and Figure 9. We have the following observations. (i) The number of users' activity trajectories collected in a timescale affects the total delay. As shown in Figure 9, the highest total delay of 40 users under DSP-NM in the case of 10 trajectories is much lower than in the case of 20 trajectories. (ii) The erratic activities of end users make the delays under the three algorithms quite different. For the users with 10 trajectories, the total delay of 20 users with DSP-NM is lower than with DSP-AM. However, for the groups with 30 and 40 users, the total delay of DSP-NM is higher than that of DSP-AM. For the users with 20 trajectories, the total delays under DSP-NM for all three groups of users are higher than under DSP-AM. In both cases, DSP-DRL always obtains a lower latency for different numbers of users. Compared with DSP-NM and DSP-AM, DSP-DRL reduces the total delay by 41.2% and 32.9% under the constraints in the 10-trajectory case, and by 35.4% and 20.5% in the 20-trajectory case. In summary, DSP-DRL performs better across different scales of users in mobile edge computing.

In this paper, we study the service placement problem under the continuous provisioning scenario in mobile edge computing. We propose a novel dynamic placement framework, DSP-DRL, based on deep reinforcement learning to optimize the total delay without violating the constraints on physical resources and operational costs. In the learning framework, we formulate the service placement under migration conflict as a mixed-integer linear programming (MILP) problem. Based on that, we propose a new migration conflict resolution mechanism that avoids invalid states and approximates the policy in the decision module according to the introduced migration feasibility factor. Finally, we conduct extensive evaluations under various scenarios to demonstrate that our scheme outperforms the compared methods in terms of the delay of users under the constraints on resources and cost in edge computing. For future work, we will investigate dynamic service placement with multiple replications in mobile edge computing, in which the constraints on physical resources and consistency are also taken into consideration.