1. Introduction
Multi-access edge computing (MEC) has become an essential architectural paradigm for delivering low-latency, high-quality services by placing computation and storage closer to mobile users. This proximity reduces the round-trip time for data transmission, enhancing responsiveness, which is essential for applications such as augmented reality, real-time gaming, and autonomous driving. However, maintaining consistent service quality in MEC environments is challenging owing to user mobility, which necessitates efficient task migration and proper resource allocation across edge nodes [1].
To address these challenges, proactive migration strategies have been widely investigated. These approaches aim to transfer tasks in advance to nearby edge nodes as users move across coverage areas, thereby minimizing handover delays and potential service disruptions [2,3]. Although such methods are beneficial for preserving service continuity, they often rely on accurate trajectory prediction, which can introduce computational overhead and inaccuracies in dynamic environments [4,5].
Additionally, effective caching mechanisms can support migration by storing frequently requested content locally at edge nodes, thereby reducing load on the core network. By keeping popular content close to users, service relocation can proceed with lower latency, as cached data reduces the need to retrieve information from distant servers. As MEC evolves towards cloud-native architectures [6], application services are increasingly developed as stateless components that avoid local session storage.
Within these architectures, caching plays a central role in managing data locality independently of the service state. This enables stateless services to scale dynamically across edge nodes, thereby supporting efficient and latency-sensitive data access. Caching policies based on content popularity models, particularly those that employ the Zipf distribution, have proven effective at improving cache utilization and alleviating congestion in core networks [7,8].
However, in practical MEC environments, decisions related to service migration and data caching, along with resource allocation, are inherently interdependent, particularly under dynamic user mobility and heterogeneous content demand across users and locations. This interdependence complicates decision-making, as optimizing one component may negatively affect the performance of others. For instance, migration decisions impact resource availability and cache utilization, while caching strategies influence both service placement efficiency and migration overhead. Furthermore, the dynamic nature of MEC systems, including time-varying workloads and fluctuating resource availability, makes it difficult to maintain consistent service performance. In this context, treating these components independently leads to suboptimal system behavior, underscoring the need for coordinated decision-making across multiple system dimensions.
1.1. Motivation
User mobility in MEC environments inherently results in frequent handovers as users transition between service areas. These handovers lead to fluctuating workloads and may cause service interruptions. Rather than explicitly modeling trajectory prediction within the decision-making process, this study assumes access to short-term mobility information regarding the user’s future location. This assumption enables adaptive preparation of edge nodes aiming to reduce handover-induced delays while remaining consistent with practical MEC systems, where user location can be estimated through prediction or tracking mechanisms. This strategy is particularly relevant for applications with strict latency and continuity requirements.
Reinforcement Learning (RL) has been widely used to address service migration and lifecycle-aware management in MEC environments [9]. In particular, studies have considered the joint optimization of system components such as caching, task offloading, and resource allocation [10,11]. However, these approaches typically address such aspects in partially integrated or scenario-specific settings, limiting their ability to fully capture the interdependent nature of decision-making in dynamic MEC environments. This study aims to fill this gap by introducing an adaptive framework that integrates deep reinforcement learning (DRL)-based decisions for service placement and migration, jointly optimized with cache-aware mechanisms under dynamic resource constraints. The framework operates under MEC conditions characterized by user mobility and time-varying content demand.
1.2. Contribution
The key contributions of this study are summarized as follows:
A unified MEC framework is proposed, employing a centralized DRL agent that enables state-aware, joint decision-making for service placement and migration under dynamic system conditions. The framework incorporates resource availability and cache state information (modeled based on content popularity distributions), capturing the interdependence among service location, resource utilization, and data locality.
A structured system-level state representation is introduced that integrates edge node resources, cache status, and user mobility information. This formulation enables cache- and resource-aware policy learning, guiding the agent to select edge nodes with locally available popular content and improve data locality.
The proposed framework is evaluated in a simulated MEC environment with mobile users, demonstrating consistent improvements in service latency, Cache Hit Ratio (CHR), and edge resource utilization across multiple DRL variants (deep Q-network (DQN), double deep Q-network (DDQN), and dueling DDQN (DDDQN)), validating its effectiveness under dynamic operating conditions.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the system model, followed by the problem formulation in Section 4. Section 5 outlines the proposed DRL-based solution. Section 6 presents and analyzes the evaluation results, including a detailed performance analysis, comparative evaluation with baseline methods, and a discussion of practical implications, limitations of the proposed framework, and future research directions. Finally, Section 7 concludes the paper.
2. Related Works
MEC environments require efficient service management strategies that can accommodate user mobility, limited computational resources, and stringent latency constraints. In this context, a substantial body of work has explored RL and its deep variants to enable adaptive decision-making for service placement, migration, and resource allocation.
Existing MEC optimization approaches can be characterized along multiple dimensions. From a system perspective, prior works adopt centralized, distributed, or hierarchical control architectures. Centralized approaches rely on a global system view to make sequential decisions, whereas distributed and multi-agent frameworks improve scalability under partial observability. Hierarchical models [12] decompose the decision process to address complexity, particularly in highly dynamic MEC environments. From a methodological standpoint, DRL-based solutions can be grouped into value-based methods (e.g., DQN and its variants), policy-based or actor–critic approaches, and hybrid formulations that combine discrete and continuous decision spaces. Finally, existing works differ in the scope of optimization, ranging from single-component solutions (e.g., service placement or caching) to partially integrated frameworks that jointly consider multiple aspects of the system.
Early studies focus on RL-based service placement under dynamic conditions. One example is provided by [13], which introduces a Q-learning approach for mobility-aware placement in vehicular networks, capturing user mobility and service demand variability while optimizing latency and resource utilization. Building upon this direction, ref. [14] proposes a DRL-based dynamic service placement framework, in which the problem is formulated as a mixed-integer linear programming (MILP) model and enhanced with a migration conflict resolution mechanism to minimize delay under resource and cost constraints.
Recent studies extend the problem to include service migration and resource-aware decision-making. In particular, Liu et al. [15] formulate the joint service migration and resource allocation (SMRA) problem as a Markov decision process (MDP), combining Long Short-Term Memory (LSTM) for user mobility prediction with Parameterized DQN (PDQN) to handle hybrid discrete–continuous action spaces. This formulation enables simultaneous decisions on service migration and resource allocation under dynamic conditions. Mobility-aware migration has also been addressed using advanced DRL variants. Hua et al. [16] propose a mobility-aware DRL framework enhanced with a Hidden Markov Model (HMM) for vehicle behavior prediction, leveraging a DDDQN architecture to capture stochastic mobility patterns and guide migration decisions. By incorporating probabilistic mobility prediction, the model reduces service latency and energy consumption in highly dynamic Internet of Vehicles (IoV) scenarios. In addition, experimental studies analyze the performance of migration strategies in MEC-assisted 5G-V2X systems, highlighting trade-offs in service continuity, downtime, and state transfer mechanisms under realistic conditions [17].
Related works have focused on the joint optimization of service caching and computation offloading. For example, ref. [10] employs a combination of LSTM and Deep Deterministic Policy Gradient (DDPG) to capture temporal dynamics and optimize offloading and distributed caching decisions in multi-region MEC systems. Building on this direction, hierarchical reinforcement learning (HRL) approaches further extend this idea by optimizing caching, workload offloading, and resource allocation through multi-layer architectures, where lower-level DQN-based policies handle local caching and offloading decisions, while higher-level policies perform system-wide resource allocation and load balancing, accounting for the strong coupling between the two decision layers [18]. Hierarchical DRL (HDRL) frameworks have also been proposed to explicitly address the strong coupling between caching and offloading decisions by decomposing the optimization problem into interrelated service caching and computation offloading processes, following a divide-and-conquer strategy [11]. HDRL has also been investigated in broader MEC settings. For instance, ref. [19] proposes a hierarchical multi-agent DRL framework for wireless-powered MEC systems, in which high-level agents manage energy provisioning and wireless power transfer, while low-level agents optimize computation offloading actions, which are closely related to service execution and placement decisions. Mobility-aware extensions have further enhanced these approaches. Specifically, ref. [20] proposes a distributed service caching framework that integrates LSTM-based mobility prediction with a DDDQN model to optimize caching replacement decisions and reduce communication energy consumption in dynamic edge environments. Despite these advances, such methods primarily focus on specific aspects of MEC optimization, rather than addressing the joint interaction between caching, offloading, and migration within a unified framework.
Beyond service-level optimization, studies also examine infrastructure-oriented aspects of MEC environments, focusing on system-wide deployment and coordination. DRL-based approaches have been proposed for dynamic edge server placement under time-varying conditions and user mobility. For example, ref. [21] formulates the placement problem as a sequential decision process and employs DRL techniques such as DDDQN and Proximal Policy Optimization (PPO) to adaptively determine server locations, improving resource utilization and system efficiency. Actor–critic-based DRL methods, such as DDPG, have also been applied to optimize server deployment to minimize access delay in dynamic MEC scenarios [22].
Despite significant progress in the literature, existing approaches typically address individual aspects of MEC optimization, such as service migration, caching, resource allocation, or infrastructure placement, either in isolation or in partially integrated settings. In practical MEC environments, however, these components are tightly coupled, particularly under dynamic user mobility and heterogeneous content demand across users and locations. While several approaches consider subsets of these components, their interactions are often addressed in a decoupled or limited manner, highlighting the need for unified, user-centric frameworks that jointly model such interactions. In contrast, this work proposes a DRL-based framework that incorporates user mobility, edge resource availability, and cache state within a unified formulation. By jointly optimizing service placement, migration, and cache-aware decision-making, the proposed approach enables adaptive and context-aware service management, targeting reduced latency, improved cache efficiency, and balanced resource utilization under dynamic MEC conditions.
Table 1 provides a structured comparison of existing approaches across key system and methodological dimensions, highlighting the diversity of design choices and the targeted integration of components of the proposed framework.
3. System Model
The MEC system model supports dynamic adaptation to resource constraints and optimizes performance in mobile environments. The proposed architecture incorporates two main components: a resource-aware mechanism for service placement and a content caching model to facilitate low-latency content delivery.
3.1. Resource-Aware Model for Service Placement
The MEC infrastructure comprises a set of N edge nodes, denoted by 𝒩 = {n_1, n_2, …, n_N}, each located at a fixed position and covering a specific service area. Each edge node is equipped with a lightweight base station that enables content retrieval from neighboring nodes, as well as a local storage unit for caching [23].
Figure 1 illustrates the overall system model. As mobile users, such as vehicles or pedestrians, traverse coverage areas, maintaining seamless, low-latency service delivery requires timely service migration across edge nodes. Migration is triggered by user movement because location changes can significantly affect latency. To enhance responsiveness and minimize communication overhead, the placement strategy prioritizes edge nodes that have already cached the requested content locally. This service placement strategy considers the available Central Processing Unit (CPU) and memory resources to avoid overloading individual nodes. The design adheres to ETSI MEC specifications and supports both stateless and stateful migration paradigms [24,25].
3.2. Content Caching Model
Content caching complements proactive migration by enabling edge nodes to store popular content locally, thereby reducing access latency and alleviating network congestion. Each edge node maintains a cache with a fixed capacity C, capable of storing up to C data items. When a user issues a request, the availability of content in the local cache significantly improves response time and enhances service continuity, particularly during handovers.
The caching policy is based on observed access patterns and prioritizes frequently requested content. A Zipf distribution is adopted to model content popularity [26,27], as it effectively captures the heavy-tailed nature of real-world demand, where a small subset of items accounts for the majority of requests. This modeling approach enables realistic evaluation of caching performance and naturally favors the local storage of highly popular content, thereby improving the CHR and reducing service latency.
Assuming a total of M unique content items, the probability p_i of requesting the i-th most popular item is given by

p_i = i^(−a) / Σ_{j=1}^{M} j^(−a),

where a is the Zipf exponent that governs the skewness of the distribution; higher values of a indicate a stronger concentration of requests on a small subset of content. When the cache reaches its capacity limit, items are replaced based on popularity, with less frequently requested content evicted in favor of more popular items. This policy helps maintain a high CHR and minimizes retrieval delays.
Edge nodes prioritize the storage of high-demand content to improve access efficiency and reduce reliance on distant data sources. Less frequently requested items are not retained locally but may still be retrieved through service migration as users move. This design enables adaptation to dynamic user mobility and content demand patterns.
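To make the caching behavior concrete, the following Python sketch (an illustrative implementation, not the authors' code) computes the Zipf popularity profile defined above and applies a popularity-based replacement policy in which a missing item is admitted only when it is more popular than the least popular cached item:

```python
import numpy as np

def zipf_popularity(M, a):
    """Request probabilities p_i = i^(-a) / sum_j j^(-a) for items i = 1..M."""
    ranks = np.arange(1, M + 1, dtype=float)
    weights = ranks ** (-a)
    return weights / weights.sum()

class PopularityCache:
    """Fixed-capacity cache with popularity-based replacement: a missing item
    is admitted only if it is more popular than the least popular cached item."""

    def __init__(self, capacity, popularity):
        self.capacity = capacity
        self.popularity = popularity  # popularity[i - 1] holds p_i
        self.items = set()

    def request(self, item):
        """Return True on a cache hit, False on a miss (with possible eviction)."""
        if item in self.items:
            return True
        if len(self.items) < self.capacity:
            self.items.add(item)
        else:
            victim = min(self.items, key=lambda i: self.popularity[i - 1])
            if self.popularity[item - 1] > self.popularity[victim - 1]:
                self.items.remove(victim)
                self.items.add(item)
        return False
```

With a Zipf exponent above 1, the head of the distribution dominates, so popular items quickly become sticky in the cache while unpopular items are never admitted once the cache is full.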
4. Problem Formulation
MEC environments present significant challenges due to user mobility and limited computational and storage resources at edge nodes. As users move between coverage areas, the quality of service (QoS) may deteriorate, particularly when the distance from the serving edge node increases. Adaptive resource management mechanisms are essential for dynamically repositioning services and managing content availability, thereby maintaining low latency and ensuring efficient service delivery.
The core objective is to maintain service continuity while minimizing latency, maximizing CHR, and ensuring balanced CPU and memory utilization across the edge nodes. The dynamic decision-making problem is formulated as an MDP defined by the tuple (S, A, P, R, γ), where S denotes the state space, A the action space, P the state transition probability, R the reward function, and γ ∈ [0, 1) the discount factor. The discount factor regulates the contribution of future rewards in the Bellman update, encouraging the agent to seek long-term performance gains rather than greedy decisions.
The agent operates on a user-specific state and selects an action corresponding to the placement or migration of a service instance. The associated reward quantifies the impact of this decision on system performance, guiding the agent toward latency reduction and efficient resource utilization. In the proposed formulation, state transitions reflect the evolution of user location, cache status, and resource utilization at edge nodes after each placement decision. As requests are processed sequentially, each action updates the shared system state and influences subsequent decisions. The shared state information corresponds to system-level conditions within the considered MEC region, representing a localized cluster of edge nodes rather than a network-wide deployment.
4.1. State Space
The system state at each time step t is defined by the combined status of the edge nodes and the user position. For every edge node n ∈ 𝒩, the following quantities are included:
Memory and CPU Utilization: Represented as the ratio of used to total capacity.
Cache Utilization: Defined as the percentage of occupied cache.
Edge Node Coordinates: Each node n is associated with a fixed position (x_n, y_n), which is used to estimate latency based on proximity to users.
At each decision step, the system state is represented as a vector

s_t = [c_1, m_1, h_1, x_1, y_1, …, c_N, m_N, h_N, x_N, y_N, x_u, y_u],

where c_n, m_n, and h_n denote the normalized utilization of CPU, memory, and cache at edge node n, and (x_n, y_n) its coordinates. Each user u is represented by its current position (x_u, y_u). As summarized in Table 2, each edge node contributes five attributes, while each user contributes two attributes. This yields a total state dimension of 5N + 2.
The state representation is formulated as an abstracted and aggregated approximation of the system state, rather than a fully observable instantaneous snapshot, enabling a compact, low-dimensional, and scalable design. It reflects a user-centric decision process in which only the requesting user’s position is included, rather than the positions of all users. By combining user-specific context with edge nodes’ conditions within the considered MEC region, the state dimension remains fixed regardless of the number of active users, avoiding any increase in input dimensionality and model complexity. This perspective is consistent with feasible RL frameworks that emphasize practical observability constraints [28].
It is important to note that the state representation does not assume perfect or instantaneous knowledge of all system variables (i.e., edge node resources and the user’s position). In practical MEC deployments, such information is obtained through monitoring and orchestration mechanisms (e.g., telemetry from edge nodes) and may be subject to delays or estimation inaccuracies. Therefore, the state vector should be interpreted as an abstraction of measurable system-level indicators, reflecting observed system conditions rather than exact real-time or predicted knowledge.
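The state construction described in this subsection can be sketched in a few lines of Python (an illustrative encoding; the dictionary field names are hypothetical, and per-node utilizations are assumed to be pre-normalized to [0, 1]):

```python
import numpy as np

def build_state(nodes, user_pos):
    """Assemble the flat state vector: [cpu, mem, cache, x, y] per edge node,
    followed by the requesting user's (x, y) position.

    Only the requesting user's position is included, so the dimension stays
    fixed at 5N + 2 regardless of the number of active users.
    """
    state = []
    for n in nodes:
        state += [n["cpu"], n["mem"], n["cache"], n["x"], n["y"]]
    state += list(user_pos)
    return np.asarray(state, dtype=np.float32)
```

For the six-node scenario evaluated later in the paper, this yields a 32-dimensional input vector (5 × 6 + 2).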
4.2. Action Space
At each decision step, for a given user request, the agent evaluates all available edge nodes as candidate actions and selects an action a_t ∈ 𝒜 = {1, 2, …, N}. Each action corresponds to the placement or migration of a service to a specific edge node n, where n ∈ 𝒩. The action space is not constrained a priori by proximity or resource conditions. Instead, these factors are implicitly captured by the reward function, which guides the agent in selecting edge nodes that offer favorable trade-offs among latency, cache availability, and resource utilization.
4.3. Reward Function
The reward function is designed to jointly capture latency, resource utilization, and cache efficiency (CHR), reflecting their combined impact on service performance in MEC environments. Formally, the reward is defined as a multi-objective function

R = w_1 · R_lat + w_2 · R_cpu + w_3 · R_mem + w_4 · R_cache,

where the weights w_1, …, w_4 control the relative importance of each objective. The individual reward components are defined as follows:
Latency Reward: Latency is modeled as a function of the distance between the user and the selected edge node. Higher latency is penalized as

R_lat = −d(u, n) / L,

where d(u, n) is the distance between user u and the selected edge node n, and L is a normalization factor. In practical MEC systems, end-to-end latency comprises multiple components, including propagation, transmission, queuing, and processing delays. In this work, the distance-based formulation is not intended to explicitly model physical propagation delay. Instead, it serves as a proxy for user-to-edge proximity, which correlates with access network delay, routing overhead, and handover-related effects in realistic deployments. The impact of processing and congestion is captured through the resource utilization terms of the reward formulation, which reflect CPU and memory load at the edge nodes and implicitly account for queuing and execution delays.
CPU and Memory Rewards: Resource over-utilization is penalized quadratically, increasing rapidly as utilization levels approach capacity:

R_cpu = −(c_n)²,   R_mem = −(m_n)²,

where c_n and m_n denote the normalized CPU and memory utilization of the selected node.
Cache Reward: Cache efficiency is encouraged through a binary reward

R_cache = 1 if the requested content is cached at the selected node, and 0 otherwise.
This reward structure promotes the selection of edge nodes that are both close to the user and lightly loaded, while accounting for cache availability. The normalization factor L ensures comparability between latency and the remaining reward components, while the quadratic penalties promote balanced resource utilization across the infrastructure, avoiding bottlenecks and maintaining system responsiveness. As resource utilization evolves dynamically with incoming user requests, the formulation implicitly captures inter-user interactions, as decisions for earlier requests affect the resource availability observed by subsequent ones. Finally, the binary cache reward reflects the significant latency gap between local and remote content retrieval in MEC systems. In practice, cache hits can reduce latency by an order of magnitude compared to fetching content from the core network. The selected reward values are consistent with prior work [7,29], emphasizing the importance of cache awareness in latency-sensitive applications.
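A minimal sketch of the multi-objective reward described above, under the assumptions that node utilizations are normalized to [0, 1], latency is proxied by Euclidean distance, and the weights and normalization factor take illustrative values (the paper's actual scale factors were tuned empirically):

```python
import math

def reward(user_pos, node, cache_hit,
           w_lat=1.0, w_cpu=1.0, w_mem=1.0, w_cache=1.0, L=100.0):
    """Weighted sum of a distance-based latency penalty, quadratic CPU/memory
    over-utilization penalties, and a binary cache-hit bonus."""
    dist = math.dist(user_pos, (node["x"], node["y"]))
    r_lat = -dist / L             # proximity proxy for latency
    r_cpu = -node["cpu"] ** 2     # grows rapidly as utilization nears 1.0
    r_mem = -node["mem"] ** 2
    r_cache = 1.0 if cache_hit else 0.0
    return w_lat * r_lat + w_cpu * r_cpu + w_mem * r_mem + w_cache * r_cache
```

Under this formulation, a nearby, lightly loaded node holding the requested content dominates a distant, saturated node without it, which is exactly the trade-off the agent is expected to learn.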
5. Proposed Solution
Building upon the MDP formulation presented in Section 4, the proposed DRL framework utilizes the defined state, action, and reward components to learn optimal service placement policies. At each decision step, the agent observes the current user-specific state s_t and selects an action corresponding to the placement of the service on a specific edge node. The objective is to learn a policy that maximizes the expected discounted cumulative reward over time. By including both user and edge node coordinates, the agent implicitly captures spatial relationships that affect latency and caching decisions without the need for hardcoded rules.
Decisions are performed sequentially for individual user requests, with the system state being updated after each action. This sequential interaction model ensures that each decision reflects the most recent system conditions, enabling consistent resource allocation under dynamic workloads. The learning process is driven by interactions with the environment, in which the reward function provides feedback on latency, resource utilization, and cache performance. Accordingly, the agent operates on aggregated system-level observations, such as node resource states and user position, rather than requiring full instantaneous observability of all MEC entities.
To handle the dynamic characteristics of MEC environments, we employ three value-based DRL variants: the standard DQN and its extensions, namely DDQN and DDDQN, which combines double Q-learning with a dueling network architecture. The use of value-based methods is motivated by the discrete nature of the action space, where each decision corresponds to selecting an edge node from a finite set of candidates. Although policy-gradient and actor–critic methods can also be applied to discrete action spaces, they often require more careful tuning and may exhibit higher variance during training. In contrast, value-based approaches provide a stable and sample-efficient learning framework for discrete action settings. Therefore, they offer an effective balance between performance, stability, and computational efficiency for the considered MEC scenario [30].
DDQN improves upon standard DQN by mitigating overestimation bias through the decoupling of action selection and action evaluation, leading to more stable and reliable value estimates. Building on this, the DDDQN enhances representation learning by decomposing the Q-value into two components: a state-value function and an action-advantage function. This decomposition allows the model to evaluate the quality of a system state independently of specific actions, supporting improved representation learning and robust performance under varying system conditions. In the dueling architecture, the Q-value is expressed as

Q(s, a) = V(s) + ( A(s, a) − (1/|𝒜|) Σ_{a′} A(s, a′) ),

where V(s) represents the overall value of the state and A(s, a) the relative advantage of selecting action a. This formulation enables the model to capture how system conditions, including resource utilization, cache availability, and latency, influence placement decisions.
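The dueling aggregation of the state-value and advantage streams can be illustrated in a few lines of NumPy; subtracting the mean advantage is the standard identifiability device from the dueling-network literature, ensuring V and A cannot drift by an arbitrary constant:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a):
    Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())
```

Because the mean advantage is subtracted, the average Q-value across actions equals V(s), while the relative ordering of actions is determined entirely by the advantage stream.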
All three DRL variants share a common neural network structure, differing primarily in their value estimation mechanisms. The input layer receives the user-specific state s_t, while the output layer produces one value per action, corresponding to the estimated Q-value for selecting a specific edge node as the service-hosting location. The reward signal provides the feedback required to update the network parameters during training. Each DRL agent is trained using observations of user mobility patterns and service request behavior, enabling the selection of appropriate edge nodes. Unlike static or rule-based approaches, the proposed DRL framework learns placement strategies that adapt to evolving content demand and user mobility.
The decision-making process is summarized in Algorithm 1. For each incoming request, the agent observes the current user-specific state, evaluates the candidate edge nodes, and selects the action that maximizes the expected long-term reward. Depending on cache availability, the request is either served locally or retrieved from the core network, followed by cache update and resource allocation. The agent then receives a reward reflecting the outcome of the decision and updates its policy accordingly. This iterative process enables continuous refinement of placement decisions, maintaining a balance between service quality, resource efficiency, and responsiveness in dynamic MEC environments.
| Algorithm 1 Adaptive Service Placement in Cached MEC using DRL Optimization |
Require: User requests with service demands; edge nodes with resource and cache states
Ensure: Optimized service placement and updated edge node status
1: Initialization: Configure the environment with resource constraints and initialize the DRL agent with a predefined reward structure
2: Preload frequently accessed content to edge node caches based on prior demand
3: for each user request do
4:   Encode the current user-specific state
5:   Evaluate all candidate edge nodes and select the node n* maximizing the expected long-term reward
6:   if the requested content exists in the cache of n* then
7:     Serve the request locally from the cache
8:   else
9:     Retrieve the content from the core network and update the cache of n*
10:  end if
11:  Allocate computing resources on n* according to the service requirements
12:  Compute the reward based on latency, resource utilization, and cache status; update the DRL policy
13: end for
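During training, the node-selection step of Algorithm 1 is typically implemented with an epsilon-greedy rule that balances exploration of under-visited edge nodes against exploitation of the current Q-value estimates. A minimal sketch (illustrative, not the authors' code; the exploration rate is an assumed parameter):

```python
import random

def select_action(q_values, epsilon=0.1):
    """Epsilon-greedy selection over the agent's per-node Q-value estimates:
    with probability epsilon pick a random edge node, otherwise the argmax."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice, epsilon is annealed over episodes so that the agent explores broadly early in training and converges toward greedy placement decisions later.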
The next section presents the experimental results obtained by simulating the proposed approach in a realistic MEC environment with mobile users. Performance metrics are used to compare the behavior of different DRL agents and evaluate the system’s ability to reduce latency, balance resources, and increase CHR.
6. Evaluation, Analysis, and Practical Considerations
In this section, we describe the experimental setup, methodology, and results of evaluating the proposed DDDQN model’s performance in an MEC environment with dynamically moving users. The objective of the experiment is to evaluate the model’s ability to prevent resource saturation at each edge node and to maximize the CHR. Additionally, the model’s ability to select the optimal proximal edge node, using a proximity metric to capture latency, is tested to ensure efficient service relocation under realistic user mobility patterns.
6.1. Environment Setup
The simulation involves a dynamic edge computing environment comprising six edge nodes (ENs) distributed in a grid covering a 100 × 100 area (as shown in Figure 2). Each edge node is equipped with computational and memory resources, enabling it to handle tasks that require powerful processing and data storage. Additionally, each node features a cache system. To ensure a controlled and repeatable experimental setup, we assumed a stable short-range wireless communication model between users and edge nodes, without explicitly simulating channel fading or interference. This abstraction is consistent with urban MEC deployments, in which users often maintain line-of-sight or quasi-static connections within the coverage area of small cells. These assumptions are adopted to provide a controlled evaluation environment, isolating the impact of the DRL-based decision-making mechanism from lower-layer wireless channel variability. Furthermore, all edge nodes were provisioned with identical computational, memory, and caching capabilities. This symmetry across ENs eliminates architectural variability, isolating the learning behavior of the DRL agent and allowing a more straightforward interpretation of cache dynamics, resource constraints, and service relocation under mobility.
In this scenario, we assume a popularity-driven caching policy based on a Zipf distribution to optimize cache utilization by retaining the most frequently requested content. This setup not only improves CHR but also reduces latency by accessing popular items directly from the cache, thereby enhancing overall performance. Two mobile users, initialized within the coverage areas of multiple edge nodes, exhibited varying movement patterns. Each user’s movement followed a random mobility model within set boundaries, simulating simplified mobility behavior within an edge network. Users update their positions within the grid at each time step and generate content requests according to a Zipf distribution. As a result, individual requests may or may not be served by the nearest edge node, resulting in cache hits or misses. The combination of random user mobility and Zipf-distributed content requests introduces continuous variation in user-to-edge proximity, latency conditions, and cache access patterns throughout the simulation. As a result, the DRL agent is exposed to dynamically changing system states during both training and evaluation, reflecting varying resource utilization and service demand conditions. To ensure computational feasibility and an appropriate analysis of agent behavior, we adopted a focused setup of 6 edge nodes and 2 mobile users. Although this scenario enables a clear interpretation of resource dynamics and cache interactions under mobility, we acknowledge that it represents a small-scale environment compared to large-scale real-world MEC deployments. As such, the current study serves as a foundational validation of the DDDQN framework under controlled dynamic conditions. Future work will extend the evaluation to more complex, densely populated edge environments with larger user populations and greater node heterogeneity to more comprehensively assess the system’s scalability.
Although we adopted a random mobility model for its simplicity and analytical tractability, we recognize that user movements in real-world environments often follow more structured or habitual patterns. Therefore, further experimentation with realistic mobility traces (e.g., vehicular trajectories, pedestrian movement logs, or synthetic models such as Gauss-Markov or SLAW) will help validate the robustness of our learning framework under more diverse and realistic user dynamics. Moreover, the current simulation assumes access to user position information, which can be interpreted as an abstraction of a mobility prediction or tracking mechanism. In realistic deployments, such information would be derived from prediction algorithms, and the impact of prediction uncertainty constitutes an important direction for future work.
Our setup distinguishes between stateless and stateful services to account for consistency during task migration between edge nodes. Stateless services are migrated instantly without state synchronization, whereas for stateful services, we assume an eventual-consistency model. Critical session metadata and buffered states are asynchronously replicated between the source and destination ENs during migration. This reflects practical MEC deployments, where strict consistency is often infeasible owing to the latency constraints. Although our simulation does not explicitly model low-level synchronization delays, the overall latency experienced by the user implicitly captures the impact of such transfers.
6.2. DRL Baselines and Parameters
By comparing the DDDQN model with the standard DQN and DDQN baselines, we aim to demonstrate its effectiveness in handling the challenges posed by user movement across multiple edge nodes, including fluctuating distance-related latency, content requests, and resource overutilization. Key performance metrics, including the average cumulative reward per 100 episodes, resource utilization, and cache hits, provide insight into the operational advantages of the model in an edge environment. The scale factors of the reward function were adjusted empirically to balance competing objectives, such as latency reduction, CHR, and resource utilization, in the MEC environment. This iterative approach enables stable training and effective agent behavior.
As shown in Table 3, the parameters for each DQN-based agent were configured to balance computational efficiency, cache hits, and learning stability.
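One such parameter is the exploration rate. As a hedged sketch (the numeric values below are illustrative placeholders, not the actual Table 3 settings), a typical epsilon-greedy schedule decays exploration multiplicatively per episode down to a floor:

```python
def epsilon_at(episode: int, eps_start: float = 1.0,
               eps_min: float = 0.05, decay: float = 0.995) -> float:
    # Multiplicative per-episode decay with a floor; the agent explores
    # heavily at first and exploits the learned policy later.
    return max(eps_min, eps_start * decay ** episode)
```

Schedules of this shape are a common way to trade early exploration of cache/placement actions against stable late-stage behavior.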
To ensure the reproducibility and clarity of the experimental setup, each simulation episode spans 200 time steps. At each time step, user requests are processed sequentially by the centralized agent, and after each decision the system state is updated before the next request is processed. The system observes and acts at a fixed rate of one decision per simulation time step. Training is performed over 1000 episodes, resulting in over 200,000 agent-environment interactions across various user positions and cache states.
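The episodic loop above can be sketched as follows; the callback names (`act`, `step`, `learn`) are our own placeholders for the agent and environment interfaces, and the skeleton simply confirms the interaction count implied by the setup (1000 episodes × 200 steps = 200,000 decisions):

```python
EPISODES, STEPS = 1000, 200  # 1000 training episodes of 200 time steps each

def run_training(act, step, learn) -> int:
    # Skeleton of the episodic loop: one decision per simulation time step,
    # with the system state updated before the next request is processed.
    interactions = 0
    for _ in range(EPISODES):
        for _ in range(STEPS):
            action = act()
            step(action)
            learn()
            interactions += 1
    return interactions
```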
All edge nodes operate within a synchronized, noise-free simulation framework, assuming ideal wireless communication without transmission delays or interference, thereby isolating the DRL agent’s behavior from external variability. While MEC systems in practice may suffer from unstable learning owing to data delays or device limitations, our simulation abstracts the learning process into a centralized, stable training phase. This allowed consistent evaluation across runs and ensured that the collected dataset was sufficient to support convergence under the selected DQN-based models.
6.3. Reward Analysis
The curves in Figure 3 illustrate the averaged reward comparisons across the DQN, DDQN, and DDDQN agents over 1000 episodes, using a moderately skewed Zipf exponent (under which some content is significantly more popular, but not overwhelmingly so) for cache content distribution in the ENs.
Initially, the DQN agent shows a steady increase in reward, which gradually converges as the training progresses. This trend reflects the conservative learning behavior of DQN, which is known to overestimate Q-values, resulting in slower yet stable reward convergence. The DDQN agent demonstrates improved stability and faster convergence compared to the DQN, with rewards reaching a higher level and maintaining a relatively narrow range, reflecting more reliable performance. Meanwhile, the DDDQN agent achieves the highest stability and fastest convergence, consistently maintaining rewards within the optimal range across episodes. This stability suggests that the combined advantages of the dueling architecture and double Q-learning enable the DDDQN to maximize rewards effectively while minimizing volatility. Consequently, DDDQN proves to be the most efficient and reliable model among the three agents in this environment.
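The two mechanisms credited here can be stated compactly. The sketch below (a simplified, framework-free illustration, not the paper's implementation) shows the dueling aggregation, which separates state value from per-action advantage, and the double Q-learning target, which selects the next action with the online network but evaluates it with the target network to curb overestimation:

```python
def dueling_q(value: float, advantages: list) -> list:
    # Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a').
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]

def double_q_target(reward: float, gamma: float, q_online_next: list,
                    q_target_next: list, done: bool) -> float:
    # Double DQN target: the online net picks argmax, the target net scores it.
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]
```

DDDQN combines both: the dueling head shapes the Q-estimates that feed the double-Q target, which is consistent with the faster, lower-variance convergence observed for that agent.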
These results highlight the advantage of combining both dueling and double Q-learning techniques to enhance agent performance in complex, dynamic environments, such as MEC.
6.4. Resource Utilization Comparison
Figure 4 shows the average CPU and memory utilization measured at intervals of 100 episodes over a total of 1000 episodes for the DQN, DDQN, and DDDQN agents across edge nodes. This plot highlights the differences in resource management efficiency among the models, with each demonstrating distinct levels of resource balancing and responsiveness to demand. The interval-based observations offer insight into how each agent maintains resource stability over time, highlighting the relative strengths of DDQN and DDDQN in optimizing resource allocation and mitigating overutilization under dynamic user requests.
In DQN, CPU and memory utilization fluctuate and often approach higher usage levels. This suggests that the DQN model, although responsive, does not consistently balance resource demands among edge nodes, leading to periods of high utilization and potential overloading. The frequent spikes indicate that the DQN may not fully avoid resource saturation, especially under dynamic user demands. The DDQN agent exhibits more stable CPU and memory usage, with fewer instances of near-capacity utilization. This pattern reflects DDQN’s ability to manage resources more effectively than DQN by reducing bias in Q-value overestimation. However, resource levels occasionally approach the upper threshold, suggesting that while improved, the DDQN may still face challenges with load distribution under high demand.
In contrast, the DDDQN agent achieves consistently controlled resource utilization, maintaining both CPU and memory usage within a stable operating range. Notably, utilization remains predominantly below approximately 80% of capacity, indicating that the learned policy effectively avoids resource saturation. This utilization level is not enforced as a strict constraint but serves as a reference point for evaluating the agent’s ability to maintain balanced resource allocation. This behavior reflects the advantage of combining dueling and double Q-learning mechanisms, which enable more accurate value estimation and improved differentiation between state quality and action impact.
6.5. Impact of Zipf-Based Popularity on Cache Efficiency
To investigate the impact of content popularity skewness on caching efficiency (CHR), we vary the Zipf exponent a. The evaluated values span from near-uniform content demand (low a) to highly skewed distributions (high a), with an intermediate value representing a typical case observed in MEC workloads. This variation affects the likelihood of cache hits and misses, allowing us to examine how each DRL agent performs under content access patterns ranging from balanced to highly concentrated popularity profiles.
Figure 5 illustrates the CHR across the different values of a for the three DQN-based models, evaluated after convergence. As a increases, the CHR improves for all models, reflecting the higher concentration of requests on a smaller subset of popular content, which enables more requests to be served directly from the cache. The DDDQN model consistently achieves the highest CHR, followed by DDQN, while DQN exhibits the lowest performance across all values of a. This result indicates that the combined dueling and double Q-learning architecture of DDDQN enables more effective cache-aware decision-making compared to the standard DQN and DDQN approaches.
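The monotonic effect of a on the CHR has a simple analytic counterpart. The sketch below (our own illustration, assuming an idealized cache that always holds the C most popular items) computes the best-case hit ratio under Zipf(a) as the weight of the top C ranks over the total weight:

```python
def ideal_chr(n_items: int, cache_size: int, a: float) -> float:
    # Oracle CHR under Zipf(a): share of request mass covered by the
    # cache_size most popular items, i.e. (sum_{k<=C} k^-a) / (sum_{k<=N} k^-a).
    weights = [k ** -a for k in range(1, n_items + 1)]
    return sum(weights[:cache_size]) / sum(weights)
```

As a grows, request mass concentrates on the head of the popularity ranking, so this upper bound (and hence the achievable CHR of any popularity-aware policy) increases, matching the trend in Figure 5.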
6.6. DRL-Based vs. Baseline Methods and Practical Implications
Finally, to further validate the effectiveness of our DRL-based strategies, we compared them against two baseline service placement methods: (i) Random Placement, where services are assigned to randomly selected ENs regardless of proximity, cache state, or load; and (ii) Proximity-Only, where the nearest EN is selected without cache or resource awareness.
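The two baselines are deliberately simple; a minimal sketch (node naming and coordinate representation are our own assumptions) makes their single-factor nature explicit:

```python
import math
import random

def random_placement(nodes: dict, rng: random.Random) -> str:
    # Baseline (i): pick any EN uniformly at random, ignoring proximity,
    # cache state, and load.
    return rng.choice(sorted(nodes))

def proximity_only(user_pos: tuple, nodes: dict) -> str:
    # Baseline (ii): pick the geometrically nearest EN, with no cache or
    # resource awareness.
    return min(nodes, key=lambda name: math.dist(user_pos, nodes[name]))
```

Neither policy observes cache contents or utilization, which is why both are expected to miss cached content and overload popular nodes under mobility.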
Table 4 summarizes this comparison using four core metrics: average latency, CHR, and average CPU and memory utilization. In addition, we report results for a representative moderately skewed popularity exponent, which allows a direct comparison of latency, CHR, and resource utilization across the baseline and DRL-based strategies.
The results clearly demonstrate the performance gap between the proposed learning-based agents and the traditional baselines. Both random placement and proximity-based selection yield significantly higher latency and lower CHRs, reflecting their inability to adapt to dynamic workloads and content popularity. These baseline methods also incur excessive CPU and memory usage, frequently exceeding 80% utilization, indicating suboptimal resource allocation and recurrent node overloads. In contrast, all DRL-based agents, particularly the DDDQN, exhibit superior decision-making by effectively balancing service latency, CHR, and resource availability. The observed improvements highlight the benefits of intelligent, context-aware placement in MEC environments with user mobility and skewed content popularity.
The considered baselines serve as reference strategies that isolate individual decision criteria, enabling a clearer assessment of the benefits of joint, state-aware optimization. Specifically, random placement ignores all system-level information, whereas proximity-based selection considers only spatial factors and does not account for cache availability or resource constraints. This comparison highlights the limitations of single-factor decision mechanisms and emphasizes the importance of adaptive policies that jointly consider multiple system dimensions.
Beyond performance evaluation, the proposed DRL-based framework also provides insights into its potential deployment in practical MEC environments. In the current formulation, the DRL agent is assumed to operate as a centralized controller within the MEC infrastructure (e.g., an edge orchestrator), with access to system-level information across a localized cluster of edge nodes, consistent with widely adopted MEC management and orchestration frameworks [31]. In practical deployments, the framework can be extended in several directions:
Additional contextual information can be incorporated into the state representation, such as predicted user mobility, temporal variations in workload, or content popularity dynamics.
The centralized decision-making process can be adapted to distributed implementations, where inference is performed locally at edge nodes using partial system observations, enabling scalability and reducing coordination overhead. Such extensions may further support distributed multi-agent learning based on localized observations, facilitating asynchronous service placement across multiple administrative domains.
Furthermore, the proposed approach can be integrated with MEC orchestration mechanisms and mobility management procedures, allowing for coordination with handover events and resource scheduling.
These extensions enable the framework to support adaptive, context-aware service placement in practical MEC systems.
6.7. Limitations and Future Work
Despite these outcomes, the current framework abstracts several control-plane functionalities, including policy enforcement, network-level mobility handling, and orchestration signaling. In operational MEC deployments, procedures such as Xn- or N2-based handovers, interactions with network schedulers, and coordination through network function managers or radio intelligent controller applications play a critical role in ensuring uninterrupted service continuity. In this work, the evaluation focuses on the agent-side inference process, assuming seamless service migration with negligible signaling delay. In addition, although model training is performed offline using episodic simulations, a detailed evaluation of training convergence time, resource footprint, and inference latency under varying workloads remains an open issue.
Future extensions will integrate MEC-aware control-plane functions and standardized orchestration layers, enabling the system to respond to handover events and policy changes in real time. Aligning with 3GPP-compliant mobility management procedures will further support end-to-end coordination between user trajectory prediction, cache placement, and service migration policies [32]. In this context, digital twin-assisted MEC frameworks [33] offer a promising extension by enhancing system observability through real-time monitoring and virtual representations of the network state. By enabling a more accurate representation of network dynamics, such approaches can complement service-level decision-making frameworks and provide DRL agents with more reliable and timely system information, thereby reducing uncertainty and improving decision robustness under user mobility and dynamic resource conditions.
The current experimental setup is based on a small-scale simulated MEC topology comprising six edge nodes and two mobile users. While this setup enables controlled and interpretable benchmarking, it does not fully capture the computational and coordination complexity of large-scale deployments. Nevertheless, the consistent performance gains observed across the different DRL variants suggest that the proposed framework exhibits stable performance under dynamic conditions within the evaluated setup. Future work will extend the evaluation to larger-scale environments and incorporate realistic mobility traces as well as system- and network-level dynamics, including communication delays, task execution variability, and wireless channel effects (e.g., fading, interference, and transmission variability), to better assess deployment feasibility in realistic MEC environments. In addition, comparisons with more advanced service placement and caching strategies, spanning both optimization-based and learning-based approaches, under more unified evaluation settings will enable a more comprehensive and consistent performance assessment.
Although the proposed DRL agents are trained offline and do not affect real-time system operation, the computational cost of training remains an important consideration. In the current implementation, each agent (DQN, DDQN, and DDDQN) converges within approximately 500 training episodes, with a total training time of 35–40 min on an NVIDIA RTX 3080 GPU. Post-training inference requires less than 10 ms per decision step, indicating suitability for real-time execution. Future work will include detailed profiling of training efficiency, runtime behavior under dynamic workloads, and resource footprint at scale, as well as testbed-based implementation and empirical validation to assess the framework under real-time MEC conditions.
7. Conclusions
This study presents a unified framework for service placement in MEC environments, designed to be resource-aware and cache-enabled. The proposed system addresses the challenges posed by user mobility and dynamic content demands by incorporating DRL agents, namely, DQN, DDQN, and DDDQN, to select the most appropriate edge node for each request. Decisions are based on the joint consideration of latency minimization, resource availability, and content popularity, with caching guided by a Zipf-based distribution model.
Unlike previous methods that treat migration and caching as isolated tasks, the proposed approach captures their interdependence within a unified DRL-based decision framework. This perspective enhances service continuity and responsiveness in mobile user scenarios, where traditional placement policies often lead to service interruptions and low CHR. Experimental results in a controlled MEC simulation environment confirm that the proposed approach achieves lower service latency, higher CHR, and improved resource utilization across edge nodes, while also promoting placement stability and limiting unnecessary service relocations. In particular, the DDDQN model consistently outperforms baseline and non-learning policies, validating the effectiveness of the proposed decision-making framework.
While the proposed approach demonstrates consistent performance gains, it is subject to limitations related to the availability and accuracy of system-level information in dynamic MEC environments. Furthermore, although the simulation setup reflects realistic edge infrastructure constraints, it does not fully capture the complexity of large-scale heterogeneous deployments. Scalability of training and inference, as well as model generalizability under diverse conditions, remain open challenges.
Nevertheless, the proposed framework establishes a flexible and adaptive foundation for intelligent service placement in MEC environments, highlighting the potential of DRL-based approaches for managing complex, dynamic edge systems.