1. Introduction
Multi-access edge computing (MEC) has become an essential architectural paradigm for delivering low-latency, high-quality services by placing computation and storage closer to mobile users. This proximity reduces the round-trip time for data transmission, enhancing responsiveness, which is essential for applications such as augmented reality, real-time gaming, and autonomous driving. However, maintaining consistent service quality in MEC environments is challenging owing to user mobility, which necessitates efficient task migration and proper resource allocation across edge nodes [1].
To address these challenges, proactive migration strategies have been widely investigated. These approaches aim to transfer tasks in advance to nearby edge nodes as users move across coverage areas, thereby minimizing handover delays and potential service disruptions [2,3]. Although such methods are beneficial for preserving service continuity, they often rely on accurate trajectory prediction, which can introduce computational overhead and inaccuracies in dynamic environments [4,5].
Additionally, effective caching mechanisms can support migration by storing frequently requested content locally at edge nodes, thereby reducing load on the core network. By keeping popular content close to users, service relocation can proceed with lower latency, as cached data reduces the need to retrieve information from distant servers. As MEC evolves towards cloud-native architectures [6], application services are increasingly developed as stateless components that avoid local session storage.
Within these architectures, caching plays a central role in managing data locality independently of the service state. This enables stateless services to scale dynamically across edge nodes, thereby supporting efficient and latency-sensitive data access. Caching policies based on content popularity models, particularly those that employ the Zipf distribution, have proven effective at improving cache utilization and alleviating congestion in core networks [7,8].
However, in practical MEC environments, decisions related to service migration and data caching, along with resource allocation, are inherently interdependent, particularly under dynamic user mobility and heterogeneous content demand across users and locations. This interdependence complicates decision-making, as optimizing one component may negatively affect the performance of others. For instance, migration decisions impact resource availability and cache utilization, while caching strategies influence both service placement efficiency and migration overhead. Furthermore, the dynamic nature of MEC systems, including time-varying workloads and fluctuating resource availability, makes it difficult to maintain consistent service performance. In this context, treating these components independently leads to suboptimal system behavior, underscoring the need for coordinated decision-making across multiple system dimensions.
1.1. Motivation
User mobility in MEC environments inherently results in frequent handovers as users transition between service areas. These handovers lead to fluctuating workloads and may cause service interruptions. Rather than explicitly modeling trajectory prediction within the decision-making process, this study assumes access to short-term mobility information regarding the user’s future location. This assumption enables adaptive preparation of edge nodes aiming to reduce handover-induced delays while remaining consistent with practical MEC systems, where user location can be estimated through prediction or tracking mechanisms. This strategy is particularly relevant for applications with strict latency and continuity requirements.
Reinforcement Learning (RL) has been widely used to address service migration and lifecycle-aware management in MEC environments [9]. In particular, studies have considered the joint optimization of system components such as caching, task offloading, and resource allocation [10,11]. However, these approaches typically address such aspects in partially integrated or scenario-specific settings, limiting their ability to fully capture the interdependent nature of decision-making in dynamic MEC environments. This study aims to fill this gap by introducing an adaptive framework that integrates deep reinforcement learning (DRL)-based decisions for service placement and migration, jointly optimized with cache-aware mechanisms under dynamic resource constraints. The framework operates under MEC conditions characterized by user mobility and time-varying content demand.
1.2. Contribution
The key contributions of this study are summarized as follows:
A unified MEC framework is proposed, employing a centralized DRL agent that enables state-aware, joint decision-making for service placement and migration under dynamic system conditions. The framework incorporates resource availability and cache state information (modeled based on content popularity distributions), capturing the interdependence among service location, resource utilization, and data locality.
A structured system-level state representation is introduced that integrates edge node resources, cache status, and user mobility information. This formulation enables cache- and resource-aware policy learning, guiding the agent to select edge nodes with locally available popular content and improve data locality.
The proposed framework is evaluated in a simulated MEC environment with mobile users, demonstrating consistent improvements in service latency, Cache Hit Ratio (CHR), and edge resource utilization across multiple DRL variants (deep Q-network (DQN), double deep Q-network (DDQN), and dueling DDQN (DDDQN)), validating its effectiveness under dynamic operating conditions.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the system model, followed by the problem formulation in Section 4. Section 5 outlines the proposed DRL-based solution. Section 6 presents and analyzes the evaluation results, including a detailed performance analysis, comparative evaluation with baseline methods, and a discussion of practical implications, limitations of the proposed framework, and future research directions. Finally, Section 7 concludes the paper.
2. Related Works
MEC environments require efficient service management strategies that can accommodate user mobility, limited computational resources, and stringent latency constraints. In this context, a substantial body of work has explored RL and its deep variants to enable adaptive decision-making for service placement, migration, and resource allocation.
Existing MEC optimization approaches can be characterized along multiple dimensions. From a system perspective, prior works adopt centralized, distributed, or hierarchical control architectures. Centralized approaches rely on a global system view to make sequential decisions, whereas distributed and multi-agent frameworks improve scalability under partial observability. Hierarchical models [12] decompose the decision process to address complexity, particularly in highly dynamic MEC environments. From a methodological standpoint, DRL-based solutions can be grouped into value-based methods (e.g., DQN and its variants), policy-based or actor–critic approaches, and hybrid formulations that combine discrete and continuous decision spaces. Finally, existing works differ in the scope of optimization, ranging from single-component solutions (e.g., service placement or caching) to partially integrated frameworks that jointly consider multiple aspects of the system.
Early studies focus on RL-based service placement under dynamic conditions. One example is provided by [13], which introduces a Q-learning approach for mobility-aware placement in vehicular networks, capturing user mobility and service demand variability while optimizing latency and resource utilization. Building upon this direction, ref. [14] proposes a DRL-based dynamic service placement framework, in which the problem is formulated as a mixed-integer linear programming (MILP) model and enhanced with a migration conflict resolution mechanism to minimize delay under resource and cost constraints.
Recent studies extend the problem to include service migration and resource-aware decision-making. In particular, Liu et al. [15] formulate the joint service migration and resource allocation (SMRA) problem as a Markov decision process (MDP), combining Long Short-Term Memory (LSTM) for user mobility prediction with Parameterized DQN (PDQN) to handle hybrid discrete–continuous action spaces. This formulation enables simultaneous decisions on service migration and resource allocation under dynamic conditions. Mobility-aware migration has also been addressed using advanced DRL variants. Hua et al. [16] propose a mobility-aware DRL framework enhanced with a Hidden Markov Model (HMM) for vehicle behavior prediction, leveraging a DDDQN architecture to capture stochastic mobility patterns and guide migration decisions. By incorporating probabilistic mobility prediction, the model reduces service latency and energy consumption in highly dynamic Internet of Vehicles (IoV) scenarios. In addition, experimental studies analyze the performance of migration strategies in MEC-assisted 5G-V2X systems, highlighting trade-offs in service continuity, downtime, and state transfer mechanisms under realistic conditions [17].
Related works have focused on the joint optimization of service caching and computation offloading. For example, ref. [10] employs a combination of LSTM and Deep Deterministic Policy Gradient (DDPG) to capture temporal dynamics and optimize offloading and distributed caching decisions in multi-region MEC systems. Building on this direction, hierarchical reinforcement learning (HRL) approaches further extend this idea by optimizing caching, workload offloading, and resource allocation through multi-layer architectures, where lower-level DQN-based policies handle local caching and offloading decisions, while higher-level policies perform system-wide resource allocation and load balancing, accounting for the strong coupling between the two decision layers [18]. Hierarchical DRL (HDRL) frameworks have also been proposed to explicitly address the strong coupling between caching and offloading decisions by decomposing the optimization problem into interrelated service caching and computation offloading processes, following a divide-and-conquer strategy [11]. HDRL has also been investigated in broader MEC settings. For instance, ref. [19] proposes a hierarchical multi-agent DRL framework for wireless-powered MEC systems, in which high-level agents manage energy provisioning and wireless power transfer, while low-level agents optimize computation offloading actions, which are closely related to service execution and placement decisions. Mobility-aware extensions have further enhanced these approaches. Specifically, ref. [20] proposes a distributed service caching framework that integrates LSTM-based mobility prediction with a DDDQN model to optimize caching replacement decisions and reduce communication energy consumption in dynamic edge environments. Despite these advances, such methods primarily focus on specific aspects of MEC optimization, rather than addressing the joint interaction between caching, offloading, and migration within a unified framework.
Beyond service-level optimization, studies also examine infrastructure-oriented aspects of MEC environments, focusing on system-wide deployment and coordination. DRL-based approaches have been proposed for dynamic edge server placement under time-varying conditions and user mobility. For example, ref. [21] formulates the placement problem as a sequential decision process and employs DRL techniques such as DDDQN and Proximal Policy Optimization (PPO) to adaptively determine server locations, improving resource utilization and system efficiency. Actor–critic-based DRL methods, such as DDPG, have also been applied to optimize server deployment to minimize access delay in dynamic MEC scenarios [22].
Despite significant progress in the literature, existing approaches typically address individual aspects of MEC optimization, such as service migration, caching, resource allocation, or infrastructure placement, either in isolation or in partially integrated settings. In practical MEC environments, however, these components are tightly coupled, particularly under dynamic user mobility and heterogeneous content demand across users and locations. While several approaches consider subsets of these components, their interactions are often addressed in a decoupled or limited manner, highlighting the need for unified, user-centric frameworks that jointly model such interactions. In contrast, this work proposes a DRL-based framework that incorporates user mobility, edge resource availability, and cache state within a unified formulation. By jointly optimizing service placement, migration, and cache-aware decision-making, the proposed approach enables adaptive and context-aware service management, targeting reduced latency, improved cache efficiency, and balanced resource utilization under dynamic MEC conditions.
Table 1 provides a structured comparison of existing approaches across key system and methodological dimensions, highlighting the diversity of design choices and the targeted integration of components of the proposed framework.
3. System Model
The MEC system model supports dynamic adaptation to resource constraints and optimizes performance in mobile environments. The proposed architecture incorporates two main components: a resource-aware mechanism for service placement and a content caching model to facilitate low-latency content delivery.
3.1. Resource-Aware Model for Service Placement
The MEC infrastructure comprises a set of N edge nodes, denoted by 𝒩 = {n_1, n_2, …, n_N}, each located at a fixed position and covering a specific service area. Each edge node is equipped with a lightweight base station that enables content retrieval from neighboring nodes, as well as a local storage unit for caching [23].
Figure 1 illustrates the overall system model. As mobile users, such as vehicles or pedestrians, traverse coverage areas, maintaining seamless, low-latency service delivery requires timely service migration across edge nodes. Migration is triggered by user movement because location changes can significantly affect latency. To enhance responsiveness and minimize communication overhead, the placement strategy prioritizes edge nodes that have already cached the requested content locally. This service placement strategy considers the available Central Processing Unit (CPU) and memory resources to avoid overloading individual nodes. The design adheres to ETSI MEC specifications and supports both stateless and stateful migration paradigms [24,25].
3.2. Content Caching Model
Content caching complements proactive migration by enabling edge nodes to store popular content locally, thereby reducing access latency and alleviating network congestion. Each edge node maintains a cache with a fixed capacity C, capable of storing up to C data items. When a user issues a request, the availability of content in the local cache significantly improves response time and enhances service continuity, particularly during handovers.
The caching policy is based on observed access patterns and prioritizes frequently requested content. A Zipf distribution is adopted to model content popularity [26,27], as it effectively captures the heavy-tailed nature of real-world demand, where a small subset of items accounts for the majority of requests. This modeling approach enables realistic evaluation of caching performance and naturally favors the local storage of highly popular content, thereby improving the CHR and reducing service latency.
Assuming a total of M unique content items, the probability p_i of requesting the i-th most popular item is given by

p_i = i^(−a) / Σ_{j=1}^{M} j^(−a),

where a is the Zipf exponent that governs the skewness of the distribution; higher values of a indicate a stronger concentration of requests on a small subset of content. When the cache reaches its capacity limit, items are replaced based on popularity, with less frequently requested content evicted in favor of more popular items. This policy helps maintain a high CHR and minimizes retrieval delays.
Edge nodes prioritize the storage of high-demand content to improve access efficiency and reduce reliance on distant data sources. Less frequently requested items are not retained locally but may still be retrieved through service migration as users move. This design enables adaptation to dynamic user mobility and content demand patterns.
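To make the caching behavior concrete, the following Python sketch (an illustrative implementation, not the authors' code) computes the Zipf popularity profile defined above and applies a popularity-based replacement policy in which a missing item is admitted only when it is more popular than the least popular cached item:

```python
import numpy as np

def zipf_popularity(M, a):
    """Request probabilities p_i = i^(-a) / sum_j j^(-a) for items i = 1..M."""
    ranks = np.arange(1, M + 1, dtype=float)
    weights = ranks ** (-a)
    return weights / weights.sum()

class PopularityCache:
    """Fixed-capacity cache with popularity-based replacement: a missing item
    is admitted only if it is more popular than the least popular cached item."""

    def __init__(self, capacity, popularity):
        self.capacity = capacity
        self.popularity = popularity  # popularity[i - 1] holds p_i
        self.items = set()

    def request(self, item):
        """Return True on a cache hit, False on a miss (with possible eviction)."""
        if item in self.items:
            return True
        if len(self.items) < self.capacity:
            self.items.add(item)
        else:
            victim = min(self.items, key=lambda i: self.popularity[i - 1])
            if self.popularity[item - 1] > self.popularity[victim - 1]:
                self.items.remove(victim)
                self.items.add(item)
        return False
```

With a Zipf exponent above 1, the head of the distribution dominates, so popular items quickly become sticky in the cache while unpopular items are never admitted once the cache is full.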
4. Problem Formulation
MEC environments present significant challenges due to user mobility and limited computational and storage resources at edge nodes. As users move between coverage areas, the quality of service (QoS) may deteriorate, particularly when the distance from the serving edge node increases. Adaptive resource management mechanisms are essential for dynamically repositioning services and managing content availability, thereby maintaining low latency and ensuring efficient service delivery.
The core objective is to maintain service continuity while minimizing latency, maximizing CHR, and ensuring balanced CPU and memory utilization across the edge nodes. The dynamic decision-making problem is formulated as an MDP defined by the tuple (S, A, P, R, γ), where S denotes the state space, A the action space, P the state transition probability, R the reward function, and γ ∈ [0, 1) the discount factor. The discount factor regulates the contribution of future rewards in the Bellman update, encouraging the agent to seek long-term performance gains rather than greedy decisions.
The agent operates on a user-specific state and selects an action corresponding to the placement or migration of a service instance. The associated reward quantifies the impact of this decision on system performance, guiding the agent toward latency reduction and efficient resource utilization. In the proposed formulation, state transitions reflect the evolution of user location, cache status, and resource utilization at edge nodes after each placement decision. As requests are processed sequentially, each action updates the shared system state and influences subsequent decisions. The shared state information corresponds to system-level conditions within the considered MEC region, representing a localized cluster of edge nodes rather than a network-wide deployment.
4.1. State Space
The system state at each time step t is defined by the combined status of the edge nodes and the user position. For every edge node n ∈ 𝒩, the following quantities are included:
Memory and CPU Utilization: Represented as the ratio of used to total capacity.
Cache Utilization: Defined as the percentage of occupied cache.
Edge Node Coordinates: Each node n is associated with a fixed position (x_n, y_n), which is used to estimate latency based on proximity to users.
At each decision step, the system state is represented as a vector

s_t = [c_1, m_1, h_1, x_1, y_1, …, c_N, m_N, h_N, x_N, y_N, x_u, y_u],

where c_n, m_n, and h_n denote the normalized utilization of CPU, memory, and cache at edge node n, and (x_n, y_n) its coordinates. Each user u is represented by its current position (x_u, y_u). As summarized in Table 2, each edge node contributes five attributes, while each user contributes two attributes. This yields a total state dimension of 5N + 2.
The state representation is formulated as an abstracted and aggregated approximation of the system state, rather than a fully observable instantaneous snapshot, enabling a compact, low-dimensional, and scalable design. It reflects a user-centric decision process in which only the requesting user’s position is included, rather than the positions of all users. By combining user-specific context with edge nodes’ conditions within the considered MEC region, the state dimension remains fixed regardless of the number of active users, avoiding any increase in input dimensionality and model complexity. This perspective is consistent with feasible RL frameworks that emphasize practical observability constraints [28].
It is important to note that the state representation does not assume perfect or instantaneous knowledge of all system variables (i.e., edge node resources and the user’s position). In practical MEC deployments, such information is obtained through monitoring and orchestration mechanisms (e.g., telemetry from edge nodes) and may be subject to delays or estimation inaccuracies. Therefore, the state vector should be interpreted as an abstraction of measurable system-level indicators, reflecting observed system conditions rather than exact real-time or predicted knowledge.
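The state construction described in this subsection can be sketched in a few lines of Python (an illustrative encoding; the dictionary field names are hypothetical, and per-node utilizations are assumed to be pre-normalized to [0, 1]):

```python
import numpy as np

def build_state(nodes, user_pos):
    """Assemble the flat state vector: [cpu, mem, cache, x, y] per edge node,
    followed by the requesting user's (x, y) position.

    Only the requesting user's position is included, so the dimension stays
    fixed at 5N + 2 regardless of the number of active users.
    """
    state = []
    for n in nodes:
        state += [n["cpu"], n["mem"], n["cache"], n["x"], n["y"]]
    state += list(user_pos)
    return np.asarray(state, dtype=np.float32)
```

For the six-node scenario evaluated later in the paper, this yields a 32-dimensional input vector (5 × 6 + 2).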
4.2. Action Space
At each decision step, for a given user request, the agent evaluates all available edge nodes as candidate actions and selects an action a_t ∈ 𝒜 = {1, 2, …, N}. Each action corresponds to the placement or migration of a service to a specific edge node n, where n ∈ 𝒩. The action space is not constrained a priori by proximity or resource conditions. Instead, these factors are implicitly captured by the reward function, which guides the agent in selecting edge nodes that offer favorable trade-offs among latency, cache availability, and resource utilization.
4.3. Reward Function
The reward function is designed to jointly capture latency, resource utilization, and cache efficiency (CHR), reflecting their combined impact on service performance in MEC environments. Formally, the reward is defined as a multi-objective function

R = w_1 · R_lat + w_2 · R_cpu + w_3 · R_mem + w_4 · R_cache,

where the weights w_1, …, w_4 control the relative importance of each objective. The individual reward components are defined as follows:
Latency Reward: Latency is modeled as a function of the distance between the user and the selected edge node. Higher latency is penalized as

R_lat = −d(u, n) / L,

where d(u, n) is the distance between user u and the selected edge node n, and L is a normalization factor. In practical MEC systems, end-to-end latency comprises multiple components, including propagation, transmission, queuing, and processing delays. In this work, the distance-based formulation is not intended to explicitly model physical propagation delay. Instead, it serves as a proxy for user-to-edge proximity, which correlates with access network delay, routing overhead, and handover-related effects in realistic deployments. The impact of processing and congestion is captured through the resource utilization terms of the reward formulation, which reflect CPU and memory load at the edge nodes and implicitly account for queuing and execution delays.
CPU and Memory Rewards: Resource over-utilization is penalized quadratically, increasing rapidly as utilization levels approach capacity:

R_cpu = −(c_n)²,   R_mem = −(m_n)²,

where c_n and m_n denote the normalized CPU and memory utilization of the selected node.
Cache Reward: Cache efficiency is encouraged through a binary reward

R_cache = 1 if the requested content is cached at the selected node, and 0 otherwise.
This reward structure promotes the selection of edge nodes that are both close to the user and lightly loaded, while accounting for cache availability. The normalization factor L ensures comparability between latency and the remaining reward components, while the quadratic penalties promote balanced resource utilization across the infrastructure, avoiding bottlenecks and maintaining system responsiveness. As resource utilization evolves dynamically with incoming user requests, the formulation implicitly captures inter-user interactions, as decisions for earlier requests affect the resource availability observed by subsequent ones. Finally, the binary cache reward reflects the significant latency gap between local and remote content retrieval in MEC systems. In practice, cache hits can reduce latency by an order of magnitude compared to fetching content from the core network. The selected reward values are consistent with prior work [7,29], emphasizing the importance of cache awareness in latency-sensitive applications.
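A minimal sketch of the multi-objective reward described above, under the assumptions that node utilizations are normalized to [0, 1], latency is proxied by Euclidean distance, and the weights and normalization factor take illustrative values (the paper's actual scale factors were tuned empirically):

```python
import math

def reward(user_pos, node, cache_hit,
           w_lat=1.0, w_cpu=1.0, w_mem=1.0, w_cache=1.0, L=100.0):
    """Weighted sum of a distance-based latency penalty, quadratic CPU/memory
    over-utilization penalties, and a binary cache-hit bonus."""
    dist = math.dist(user_pos, (node["x"], node["y"]))
    r_lat = -dist / L             # proximity proxy for latency
    r_cpu = -node["cpu"] ** 2     # grows rapidly as utilization nears 1.0
    r_mem = -node["mem"] ** 2
    r_cache = 1.0 if cache_hit else 0.0
    return w_lat * r_lat + w_cpu * r_cpu + w_mem * r_mem + w_cache * r_cache
```

Under this formulation, a nearby, lightly loaded node holding the requested content dominates a distant, saturated node without it, which is exactly the trade-off the agent is expected to learn.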
5. Proposed Solution
Building upon the MDP formulation presented in Section 4, the proposed DRL framework utilizes the defined state, action, and reward components to learn optimal service placement policies. At each decision step, the agent observes the current user-specific state s_t and selects an action corresponding to the placement of the service on a specific edge node. The objective is to learn a policy that maximizes the expected discounted cumulative reward over time. By including both user and edge node coordinates, the agent implicitly captures spatial relationships that affect latency and caching decisions without the need for hardcoded rules.
Decisions are performed sequentially for individual user requests, with the system state being updated after each action. This sequential interaction model ensures that each decision reflects the most recent system conditions, enabling consistent resource allocation under dynamic workloads. The learning process is driven by interactions with the environment, in which the reward function provides feedback on latency, resource utilization, and cache performance. Accordingly, the agent operates on aggregated system-level observations, such as node resource states and user position, rather than requiring full instantaneous observability of all MEC entities.
To handle the dynamic characteristics of MEC environments, we employ three value-based DRL variants: the standard DQN and its extensions, namely DDQN and DDDQN, which combines double Q-learning with a dueling network architecture. The use of value-based methods is motivated by the discrete nature of the action space, where each decision corresponds to selecting an edge node from a finite set of candidates. Although policy-gradient and actor–critic methods can also be applied to discrete action spaces, they often require more careful tuning and may exhibit higher variance during training. In contrast, value-based approaches provide a stable and sample-efficient learning framework for discrete action settings. Therefore, they offer an effective balance between performance, stability, and computational efficiency for the considered MEC scenario [30].
DDQN improves upon standard DQN by mitigating overestimation bias through the decoupling of action selection and action evaluation, leading to more stable and reliable value estimates. Building on this, the DDDQN enhances representation learning by decomposing the Q-value into two components: a state-value function and an action-advantage function. This decomposition allows the model to evaluate the quality of a system state independently of specific actions, supporting improved representation learning and robust performance under varying system conditions. In the dueling architecture, the Q-value is expressed as

Q(s, a) = V(s) + ( A(s, a) − (1/|𝒜|) Σ_{a′} A(s, a′) ),

where V(s) represents the overall value of the state and A(s, a) the relative advantage of selecting action a. This formulation enables the model to capture how system conditions, including resource utilization, cache availability, and latency, influence placement decisions.
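The dueling aggregation of the state-value and advantage streams can be illustrated in a few lines of NumPy; subtracting the mean advantage is the standard identifiability device from the dueling-network literature, ensuring V and A cannot drift by an arbitrary constant:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a):
    Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())
```

Because the mean advantage is subtracted, the average Q-value across actions equals V(s), while the relative ordering of actions is determined entirely by the advantage stream.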
All three DRL variants share a common neural network structure, differing primarily in their value estimation mechanisms. The input layer receives the user-specific state s_t, while the output layer produces one value per action, corresponding to the estimated Q-value for selecting a specific edge node as the service-hosting location. The reward signal provides the feedback required to update the network parameters during training. Each DRL agent is trained using observations of user mobility patterns and service request behavior, enabling the selection of appropriate edge nodes. Unlike static or rule-based approaches, the proposed DRL framework learns placement strategies that adapt to evolving content demand and user mobility.
The decision-making process is summarized in Algorithm 1. For each incoming request, the agent observes the current user-specific state, evaluates the candidate edge nodes, and selects the action that maximizes the expected long-term reward. Depending on cache availability, the request is either served locally or retrieved from the core network, followed by cache update and resource allocation. The agent then receives a reward reflecting the outcome of the decision and updates its policy accordingly. This iterative process enables continuous refinement of placement decisions, maintaining a balance between service quality, resource efficiency, and responsiveness in dynamic MEC environments.
| Algorithm 1 Adaptive Service Placement in Cached MEC using DRL Optimization |
Require: User requests with service demands; edge nodes with resource and cache states
Ensure: Optimized service placement and updated edge node status
1: Initialization: Configure the environment with resource constraints and initialize the DRL agent with a predefined reward structure
2: Preload frequently accessed content to edge node caches based on prior demand
3: for each user request do
4:   Encode the current user-specific state
5:   Evaluate all candidate edge nodes and select the node n* maximizing the expected long-term reward
6:   if the requested content exists in the cache of n* then
7:     Serve the request locally from the cache
8:   else
9:     Retrieve the content from the core network and update the cache of n*
10:  end if
11:  Allocate computing resources on n* according to the service requirements
12:  Compute the reward based on latency, resource utilization, and cache status; update the DRL policy
13: end for
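During training, the node-selection step of Algorithm 1 is typically implemented with an epsilon-greedy rule that balances exploration of under-visited edge nodes against exploitation of the current Q-value estimates. A minimal sketch (illustrative, not the authors' code; the exploration rate is an assumed parameter):

```python
import random

def select_action(q_values, epsilon=0.1):
    """Epsilon-greedy selection over the agent's per-node Q-value estimates:
    with probability epsilon pick a random edge node, otherwise the argmax."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice, epsilon is annealed over episodes so that the agent explores broadly early in training and converges toward greedy placement decisions later.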
The next section presents the experimental results obtained by simulating the proposed approach in a realistic MEC environment with mobile users. Performance metrics are used to compare the behavior of different DRL agents and evaluate the system’s ability to reduce latency, balance resources, and increase CHR.
6. Evaluation, Analysis, and Practical Considerations
In this section, we describe the experimental setup, methodology, and results of evaluating the proposed DDDQN model’s performance in an MEC environment with dynamically moving users. The objective of the experiment is to evaluate the model’s ability to prevent resource saturation at each edge node and to maximize the CHR. Additionally, the model’s ability to select the optimal proximal edge node, using a proximity metric to capture latency, is tested to ensure efficient service relocation under realistic user mobility patterns.
6.1. Environment Setup
The simulation involves a dynamic edge computing environment comprising six edge nodes (ENs) distributed in a grid covering a 100 × 100 area (as shown in Figure 2). Each edge node is equipped with computational and memory resources, enabling it to handle tasks that require powerful processing and data storage. Additionally, each node features a cache system. To ensure a controlled and repeatable experimental setup, we assumed a stable short-range wireless communication model between users and edge nodes, without explicitly simulating channel fading or interference. This abstraction is consistent with urban MEC deployments, in which users often maintain line-of-sight or quasi-static connections within the coverage area of small cells. These assumptions are adopted to provide a controlled evaluation environment, isolating the impact of the DRL-based decision-making mechanism from lower-layer wireless channel variability. Furthermore, all edge nodes were provisioned with identical computational, memory, and caching capabilities. This symmetry across ENs eliminates architectural variability, isolating the learning behavior of the DRL agent and allowing a more straightforward interpretation of cache dynamics, resource constraints, and service relocation under mobility.
In this scenario, we assume a popularity-driven caching policy based on a Zipf distribution to optimize cache utilization by retaining the most frequently requested content. This setup not only improves CHR but also reduces latency by accessing popular items directly from the cache, thereby enhancing overall performance. Two mobile users, initialized within the coverage areas of multiple edge nodes, exhibited varying movement patterns. Each user’s movement followed a random mobility model within set boundaries, simulating simplified mobility behavior within an edge network. Users update their positions within the grid at each time step and generate content requests according to a Zipf distribution. As a result, individual requests may or may not be served by the nearest edge node, resulting in cache hits or misses. The combination of random user mobility and Zipf-distributed content requests introduces continuous variation in user-to-edge proximity, latency conditions, and cache access patterns throughout the simulation. As a result, the DRL agent is exposed to dynamically changing system states during both training and evaluation, reflecting varying resource utilization and service demand conditions. To ensure computational feasibility and an appropriate analysis of agent behavior, we adopted a focused setup of 6 edge nodes and 2 mobile users. Although this scenario enables a clear interpretation of resource dynamics and cache interactions under mobility, we acknowledge that it represents a small-scale environment compared to large-scale real-world MEC deployments. As such, the current study serves as a foundational validation of the DDDQN framework under controlled dynamic conditions. Future work will extend the evaluation to more complex, densely populated edge environments with larger user populations and greater node heterogeneity to more comprehensively assess the system’s scalability.
Although we adopted a random mobility model for its simplicity and analytical tractability, we recognize that user movements in real-world environments often follow more structured or habitual patterns. Therefore, further experimentation with realistic mobility traces (e.g., vehicular trajectories, pedestrian movement logs, or synthetic models such as Gauss-Markov or SLAW) will help validate the robustness of our learning framework under more diverse and realistic user dynamics. Moreover, the current simulation assumes access to user position information, which can be interpreted as an abstraction of a mobility prediction or tracking mechanism. In realistic deployments, such information would be derived from prediction algorithms, and the impact of prediction uncertainty constitutes an important direction for future work.
Our setup distinguishes between stateless and stateful services to account for consistency during task migration between edge nodes. Stateless services are migrated instantly without state synchronization, whereas for stateful services, we assume an eventual-consistency model. Critical session metadata and buffered states are asynchronously replicated between the source and destination ENs during migration. This reflects practical MEC deployments, where strict consistency is often infeasible owing to the latency constraints. Although our simulation does not explicitly model low-level synchronization delays, the overall latency experienced by the user implicitly captures the impact of such transfers.
6.2. DRL Baselines and Parameters
By comparing the DDDQN model with the standard DQN and DDQN baselines, we aim to demonstrate its effectiveness in handling the challenges posed by user movement across multiple edge nodes, including fluctuating distance-related latency, content requests, and resource overutilization. Key performance metrics, including the average cumulative reward per 100 episodes, resource utilization, and cache hits, provide insight into the operational advantages of the model in an edge environment. The scale factors of the reward function were adjusted empirically to balance competing objectives, such as latency reduction, CHR, and resource utilization, in the MEC environment. This iterative approach enables stable training and effective agent behavior.
As shown in Table 3, the parameters for each DQN-based agent were configured to balance computational efficiency, cache hits, and learning stability.
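One such parameter is the exploration rate. As a hedged sketch (the numeric values below are illustrative placeholders, not the actual Table 3 settings), a typical epsilon-greedy schedule decays exploration multiplicatively per episode down to a floor:

```python
def epsilon_at(episode: int, eps_start: float = 1.0,
               eps_min: float = 0.05, decay: float = 0.995) -> float:
    # Multiplicative per-episode decay with a floor; the agent explores
    # heavily at first and exploits the learned policy later.
    return max(eps_min, eps_start * decay ** episode)
```

Schedules of this shape are a common way to trade early exploration of cache/placement actions against stable late-stage behavior.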
To ensure the reproducibility and clarity of the experimental setup, each simulation episode spans 200 time steps. At each time step, user requests are processed sequentially by the centralized agent, and after each decision the system state is updated before the next request is processed. The system observes and acts at a fixed rate of one decision per simulation time step. Training is performed over 1000 episodes, resulting in over 200,000 agent-environment interactions across various user positions and cache states.
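The episodic loop above can be sketched as follows; the callback names (`act`, `step`, `learn`) are our own placeholders for the agent and environment interfaces, and the skeleton simply confirms the interaction count implied by the setup (1000 episodes × 200 steps = 200,000 decisions):

```python
EPISODES, STEPS = 1000, 200  # 1000 training episodes of 200 time steps each

def run_training(act, step, learn) -> int:
    # Skeleton of the episodic loop: one decision per simulation time step,
    # with the system state updated before the next request is processed.
    interactions = 0
    for _ in range(EPISODES):
        for _ in range(STEPS):
            action = act()
            step(action)
            learn()
            interactions += 1
    return interactions
```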
All edge nodes operate within a synchronized, noise-free simulation framework, assuming ideal wireless communication without transmission delays or interference, thereby isolating the DRL agent’s behavior from external variability. While MEC systems in practice may suffer from unstable learning owing to data delays or device limitations, our simulation abstracts the learning process into a centralized, stable training phase. This allowed consistent evaluation across runs and ensured that the collected dataset was sufficient to support convergence under the selected DQN-based models.
6.3. Reward Analysis
The curves in Figure 3 illustrate the averaged reward comparisons across the DQN, DDQN, and DDDQN agents over 1000 episodes, using a moderately skewed Zipf exponent (under which some content is significantly more popular, but not overwhelmingly so) for cache content distribution in the ENs.
Initially, the DQN agent shows a steady increase in reward, which gradually converges as the training progresses. This trend reflects the conservative learning behavior of DQN, which is known to overestimate Q-values, resulting in slower yet stable reward convergence. The DDQN agent demonstrates improved stability and faster convergence compared to the DQN, with rewards reaching a higher level and maintaining a relatively narrow range, reflecting more reliable performance. Meanwhile, the DDDQN agent achieves the highest stability and fastest convergence, consistently maintaining rewards within the optimal range across episodes. This stability suggests that the combined advantages of the dueling architecture and double Q-learning enable the DDDQN to maximize rewards effectively while minimizing volatility. Consequently, DDDQN proves to be the most efficient and reliable model among the three agents in this environment.
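The two mechanisms credited here can be stated compactly. The sketch below (a simplified, framework-free illustration, not the paper's implementation) shows the dueling aggregation, which separates state value from per-action advantage, and the double Q-learning target, which selects the next action with the online network but evaluates it with the target network to curb overestimation:

```python
def dueling_q(value: float, advantages: list) -> list:
    # Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a').
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]

def double_q_target(reward: float, gamma: float, q_online_next: list,
                    q_target_next: list, done: bool) -> float:
    # Double DQN target: the online net picks argmax, the target net scores it.
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]
```

DDDQN combines both: the dueling head shapes the Q-estimates that feed the double-Q target, which is consistent with the faster, lower-variance convergence observed for that agent.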
These results highlight the advantage of combining both dueling and double Q-learning techniques to enhance agent performance in complex, dynamic environments, such as MEC.
6.4. Resource Utilization Comparison
Figure 4 shows the average CPU and memory utilization measured at intervals of 100 episodes over a total of 1000 episodes for the DQN, DDQN, and DDDQN agents across edge nodes. This plot highlights the differences in resource management efficiency among the models, with each demonstrating distinct levels of resource balancing and responsiveness to demand. The interval-based observations offer insight into how each agent maintains resource stability over time, highlighting the relative strengths of DDQN and DDDQN in optimizing resource allocation and mitigating overutilization under dynamic user requests.
In DQN, CPU and memory utilization fluctuate and often approach higher usage levels. This suggests that the DQN model, although responsive, does not consistently balance resource demands among edge nodes, leading to periods of high utilization and potential overloading. The frequent spikes indicate that the DQN may not fully avoid resource saturation, especially under dynamic user demands. The DDQN agent exhibits more stable CPU and memory usage, with fewer instances of near-capacity utilization. This pattern reflects DDQN’s ability to manage resources more effectively than DQN by reducing bias in Q-value overestimation. However, resource levels occasionally approach the upper threshold, suggesting that while improved, the DDQN may still face challenges with load distribution under high demand.
In contrast, the DDDQN agent achieves consistently controlled resource utilization, maintaining both CPU and memory usage within a stable operating range. Notably, utilization remains predominantly below approximately 80% of capacity, indicating that the learned policy effectively avoids resource saturation. This utilization level is not enforced as a strict constraint but serves as a reference point for evaluating the agent’s ability to maintain balanced resource allocation. This behavior reflects the advantage of combining dueling and double Q-learning mechanisms, which enable more accurate value estimation and improved differentiation between state quality and action impact.
6.5. Impact of Zipf-Based Popularity on Cache Efficiency
To investigate the impact of content popularity skewness on caching efficiency (CHR), we vary the Zipf exponent a. The evaluated values span from near-uniform content demand (low a) to highly skewed distributions (high a), with an intermediate value representing a typical case observed in MEC workloads. This variation affects the likelihood of cache hits and misses, allowing us to examine how each DRL agent performs under content access patterns ranging from balanced to highly concentrated popularity profiles.
Figure 5 illustrates the CHR across the different values of a for the three DQN-based models, evaluated after convergence. As a increases, the CHR improves for all models, reflecting the higher concentration of requests on a smaller subset of popular content, which enables more requests to be served directly from the cache. The DDDQN model consistently achieves the highest CHR, followed by DDQN, while DQN exhibits the lowest performance across all values of a. This result indicates that the combined dueling and double Q-learning architecture of DDDQN enables more effective cache-aware decision-making compared to the standard DQN and DDQN approaches.
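The monotonic effect of a on the CHR has a simple analytic counterpart. The sketch below (our own illustration, assuming an idealized cache that always holds the C most popular items) computes the best-case hit ratio under Zipf(a) as the weight of the top C ranks over the total weight:

```python
def ideal_chr(n_items: int, cache_size: int, a: float) -> float:
    # Oracle CHR under Zipf(a): share of request mass covered by the
    # cache_size most popular items, i.e. (sum_{k<=C} k^-a) / (sum_{k<=N} k^-a).
    weights = [k ** -a for k in range(1, n_items + 1)]
    return sum(weights[:cache_size]) / sum(weights)
```

As a grows, request mass concentrates on the head of the popularity ranking, so this upper bound (and hence the achievable CHR of any popularity-aware policy) increases, matching the trend in Figure 5.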
6.6. DRL-Based vs. Baseline Methods and Practical Implications
Finally, to further validate the effectiveness of our DRL-based strategies, we compared them against two baseline service placement methods: (i) Random Placement, where services are assigned to randomly selected ENs regardless of proximity, cache state, or load; and (ii) Proximity-Only, where the nearest EN is selected without cache or resource awareness.
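The two baselines are deliberately simple; a minimal sketch (node naming and coordinate representation are our own assumptions) makes their single-factor nature explicit:

```python
import math
import random

def random_placement(nodes: dict, rng: random.Random) -> str:
    # Baseline (i): pick any EN uniformly at random, ignoring proximity,
    # cache state, and load.
    return rng.choice(sorted(nodes))

def proximity_only(user_pos: tuple, nodes: dict) -> str:
    # Baseline (ii): pick the geometrically nearest EN, with no cache or
    # resource awareness.
    return min(nodes, key=lambda name: math.dist(user_pos, nodes[name]))
```

Neither policy observes cache contents or utilization, which is why both are expected to miss cached content and overload popular nodes under mobility.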
Table 4 summarizes this comparison using four core metrics: average latency, CHR, and average CPU and memory utilization. In addition, we report results for a representative moderately skewed popularity exponent, which allows a direct comparison of latency, CHR, and resource utilization across the baseline and DRL-based strategies.
The results clearly demonstrate the performance gap between the proposed learning-based agents and the traditional baselines. Both random placement and proximity-based selection yield significantly higher latency and lower CHRs, reflecting their inability to adapt to dynamic workloads and content popularity. These baseline methods also incur excessive CPU and memory usage, frequently exceeding 80% utilization, indicating suboptimal resource allocation and recurrent node overloads. In contrast, all DRL-based agents, particularly the DDDQN, exhibit superior decision-making by effectively balancing service latency, CHR, and resource availability. The observed improvements highlight the benefits of intelligent, context-aware placement in MEC environments with user mobility and skewed content popularity.
The considered baselines serve as reference strategies that isolate individual decision criteria, enabling a clearer assessment of the benefits of joint, state-aware optimization. Specifically, random placement ignores all system-level information, whereas proximity-based selection considers only spatial factors and does not account for cache availability or resource constraints. This comparison highlights the limitations of single-factor decision mechanisms and emphasizes the importance of adaptive policies that jointly consider multiple system dimensions.
Beyond performance evaluation, the proposed DRL-based framework also provides insights into its potential deployment in practical MEC environments. In the current formulation, the DRL agent is assumed to operate as a centralized controller within the MEC infrastructure (e.g., an edge orchestrator), with access to system-level information across a localized cluster of edge nodes, consistent with widely adopted MEC management and orchestration frameworks [31]. In practical deployments, the framework can be extended in several directions:
Additional contextual information can be incorporated into the state representation, such as predicted user mobility, temporal variations in workload, or content popularity dynamics.
The centralized decision-making process can be adapted to distributed implementations, where inference is performed locally at edge nodes using partial system observations, enabling scalability and reducing coordination overhead. Such extensions may further support distributed multi-agent learning based on localized observations, facilitating asynchronous service placement across multiple administrative domains.
Furthermore, the proposed approach can be integrated with MEC orchestration mechanisms and mobility management procedures, allowing for coordination with handover events and resource scheduling.
These extensions enable the framework to support adaptive, context-aware service placement in practical MEC systems.
6.7. Limitations and Future Work
Despite these outcomes, the current framework abstracts several control-plane functionalities, including policy enforcement, network-level mobility handling, and orchestration signaling. In operational MEC deployments, procedures such as Xn- or N2-based handovers, interactions with network schedulers, and coordination through network function managers or radio intelligent controller applications play a critical role in ensuring uninterrupted service continuity. In this work, the evaluation focuses on the agent-side inference process, assuming seamless service migration with negligible signaling delay. In addition, although model training is performed offline using episodic simulations, a detailed evaluation of training convergence time, resource footprint, and inference latency under varying workloads remains an open issue.
Future extensions will integrate MEC-aware control-plane functions and standardized orchestration layers, enabling the system to respond to handover events and policy changes in real time. Aligning with 3GPP-compliant mobility management procedures will further support end-to-end coordination between user trajectory prediction, cache placement, and service migration policies [32]. In this context, digital twin-assisted MEC frameworks [33] offer a promising extension by enhancing system observability through real-time monitoring and virtual representations of the network state. By enabling a more accurate representation of network dynamics, such approaches can complement service-level decision-making frameworks and provide DRL agents with more reliable and timely system information, thereby reducing uncertainty and improving decision robustness under user mobility and dynamic resource conditions.
The current experimental setup is based on a small-scale simulated MEC topology comprising six edge nodes and two mobile users. While this setup enables controlled and interpretable benchmarking, it does not fully capture the computational and coordination complexity of large-scale deployments. Nevertheless, the consistent performance gains observed across the different DRL variants suggest that the proposed framework exhibits stable performance under dynamic conditions within the evaluated setup. Future work will extend the evaluation to larger-scale environments and incorporate realistic mobility traces as well as system- and network-level dynamics, including communication delays, task execution variability, and wireless channel effects (e.g., fading, interference, and transmission variability), to better assess deployment feasibility in realistic MEC environments. In addition, comparisons with more advanced service placement and caching strategies, spanning both optimization-based and learning-based approaches, under more unified evaluation settings will enable a more comprehensive and consistent performance assessment.
Although the proposed DRL agents are trained offline and do not affect real-time system operation, the computational cost of training remains an important consideration. In the current implementation, each agent (DQN, DDQN, and DDDQN) converges within approximately 500 training episodes, with a total training time of 35–40 min on an NVIDIA RTX 3080 GPU. Post-training inference requires less than 10 ms per decision step, indicating suitability for real-time execution. Future work will include detailed profiling of training efficiency, runtime behavior under dynamic workloads, and resource footprint at scale, as well as testbed-based implementation and empirical validation to assess the framework under real-time MEC conditions.
7. Conclusions
This study presents a unified framework for service placement in MEC environments, designed to be resource-aware and cache-enabled. The proposed system addresses the challenges posed by user mobility and dynamic content demands by incorporating DRL agents, namely, DQN, DDQN, and DDDQN, to select the most appropriate edge node for each request. Decisions are based on the joint consideration of latency minimization, resource availability, and content popularity, with caching guided by a Zipf-based distribution model.
Unlike previous methods that treat migration and caching as isolated tasks, the proposed approach captures their interdependence within a unified DRL-based decision framework. This perspective enhances service continuity and responsiveness in mobile user scenarios, where traditional placement policies often lead to service interruptions and low CHR. Experimental results in a controlled MEC simulation environment confirm that the proposed approach achieves lower service latency, higher CHR, and improved resource utilization across edge nodes, while also promoting placement stability and limiting unnecessary service relocations. In particular, the DDDQN model consistently outperforms baseline and non-learning policies, validating the effectiveness of the proposed decision-making framework.
While the proposed approach demonstrates consistent performance gains, it is subject to limitations related to the availability and accuracy of system-level information in dynamic MEC environments. Furthermore, although the simulation setup reflects realistic edge infrastructure constraints, it does not fully capture the complexity of large-scale heterogeneous deployments. Scalability of training and inference, as well as model generalizability under diverse conditions, remain open challenges.
Nevertheless, the proposed framework establishes a flexible and adaptive foundation for intelligent service placement in MEC environments, highlighting the potential of DRL-based approaches for managing complex, dynamic edge systems.