Article

DAPO: Mobility-Aware Joint Optimization of Model Partitioning and Task Offloading for Edge LLM Inference

Hao Feng, Gan Huang, Nian Zhou, Feng Zhang, Yuming Liu, Xiumin Zhou and Junchen Liu
1 School of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
2 Guangxi Institute of Digital Technology, Nanning 530000, China
3 Guangxi Guanyang Digital Technology, Nanning 530000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3929; https://doi.org/10.3390/electronics14193929
Submission received: 14 August 2025 / Revised: 18 September 2025 / Accepted: 1 October 2025 / Published: 3 October 2025
(This article belongs to the Special Issue Towards Efficient and Reliable AI at the Edge)

Abstract

Deploying Large Language Models (LLMs) in edge environments faces two major challenges: (i) the conflict between limited device resources and high computational demands, and (ii) the dynamic impact of user mobility on model partitioning and task offloading decisions. To address these challenges, this paper proposes the Dynamic Adaptive Partitioning and Offloading (DAPO) framework, an intelligent solution for multi-user, multi-edge Mobile Edge Intelligence (MEI) systems. DAPO employs a Deep Deterministic Policy Gradient (DDPG) algorithm to jointly optimize the model partition point and the task offloading destination. By mapping continuous policy outputs onto valid discrete actions, DAPO efficiently addresses the high-dimensional hybrid action space and dynamically adapts to user mobility. Through extensive simulations, we demonstrate that DAPO outperforms baseline strategies and mainstream RL methods, achieving up to 27% lower latency and 18% lower energy consumption compared to PPO and A2C, while maintaining fast convergence and scalability in dynamic mobile environments.

1. Introduction

With the rapid proliferation of Internet of Things (IoT) devices and the continuous evolution of Artificial Intelligence (AI) technologies, the deep integration of edge computing and AI has become a pivotal trend in contemporary information technology [1]. A particularly promising research frontier is the construction of on-device intelligent agents capable of autonomous perception, decision-making, and action, thereby enabling more personalized and context-aware services. In recent years, foundation models with tens of billions of parameters—exemplified by GPT-4, Claude 3, and Llama 3—have begun to migrate from cloud platforms to edge environments, accelerating the shift from centralized cloud intelligence to distributed edge intelligence.
However, this transition introduces new challenges. Due to their high complexity and computational intensity, Large Language Models demand substantial computational resources to ensure high-precision inference [2], posing a severe challenge for mobile devices with limited computational and energy resources. Although edge servers provide relatively abundant computational capabilities [3], offloading the entire inference task to an edge server can lead to a surge in communication overhead, and the resulting communication latency is unacceptable for delay-sensitive applications such as autonomous driving and augmented reality (AR). Consequently, balancing the trade-off between computational resource constraints and communication overhead has become a critical bottleneck that urgently needs to be addressed in collaborative inference between mobile devices and edge servers.
To effectively mitigate this bottleneck, Edge Intelligence (EI) [4,5] has emerged as a promising solution. EI partitions a large AI model into several segments, executing the less computationally demanding parts locally on the mobile device while offloading the computationally intensive parts to an edge server. This device–edge collaborative inference approach enables efficient resource utilization and latency reduction. In this process, the model’s partition point determines the computational load on both the device and the edge server, which in turn affects the overall inference latency and energy consumption. Excessive on-device computation significantly increases energy consumption, whereas excessive offloading to the edge can incur high communication latency. Therefore, dynamically selecting the optimal partition point is crucial for achieving efficient collaborative inference.
While a substantial body of existing research [6,7,8,9,10,11,12,13] focuses on optimizing the selection of the model partition point, these studies often overlook user mobility in Mobile Edge Computing (MEC) scenarios. In practical applications, users are often in constant motion, frequently traversing the coverage areas of different edge servers. This implies that, in addition to the model partitioning decision, it is also necessary to dynamically determine the target edge server for the inference task to ensure its continuity and efficiency. Particularly in mobile scenarios, the challenge lies in how to effectively readjust the model partition point and task offloading strategy when a user moves into the service range of a new edge server to maintain optimal latency and energy performance. This presents a research problem that is both theoretically challenging and practically relevant.
To address the aforementioned issues, this paper focuses on a multi-user, multi-edge server collaborative inference scenario. We propose the Dynamic Adaptive Partitioning and Offloading (DAPO) framework, which dynamically determines the optimal model partition point and edge server destination. By comprehensively considering a mobile device’s resource constraints, energy limitations, and communication latency, DAPO aims to minimize the overall latency and energy overhead of the collaborative inference process. A key challenge is that the simultaneous optimization of partition points and server selections creates a high-dimensional, mixed-action space, leading to the curse of dimensionality. To tackle this, DAPO employs a Deep Deterministic Policy Gradient (DDPG) algorithm, which utilizes an actor network to navigate this complex space by mapping its continuous outputs to discrete decisions.
The main contributions of this article are as follows:
(1)
We formulate the joint optimization problem of model partitioning and multi-server offloading in mobility-aware LLM inference as a Mixed-Integer Nonlinear Programming (MINLP) problem, explicitly modeling user mobility, multi-user competition, and system constraints.
(2)
We design the DAPO framework based on the DDPG algorithm, which maps continuous policy outputs into valid discrete actions, thereby effectively handling the high-dimensional hybrid action space of partitioning and offloading decisions.
(3)
We implement DAPO in a simulation platform and evaluate it against multiple baselines, providing detailed analyses of convergence, scalability, and robustness under varying user densities and bandwidth conditions. The experimental results demonstrate that DAPO significantly outperforms static strategies and mainstream RL baselines in reducing latency and energy consumption.

2. Related Work

In recent years, the deep integration of MEC and AI has emerged as a prominent research area. Extensive research has focused on optimizing inference task offloading and model partitioning strategies to balance performance and overhead on resource-constrained mobile devices.
The existing literature has thoroughly explored various architectures and paradigms for collaborative inference. One prominent approach is vertical collaboration across the device–edge–cloud continuum. For instance, DADS [11] enables efficient collaborative inference in dynamic networks by adaptively partitioning DNNs structured as Directed Acyclic Graphs (DAGs). JointDNN [9] develops a hierarchical optimization engine for device–cloud synergy, facilitating the efficient execution of not only inference, but also training tasks. To enhance flexibility in computation distribution, the framework in [8] introduces an adaptive multi-point partitioning method, allowing for computation tasks to be flexibly deployed on the end device and multiple edge nodes. An alternative paradigm is horizontal collaboration among peer devices. CoopAI [7] enables parallel inference across multiple devices by employing model tiling and pre-fetching mechanisms. MeDNN [14] integrates adaptive partitioning with structured pruning to achieve efficient distributed inference among a group of mobile devices. Furthermore, DeepThings [15] presents an adaptive distributed CNN inference framework tailored for highly resource-constrained IoT edge clusters. To better cope with network volatility, a more recent trend involves the deep integration of model partitioning with dynamic early-exit mechanisms. For example, MAMO [6] achieves efficient collaboration in dynamic networks by jointly optimizing exit point selection, model partitioning, and resource allocation. Likewise, DPDS [10] improves the collaborative efficiency of multi-exit DNNs at the edge through dynamic exit design and fine-grained partitioning. The Edgent [13] framework integrates adaptive model partitioning with early-exit mechanisms to dynamically adjust inference strategies in response to network changes. Moreover, the SPINN [16] system employs device–cloud co-design and progressive inference to dynamically and jointly optimize model partitioning and early-exit policies, ensuring both efficient and robust performance under volatile network conditions. In addition, Ref. [17] has summarized edge intelligence techniques for resource-limited environments, highlighting model compression, compiler optimization, and hardware–software co-design for efficient edge deployment.
To enable these collaborative frameworks to achieve optimal decision-making in complex environments, advanced optimization algorithms and learning methods have been widely applied. In vehicular edge computing, Ref. [18] proposes a digital twin-assisted task offloading framework that integrates clustering with the TD3 algorithm to optimize latency and energy efficiency. Among these, Deep Reinforcement Learning (DRL) has emerged as a mainstream technique for addressing dynamic environments. For instance, DPTO [19] leverages DRL to find a jointly optimal partitioning and offloading strategy for mobile devices that minimizes both energy consumption and latency under dynamic communication conditions. EdgeM [20], on the other hand, combines task offloading with dynamic DNN pruning and utilizes reinforcement learning to adaptively adjust the model execution policy, thereby achieving a co-optimization of latency, accuracy, and energy. To address the challenges of Large Language Model inference, the work in [21] innovatively proposes a model-based reinforcement learning framework to optimize partition points, which efficiently evaluates different partitioning strategies via a surrogate reward model. Beyond DRL, other optimization methods have also been employed in specific scenarios. For example, the IAO algorithm [12] employs iterative alternating optimization to address the problem of DNN partitioning and resource allocation in multi-user resource-constrained environments. Concurrently, some studies have begun to incorporate more specific real-world factors. For instance, MCSA [22] accounts for both user mobility and energy costs, designing the Li-GD and MLi-GD optimization algorithms for stationary and mobile scenarios, respectively. Ref. [23] summarizes optimization techniques, such as pruning and quantization, for intelligent edge computing and highlights applications across domains such as healthcare and energy.
Despite the significant progress achieved by these works, they often overlook the complexity of real-world mobile scenarios or face challenges in handling joint decision-making. In particular, mobility-aware studies such as [19,22] rely on conventional RL algorithms that struggle with inefficient exploration and the curse of dimensionality when jointly optimizing discrete model split points and discrete edge server selections in a high-dimensional hybrid action space. Furthermore, these studies have predominantly focused on conventional models such as CNNs. A unified and efficient end-to-end decision-making framework for optimizing the collaborative inference of Transformer-based LLMs in scenarios with continuous user mobility remains a notable gap. This represents the core challenge that our work aims to address. In addition, we also investigated several non-RL optimization approaches, such as genetic algorithms and heuristic-based search methods, during the early stage of our research. While these approaches can provide feasible solutions in small-scale or static settings, they are essentially offline optimization algorithms that require extensive search and prior knowledge of the environment. As a result, they often incur high computational overhead and fail to adapt to highly dynamic MEC scenarios with user mobility. Our preliminary experiments confirmed that such non-RL approaches yielded inferior performance in terms of latency and scalability compared to RL-based methods. Therefore, we focus on reinforcement learning, which is inherently more suitable for real time, adaptive decision-making in dynamic mobile edge intelligence systems.

3. System Model and Problem Formulation

We consider an MEI system that supports device–edge collaborative inference. In this system, resource-constrained Mobile Devices (MDs), denoted by the set $\mathcal{N} = \{1, 2, \ldots, N\}$, move within the coverage area of a set of Edge Servers (ESs), denoted by $\mathcal{M} = \{1, 2, \ldots, M\}$. Each MD has an independent and computationally intensive inference task to execute. As illustrated in Figure 1, each mobile device can dynamically select an edge server for collaborative inference based on real-time channel conditions and computational resources.
This work focuses on edge inference for large-scale Transformer models (such as Llama or GPT). We assume the model has a total of L layers. For the inference task of MD n, $s_n \in [0, L]$ denotes its model split point, and $a_n \in \mathcal{M}$ denotes its task offloading point. That is, layers 1 to $s_n$ of the model are computed locally on MD n, while layers $s_n + 1$ to $L$ are offloaded to ES $a_n$ for inference computation. For convenience, we introduce a task allocation variable:
$$a_{n,m} = \begin{cases} 1, & a_n = m \\ 0, & a_n \neq m \end{cases}$$
where $a_{n,m} = 1$ means that the last $L - s_n$ layers of MD n's inference task are assigned to ES m for execution.
In this section, we introduce the system model of this paper from four aspects: the user mobility model, the communication model, the computation offloading model, and the problem formulation. Table 1 lists the notations used in this paper.

3.1. User Mobility Model

To capture the dynamic characteristics of the mobile scenario, we adopt the Gauss–Markov (GM) mobility model [24] to describe the user's trajectory, as this model effectively captures both the continuity and randomness of a user's velocity and direction. At a discrete time step t, the velocity $v_n(t)$ and direction $\theta_n(t)$ of MD n are updated as follows:
$$v_n(t) = \epsilon_1 v_n(t-1) + (1 - \epsilon_1)\bar{v} + \sqrt{1 - \epsilon_1^2}\,\Upsilon_n \quad (1)$$
$$\theta_n(t) = \epsilon_2 \theta_n(t-1) + (1 - \epsilon_2)\bar{\theta}_n + \sqrt{1 - \epsilon_2^2}\,\Psi_n \quad (2)$$
where $0 \le \epsilon_1, \epsilon_2 \le 1$ are the memory parameters that adjust the influence of the previous state, $\bar{v}$ is the mean velocity for all MDs, and $\bar{\theta}_n$ is the mean direction for MD n. Specifically, we assume that all MDs share the same mean velocity, but each device has a distinct mean direction. Furthermore, $\Upsilon_n$ and $\Psi_n$ are two independent Gaussian random variables. For the n-th MD, their means and variances are given by $(\bar{\xi}_{v_n}, \varsigma_{v_n}^2)$ and $(\bar{\xi}_{\theta_n}, \varsigma_{\theta_n}^2)$, respectively, which reflect the randomness of motion for different MDs. Here, $\bar{\xi}_{v_n}$ and $\bar{\xi}_{\theta_n}$ denote the mean velocity and mean direction of MD n, while $\varsigma_{v_n}^2$ and $\varsigma_{\theta_n}^2$ represent their variances.
We denote the location of MD n at time t as $\mathrm{Loc}_n(t) = [x_n(t), y_n(t)]$. The position is then updated according to Equations (1) and (2) as follows:
$$x_n(t) = x_n(t-1) + v_n(t-1)\cos(\theta_n(t-1))\,\Delta t \quad (3)$$
$$y_n(t) = y_n(t-1) + v_n(t-1)\sin(\theta_n(t-1))\,\Delta t \quad (4)$$
Based on the updated positions in Equations (3) and (4), the distance between MD n and the stationary ES m at time t is given by
$$d_{n,m}(t) = \left\| \mathrm{Loc}_n(t) - \mathrm{Loc}_m \right\| \quad (5)$$
where $\mathrm{Loc}_n(t)$ is the location of MD n at time t and $\mathrm{Loc}_m$ is the location of ES m.
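For concreteness, the following Python sketch simulates the Gauss–Markov update in Equations (1)–(4) and the distance in Equation (5). The mean velocity, memory factors, and time step follow the settings reported in Section 5, while the noise standard deviations are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch of the Gauss-Markov mobility update (Equations (1)-(4))
# and the MD-ES distance (Equation (5)). Mean velocity (2.0 m/s), memory
# factors (0.8), and time step (1.0 s) follow Section 5; the noise standard
# deviations (0.5 m/s, 0.3 rad) are assumptions for illustration only.
def gauss_markov_step(x, y, v, theta, v_mean=2.0, theta_mean=0.0,
                      eps1=0.8, eps2=0.8, dt=1.0):
    # Velocity and direction retain a memory of the previous state (eps1, eps2)
    # and are perturbed by independent Gaussian terms (Equations (1)-(2)).
    v_next = eps1 * v + (1 - eps1) * v_mean + np.sqrt(1 - eps1 ** 2) * rng.normal(0.0, 0.5)
    th_next = eps2 * theta + (1 - eps2) * theta_mean + np.sqrt(1 - eps2 ** 2) * rng.normal(0.0, 0.3)
    # The position is advanced with the previous velocity and direction (Equations (3)-(4)).
    x_next = x + v * np.cos(theta) * dt
    y_next = y + v * np.sin(theta) * dt
    return x_next, y_next, v_next, th_next

def distance_to_es(x, y, x_es, y_es):
    # Euclidean distance between MD n and a stationary ES m (Equation (5)).
    return np.hypot(x - x_es, y - y_es)
```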

3.2. Communication Model

When MD n offloads its task to ES m, the intermediate data is transmitted over the wireless channel. This paper employs a block fading channel model that considers path loss, small-scale fading, and large-scale shadowing. According to Equation (5), the channel gain between MD n and ES m is modeled as follows [25]:
$$g_{n,m} = \beta_0 d_{n,m}^{-\alpha} |h|^2 \zeta \quad (6)$$
where $\beta_0$ is the path loss constant, $\alpha$ is the path loss exponent, $d_{n,m}$ is the distance between MD n and ES m, $h$ is the Rayleigh fading coefficient, and $\zeta$ is a log-normal random variable representing shadowing.
We assume that all MDs have the same task priority. When multiple MDs simultaneously offload tasks to the same ES m, according to Equations (5) and (6), the uplink transmission rate between MD n and ES m is given by
$$R_{n,m} = \frac{B}{K_m} \log_2\!\left(1 + \frac{p_n \, g_{n,m}}{\frac{B}{K_m} N_0}\right) \quad (7)$$
where $B$ is the bandwidth of the wireless channel, $K_m = \sum_{n=1}^{N} a_{n,m}$ denotes the number of inference tasks offloaded to ES m, $p_n$ is the transmit power of MD n, and $N_0$ is the power spectral density of the complex Gaussian white noise.
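A minimal sketch of the channel and rate model in Equations (6) and (7) is given below. The path-loss constant, path-loss exponent, and shadowing spread are placeholder assumptions, whereas the bandwidth and noise spectral density follow the simulation settings in Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the block-fading channel gain (Equation (6)) and the shared-bandwidth
# uplink rate (Equation (7)). beta0, alpha, and the shadowing spread are placeholder
# assumptions; B = 10 MHz and N0 = -174 dBm/Hz follow the Section 5 settings.
def channel_gain(d, beta0=1e-3, alpha=3.0, shadow_sigma=0.5):
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # Rayleigh fading
    zeta = rng.lognormal(mean=0.0, sigma=shadow_sigma)                     # log-normal shadowing
    return beta0 * d ** (-alpha) * np.abs(h) ** 2 * zeta

def uplink_rate(p_n, g_nm, K_m, B=10e6, N0_dbm_hz=-174.0):
    b_share = B / K_m                                        # bandwidth shared by the K_m offloading MDs
    noise_power = 10 ** (N0_dbm_hz / 10) / 1000 * b_share    # PSD (dBm/Hz) -> noise power in watts
    return b_share * np.log2(1 + p_n * g_nm / noise_power)   # achievable rate in bit/s
```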

3.3. Computation Offloading Model

The Transformer layers of an LLM primarily consist of a self-attention layer and a feed-forward network layer. For an input sequence of length $d_{in}$ and hidden dimension $d_{mid}$, using a multi-head attention mechanism with $\kappa$ heads, the computational workload (in FLOPs) of the j-th layer can be expressed as [21]
$$W_j = 3 d_{in} d_{mid} \kappa + 2 d_{in}^2 d_{mid} \kappa + 9 d_{in} d_{mid}^2 \quad (8)$$
For the inference task on MD n, given a model partition decision at layer $s_n$, the inference latency for executing the first $s_n$ layers locally is given by
$$T_n^{\mathrm{loc}} = \frac{\sum_{j=1}^{s_n} W_j}{F_n} \quad (9)$$
where $F_n$ is the computational capability of MD n.
Similarly, the remote inference latency for executing layers $s_n + 1$ to $L$ on the selected ES m is
$$T_n^{\mathrm{remote}} = \sum_{m=1}^{M} a_{n,m} \frac{\sum_{j=s_n+1}^{L} W_j}{F_{m,n}} \quad (10)$$
where $F_{m,n}$ is the computation resource assigned to MD n by ES m. The total computation resource allocated by ES m cannot exceed its capacity: $\sum_{n=1}^{N} a_{n,m} F_{m,n} \le F_m$.
During collaborative inference between MD n and ES m with split point $s_n$, the intermediate data generated at layer $s_n$ is transmitted to ES m. Let $D_n$ be the amount of intermediate data at layer $s_n$; then, according to Equation (7), the transmission delay is
$$T_{n,m}^{\mathrm{trans}} = \sum_{m=1}^{M} a_{n,m} \frac{D_n}{R_{n,m}} \quad (11)$$
Based on Equations (9)–(11), we define the total inference delay for MD n in collaborative inference as
$$T_n^{\mathrm{tot}} = T_n^{\mathrm{loc}} + T_n^{\mathrm{remote}} + T_{n,m}^{\mathrm{trans}} \quad (12)$$
Correspondingly, the energy consumption for MD n during collaborative inference consists of three components: the local inference energy, the transmission energy for intermediate data, and the edge server's computation energy, given by
$$E_n^{\mathrm{tot}} = E_n^{\mathrm{loc}} + E_{n,m}^{\mathrm{trans}} + E_m^{\mathrm{remote}} = P_n T_n^{\mathrm{loc}} + p_n T_{n,m}^{\mathrm{trans}} + P_m T_n^{\mathrm{remote}} \quad (13)$$
where $P_n$ is the operating power consumption of MD n, $p_n$ is the transmission power consumption of MD n, and $P_m$ is the operating power consumption of ES m.
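The per-layer workload and the resulting latency and energy terms in Equations (8)–(13) can be computed as in the following sketch. It assumes the task is actually offloaded to one ES (full local execution corresponds to $s_n = L$), and the default power figures are illustrative assumptions rather than measured values.

```python
# Sketch of the computation-offloading model: per-layer workload (Equation (8)),
# local/remote/transmission latency (Equations (9)-(11)), and total delay and
# energy (Equations (12)-(13)). Default power figures are placeholder assumptions.
def layer_flops(d_in, d_mid, kappa):
    return 3 * d_in * d_mid * kappa + 2 * d_in ** 2 * d_mid * kappa + 9 * d_in * d_mid ** 2

def inference_cost(s_n, L, d_in, d_mid, kappa, F_n, F_mn, rate, D_n,
                   P_n=2.0, p_n=0.5, P_m=50.0):
    W = layer_flops(d_in, d_mid, kappa)                    # identical Transformer layers assumed
    t_loc = s_n * W / F_n                                  # Equation (9): layers 1..s_n on the MD
    t_remote = (L - s_n) * W / F_mn                        # Equation (10): layers s_n+1..L on the ES
    t_trans = D_n / rate if s_n < L else 0.0               # Equation (11): intermediate-data upload
    t_tot = t_loc + t_remote + t_trans                     # Equation (12)
    e_tot = P_n * t_loc + p_n * t_trans + P_m * t_remote   # Equation (13)
    return t_tot, e_tot
```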

3.4. Problem Formulation

In this work, our objective is to minimize the total inference latency while simultaneously reducing the inference energy consumption of the mobile devices. The objective function and constraints are as follows:
$$\min_{s_n, a_n} \sum_{n=1}^{N} T_n^{\mathrm{tot}} \quad (14)$$
subject to
$$E_n^{\mathrm{tot}} \le E_n^{\mathrm{all\_local}}, \quad \forall n \in \mathcal{N} \quad (15)$$
$$T_n^{\mathrm{tot}} \le T_n^{\mathrm{all\_local}}, \quad \forall n \in \mathcal{N} \quad (16)$$
$$0 \le s_n \le L, \ s_n \in \mathbb{Z}, \quad \forall n \in \mathcal{N} \quad (17)$$
$$a_n \in \{0, 1, \ldots, M\}, \ a_n \in \mathbb{Z}, \quad \forall n \in \mathcal{N} \quad (18)$$
$$a_{n,m} = \begin{cases} 1, & \text{if } a_n = m \\ 0, & \text{if } a_n \neq m \end{cases}, \quad \forall n \in \mathcal{N}, \ m \in \mathcal{M} \quad (19)$$
$$\sum_{n=1}^{N} a_{n,m} F_{m,n} \le F_m, \quad \forall m \in \mathcal{M} \quad (20)$$
Here, Equation (14) is our objective function, with the model partitioning decision $s_n$ and the task offloading decision $a_n$ as the optimization variables. The constraints in Equations (15) and (16) ensure that the inference energy and latency resulting from the decision cannot exceed the energy and latency produced by executing the entire inference task locally on MD n. Here, $T_n^{\mathrm{all\_local}}$ and $E_n^{\mathrm{all\_local}}$ are the latency and energy consumption, respectively, for executing the complete inference task locally on MD n. In Equation (17), $s_n$ is the model partition point and $L$ is the total number of layers in the model. Constraint (18) indicates that the offloading decision $a_n$ is selected from the set of available servers $\mathcal{M}$, where $a_n = 0$ signifies local computation. Constraint (19) defines the indicator variable, where $a_{n,m} = 1$ signifies that the remaining $L - s_n$ layers of the inference task for MD n are allocated to ES m for execution. Constraint (20) ensures that the computational resources allocated by ES m to all offloaded tasks do not exceed its total computational capacity.
The problem of model partitioning and resource allocation for device–edge collaborative inference proposed in this paper is a Mixed-Integer Nonlinear Programming (MINLP) problem. It consists of two parts: first, the selection of the model partition point (a finite discrete variable); second, the binary decision variables for assigning the inference task to an edge server. In [26], it is shown that if only the task allocation subproblem is considered, the complexity of finding the optimal solution increases exponentially with the number of users, and the problem has been proven to be NP-hard. Therefore, our system, which jointly optimizes partition points and task allocation, can be considered an extension of the NP-hard problem, and thus inherits its NP-hard nature. This makes it difficult to obtain a globally optimal solution in practice using polynomial time algorithms. Hence, this paper proposes the DAPO framework, which utilizes the DDPG algorithm to solve it.
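To make the scale of this joint decision space concrete, a back-of-the-envelope count under the Section 5 settings (L = 32 layers, M = 5 servers, N = 15 devices) is shown below; the point is only to illustrate why exhaustive search over the MINLP decision variables is infeasible.

```python
# Size of the joint decision space: each of the N MDs chooses a split point in
# {0, ..., L} and a destination in {0 (local), 1, ..., M}, so exhaustive search
# must consider ((L + 1) * (M + 1)) ** N candidates per decision epoch.
L, M, N = 32, 5, 15                      # Section 5 settings
joint_actions = ((L + 1) * (M + 1)) ** N
print(f"{joint_actions:.2e} joint partitioning/offloading decisions")  # on the order of 10^34
```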

4. Problem Solution

In this section, we will provide a detailed definition of the state space, action space, and reward function within the DDPG algorithm.

4.1. State Space

In this MEI system, the state space needs to reflect the entire system's state, which includes the number and dynamic locations of mobile devices, inference task attributes, and the computational resources of edge servers. This requires a very large state space dimension. As the number of mobile devices increases, this can easily lead to a state space explosion. To align with our optimization objective of minimizing both energy consumption and inference latency, we define the state space of the MEI system as $S = \sum_{n=1}^{N} T_n^{\mathrm{tot}}$, where $S$ denotes the system state, represented by the total inference latency.

4.2. Action Space

In this MEI system, for each mobile device's inference task, we need to optimize its model partition point and task allocation variable, thereby optimizing the overall system's latency and energy consumption. Therefore, to describe the computation offloading strategies for all mobile devices, we define the action space as the collection of all individual offloading strategies, represented by the action vector $A = [A_1, A_2, \ldots, A_N]$, where $A_n = (s_n, a_n)$ denotes the model partition point $s_n$ and the offloading server destination $a_n$ for the n-th MD's inference task.

4.3. Reward Function

We use $R(S, A)$ to represent the reward function, which is the reward value the agent obtains for executing the computation offloading strategy $A$ in state $S$. Our objective is to minimize the total inference latency while reducing the energy consumption of the mobile devices. Therefore, based on the objective function and its constraints, we define the reward function as
$$R(S, A) = -\sum_{n=1}^{N} T_n^{\mathrm{tot}} - \alpha \cdot \Delta T - \beta \cdot \Delta E$$
where $\Delta T = \sum_{n=1}^{N} (T_n^{\mathrm{tot}} - T_n^{\mathrm{all\_local}})$ serves as the latency penalty term and $\Delta E = \sum_{n=1}^{N} (E_n^{\mathrm{tot}} - E_n^{\mathrm{all\_local}})$ serves as the energy penalty term. This ensures that the inference energy and latency resulting from the chosen policy do not exceed those of executing the entire task locally on the mobile device. $\alpha$ and $\beta$ are the penalty weighting factors for latency and energy, respectively. This reward function implies that a better computation offloading strategy leads to lower total system latency and energy consumption and, consequently, a higher reward value.
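A minimal sketch of this reward computation is given below, assuming the per-device latency and energy values are already available from the system model; the default penalty weights are placeholders rather than the values used in our experiments.

```python
import numpy as np

# Sketch of the reward defined above: the negative total latency minus weighted
# penalty terms relative to the full-local latency/energy baselines. The penalty
# weights alpha and beta are placeholder assumptions, not tuned values.
def reward(T_tot, E_tot, T_all_local, E_all_local, alpha=1.0, beta=1.0):
    T_tot, E_tot = np.asarray(T_tot), np.asarray(E_tot)
    delta_T = np.sum(T_tot - np.asarray(T_all_local))   # latency penalty term
    delta_E = np.sum(E_tot - np.asarray(E_all_local))   # energy penalty term
    return -np.sum(T_tot) - alpha * delta_T - beta * delta_E
```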

4.4. Algorithm Design

Traditional reinforcement learning methods, such as DQN and PPO, encounter the so-called "curse of dimensionality" when the state or action space becomes very large. In our MEI system, as the number of mobile devices (MDs) increases, the joint action space—which includes each user's model split point $s_n$ and server selection $a_n$—grows exponentially, making it impractical to enumerate and evaluate every discrete action. Furthermore, our optimization objective requires making optimal discrete decisions over continuous system states, which presents challenges for algorithm design.
To address these challenges, we adopt the DDPG algorithm. DDPG is a deep reinforcement learning algorithm based on the actor–critic framework, which is particularly suitable for decision-making problems in continuous state spaces. It leverages neural networks to approximate the policy and value functions, thus effectively mitigating the curse of dimensionality.
However, the standard DDPG algorithm is designed for continuous action spaces, whereas in our problem, both the model partition point s n and the offloading server selection a n are discrete integers. To resolve this conflict, we have adaptively modified the DDPG framework. Specifically, the actor network outputs continuous action values during training, which are then converted into discrete, valid actions through a deterministic mapping rule when interacting with the environment.
The DDPG algorithm consists of four neural networks: an actor network $\mu(s|\theta^\mu)$ and its target network $\mu'(s|\theta^{\mu'})$, and a critic network $Q(s, a|\theta^Q)$ and its target network $Q'(s, a|\theta^{Q'})$. The actor network, acting as the policy network, is responsible for generating a deterministic action based on the current system state $s$ to perform model partitioning and computation offloading. The critic network, acting as the value network, is responsible for evaluating the long-term cumulative reward (Q-value) obtained by taking action $a$ in state $s$. DDPG utilizes an experience replay buffer to store records of the agent's interactions with the environment, i.e., transition tuples $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ is the system state at time step $t$, $a_t$ is the action selected by the agent, $r_t$ is the reward received after executing $a_t$ in state $s_t$, and $s_{t+1}$ is the next state observed. During training, the target actor and target critic networks use samples from the experience replay buffer to compute the target Q-value, while the critic network computes the current Q-value. The target Q-value is calculated as follows:
$$Q^{\mathrm{target}} = r_t + \gamma Q'\!\left(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'}) \,\middle|\, \theta^{Q'}\right)$$
where $\gamma$ is the discount factor for future rewards, and $\theta^{Q'}$ and $\theta^{\mu'}$ denote the parameters of the target critic network $Q'$ and the target actor network $\mu'$, respectively. $\mu'(s_{t+1}|\theta^{\mu'})$ is the action selected by the target actor network in the next state $s_{t+1}$. The critic network updates its parameters $\theta^Q$ by minimizing the mean squared error (MSE) loss function $L(\theta^Q)$:
$$L(\theta^Q) = \mathbb{E}\left[\left(Q^{\mathrm{target}} - Q(s_t, a_t|\theta^Q)\right)^2\right]$$
The update objective for the actor network is to adjust its policy to generate actions that can achieve a higher Q-value. In DDPG, the policy gradient for the actor network is given by
$$\nabla_{\theta^\mu} J \approx \mathbb{E}\left[\nabla_a Q(s, a|\theta^Q)\big|_{s=s_t,\, a=\mu(s_t|\theta^\mu)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)\right]$$
To maintain training stability, the DDPG algorithm uses soft updates for the parameters of the target actor and critic networks. The soft update formulas are as follows:
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$$
$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$$
where $\tau \ll 1$ is the soft update rate for the target networks, ensuring their stable evolution. Algorithm 1 presents the DAPO algorithm, based on DDPG, for the MEI system.
Algorithm 1: The DAPO Algorithm Based on DDPG
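To complement Algorithm 1, the following PyTorch sketch illustrates a single DAPO/DDPG training update, covering the target Q-value, the critic's MSE loss, the deterministic policy gradient, and the soft target updates. The actor, critic, their target networks, the optimizers, and the sampled replay-buffer batch are assumed to exist; this is an illustrative sketch rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

# One DAPO/DDPG training update. `actor`, `critic`, `actor_t`, and `critic_t` are
# assumed torch.nn.Module instances (actor: state -> action, critic: (state, action) -> Q);
# `batch` is a tuple of tensors sampled from the replay buffer.
def ddpg_update(actor, critic, actor_t, critic_t, actor_opt, critic_opt,
                batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch

    # Target Q-value: r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
    with torch.no_grad():
        q_target = r + gamma * critic_t(s_next, actor_t(s_next))

    # Critic update: minimize the MSE between Q(s_t, a_t) and the target.
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the deterministic policy gradient.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft updates of the target networks with rate tau.
    with torch.no_grad():
        for p_t, p in zip(critic_t.parameters(), critic.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
        for p_t, p in zip(actor_t.parameters(), actor.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```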
To address the optimization problem proposed in this paper, we designed a corresponding neural network architecture for the DDPG algorithm. The actor network is responsible for mapping the system state to a specific joint action. Its input layer receives a scalar state $S_t$, which is the total inference latency of the system from the previous step. To realize the mixed action output defined in this paper, the dimension of the actor network's output layer is designed to be $N \times (M + 2)$. For each MD n's inference task, one neuron uses a Sigmoid activation function followed by a linear transformation to output a continuous value in the interval $[0, L]$, representing the model partition point $s_n$. The remaining $M + 1$ neurons utilize a Softmax activation function to generate a probability distribution, which is used to decide whether the task should be executed locally or offloaded to one of the M edge servers. The final discrete offloading location is determined via an argmax operation, yielding the offloading decision $a_n$ for each MD. In this way, the actor network simultaneously determines both the model partition point $s_n$ and the task offloading destination $a_n$, thereby explicitly incorporating task offloading into the decision-making process. Figure 2 illustrates the DDPG-based edge intelligence computation offloading framework.
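The mixed-action head described above can be sketched as follows. The Sigmoid scaling, Softmax distribution, and argmax mapping follow the description in the text, while the hidden-layer sizes and the rounding of the continuous split point to an integer are illustrative assumptions about the deterministic mapping rule.

```python
import torch
import torch.nn as nn

# Sketch of the actor head mapping a scalar system state to the mixed action
# (s_n, a_n) for each of the N devices. Hidden sizes are illustrative assumptions.
class Actor(nn.Module):
    def __init__(self, n_devices, n_servers, n_layers, hidden=128):
        super().__init__()
        self.N, self.M, self.L = n_devices, n_servers, n_layers
        self.backbone = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        # Per device: 1 neuron for the split point + (M + 1) neurons for the
        # local/offload destination, i.e. an output of dimension N * (M + 2).
        self.head = nn.Linear(hidden, self.N * (self.M + 2))

    def forward(self, state):                         # state: (batch, 1)
        out = self.head(self.backbone(state)).view(-1, self.N, self.M + 2)
        split = torch.sigmoid(out[..., 0]) * self.L   # continuous split point in [0, L]
        dest_probs = torch.softmax(out[..., 1:], dim=-1)  # {local, ES 1, ..., ES M}
        return split, dest_probs

# Deterministic mapping to valid discrete decisions when acting in the environment.
def to_discrete(split, dest_probs):
    s_n = torch.round(split).long()                   # split point in {0, ..., L}
    a_n = torch.argmax(dest_probs, dim=-1)            # 0 = local, m = offload to ES m
    return s_n, a_n
```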

5. Performance Evaluation

In this section, to validate the effectiveness of the proposed DAPO framework, we establish a simulation environment. The algorithm is implemented in Python (3.10) using PyTorch (2.0.1) and runs on an Intel Core i7-12700KF CPU (3.6 GHz), 32 GB of RAM, an NVIDIA GeForce RTX 3060 GPU, and the Windows 11 64-bit operating system. In this simulation environment, the network area is 500 m × 500 m and contains N = 15 MDs and M = 5 ESs. The mobility of the mobile devices follows the Gauss–Markov model to simulate the continuity and randomness of the user's velocity and direction during movement. The mobility trajectories of UDs are shown in Figure 3. The mean velocity $\bar{v}$ is set to 2.0 m/s, the memory factors for velocity and direction, $\epsilon_1$ and $\epsilon_2$, are both set to 0.8, and the time step $\Delta t$ is 1.0 s. The system's wireless channel bandwidth is set to B = 10 MHz, the background noise power spectral density $N_0$ is −174 dBm/Hz, and the uplink transmission power of the mobile devices $p_n$ is 500 mW. The computational capacity of the edge servers is randomly selected from the range $[1 \times 10^{15}, 2 \times 10^{15}]$ FLOPS, and that of the mobile devices from $[2 \times 10^{13}, 2.5 \times 10^{13}]$ FLOPS. The inference task for the mobile devices is set as the inference of an LLM, with its structural parameters referencing the Llama-2-7B model. The model has a total of L = 32 layers, a hidden dimension $d_{mid} = 4096$, and $\kappa = 32$ attention heads. We adopt Llama-2-7B as the inference model because it is openly available and has manageable computational demand, which facilitates reproducible experimentation. Importantly, the underlying principles of partitioning and offloading remain the same for larger LLMs, and thus the conclusions are not limited to this specific model size. The dataset for the tasks is the WikiText-2 dataset [27], which contains 4355 sentences with an average length of 20 words.
The parameters for the DAPO algorithm are shown in Table 2. These values were selected by combining common practices in DDPG implementations with empirical tuning in our simulation environment. For example, the learning rates of the actor and critic networks were set to 0.001, which is widely adopted in continuous control tasks to ensure stable convergence. The replay buffer size (100,000) and batch size (64) follow standard settings that balance memory cost and sample diversity. The discount factor was fixed at 0.99 to capture long-term optimization goals, while the soft update rate τ = 0.005 was chosen to provide stable target network updates. Finally, the number of episodes (200) and steps per episode (1000) were determined through preliminary experiments, where we observed that these values were sufficient for convergence in our scenario. For a fair comparison, the hyperparameters of PPO and A2C were set according to widely used implementations in continuous control tasks. Specifically, the learning rate was set to 0.001, the discount factor to 0.99, and the batch size to 64. The entropy coefficient for PPO was 0.01 to encourage exploration, while A2C used five parallel workers for training stability.
Figure 4 and Figure 5 show the impact of the number of mobile devices on the total system latency and energy consumption. As the number of users increases, the static full offloading strategy suffers from sharply deteriorating latency due to competition for computational resources, while the local computing strategy results in high energy consumption due to its low energy efficiency. In contrast, our proposed DAPO framework demonstrates significant advantages in both latency and energy consumption through intelligent decision-making. Notably, the performance of DAPO is superior to the PPO and A2C baselines. Specifically, when the number of mobile devices reaches 12, DAPO reduces the total latency by about 27% compared to PPO and 34% compared to A2C, while lowering energy consumption by approximately 18% and 22%, respectively. This is because, in our model, the joint optimization of the model partition point and the computation offloading location results in a high-dimensional discrete action space. The PPO and A2C algorithms are prone to getting stuck in inefficient exploration and policy updates for such problems, whereas DAPO solves the problem by transforming it into a continuous action space under a deterministic policy and then mapping the actions to discrete decisions, allowing for more effective optimization. The experimental results confirm that DAPO achieves the lowest system latency in all scenarios and maintains a near-optimal level of energy consumption, demonstrating its superior performance in handling complex decision-making problems and the excellent scalability of the system.
In Figure 6, we investigate the impact of the system’s wireless bandwidth on the total system latency. From the experimental results, it is clear that the latency of the full local computation strategy remains unchanged, as it does not involve network communication. Conversely, the performance of the full offloading strategy is highly sensitive to bandwidth; its latency is extremely high at low bandwidths and decreases rapidly as bandwidth increases, eventually being limited by the computational resources of the edge servers. Our proposed DAPO framework exhibits optimal performance and strong robustness across all bandwidth conditions. For example, under limited bandwidth, DAPO achieves up to 30% lower latency compared to full offloading. Even when bandwidth is limited, DAPO can intelligently adjust the model partition point to reduce the amount of offloaded data, thereby avoiding high communication latency. As bandwidth improves, it dynamically increases the offloading ratio to further reduce the total latency. This experiment fully demonstrates the adaptive capability of our framework to make efficient decisions based on changing network conditions.
In Figure 7, we evaluate the convergence performance of the algorithms. There is a significant difference between the training reward curves of DAPO and PPO. The DAPO algorithm demonstrates extremely fast convergence, rapidly reaching a high and stable reward plateau in the early stages of training. Quantitatively, DAPO converges within about 50 episodes, while PPO requires around 150 episodes, representing a nearly threefold increase in convergence speed. In contrast, the learning process of the PPO algorithm is very slow, taking approximately 150 episodes to gradually converge, and its final policy reward is much lower than that of DAPO. This performance gap primarily stems from the core mechanisms of the two algorithms and their adaptability to our scenario. First, in terms of sample efficiency, DAPO, as an off-policy algorithm, utilizes an experience replay buffer to efficiently reuse historical data, thus enabling rapid learning. PPO, being an on-policy algorithm, needs to discard old data and resample after each update, leading to lower learning efficiency, which is directly reflected in its slow convergence curve. Second, in terms of handling high-dimensional action spaces, this problem involves jointly optimizing model partition points and offloading destinations, forming a vast combinatorial action space. PPO explores through a random policy, which is inefficient in such a high-dimensional space and struggles to find the optimal solution. In contrast, DAPO, through its deterministic policy and guiding critic network, performs a more targeted search in a continuous action space before mapping to a specific decision, which is demonstrably a more effective way to solve such complex decision-making problems. Therefore, DAPO not only finds a better policy, but its superior learning efficiency also makes it more suitable for dynamic scenarios that require rapid adaptation to environmental changes.
As shown in Figure 4, Figure 5, Figure 6 and Figure 7, the proposed DAPO framework consistently outperforms local execution, full offloading, and RL baselines such as PPO and A2C in terms of latency, energy consumption, and convergence speed. These results demonstrate that DAPO effectively addresses the dual challenges of resource constraints and user mobility by jointly optimizing model partitioning and task offloading, leading to significant improvements over baseline methods.

6. Conclusions and Future Work

This paper addresses the challenges of collaborative LLM inference in MEC environments by proposing the DAPO framework based on DDPG. The framework jointly optimizes model partition points and offloading decisions, aiming to minimize system-wide latency and energy consumption in multi-user mobile scenarios. The main contributions of this work are threefold. First, we formulate the joint optimization problem of model partitioning and task offloading under resource constraints and user mobility. Second, we design the DAPO algorithm, which leverages the advantages of DDPG to efficiently handle the high-dimensional mixed action space. Third, extensive experiments demonstrate that DAPO significantly outperforms static strategies and RL baselines such as PPO and A2C, achieving faster convergence and better trade-offs between latency and energy consumption. These findings confirm that DAPO effectively addresses the identified challenges and provides new insights into dynamic, mobility-aware edge intelligence systems.
Our future research will focus on expanding the decision space by incorporating more fine-grained resource allocation (such as CPU cores, data compression rates) into the optimization scope in order to achieve a deeper level of end-to-end optimization. Concurrently, we will enhance the realism of our environmental model by considering network handover and task migration issues that arise from high-speed user mobility. Furthermore, exploring the Multi-Agent Reinforcement Learning (MARL) paradigm to construct a distributed collaborative decision-making system capable of adapting to future larger-scale and more complex edge intelligence networks will be a key focus of our subsequent work.

Author Contributions

Conceptualization, H.F. and G.H.; methodology, H.F. and G.H.; software, G.H.; validation, G.H.; formal analysis, G.H.; investigation, X.Z. and J.L.; resources, X.Z. and J.L.; data curation, X.Z. and J.L.; writing—original draft preparation, G.H.; writing—review and editing, N.Z. and G.H.; visualization, G.H.; supervision, Y.L. and H.F.; project administration, F.Z. and H.F.; funding acquisition, F.Z. and H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Program of Guangxi (AB25069302), the Nanning Yongjiang Talent Program (No. 2023018), Guilin S&T Program (20230119-1), Technology Major Project of Guangxi No. AA18118031, and Guangxi Natural Science Foundation No. 2018GXNSFAA281318.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The WikiText-2 dataset used in this study is publicly available at https://huggingface.co/datasets/mindchain/wikitext2 (accessed on 8 June 2025).

Acknowledgments

We would like to thank the anonymous reviewers for their constructive feedback, which helped improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet Things J. 2020, 7, 7457–7469. [Google Scholar] [CrossRef]
  2. Mao, Y.; Yu, X.; Huang, K.; Zhang, Y.-J.A.; Zhang, J. Green Edge AI: A Contemporary Survey. Proc. IEEE 2024, 112, 880–911. [Google Scholar] [CrossRef]
  3. Bianchini, R.; Fontoura, M.; Cortez, E.; Bonde, A.; Muzio, A.; Constantin, A.-M.; Moscibroda, T.; Magalhaes, G.; Bablani, G.; Russinovich, M. Toward ML-Centric Cloud Platforms. Commun. ACM 2020, 63, 50–59. [Google Scholar] [CrossRef]
  4. Huang, Y.; Qiao, X.; Lai, W.; Dustdar, S.; Zhang, J.; Li, J. Enabling DNN Acceleration with Data and Model Parallelization over Ubiquitous End Devices. IEEE Internet Things J. 2021, 9, 15053–15065. [Google Scholar] [CrossRef]
  5. Zhang, S.; Zhang, S.; Qian, Z.; Wu, J.; Jin, Y.; Lu, S. DeepSlicing: Collaborative and Adaptive CNN Inference with Low Latency. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 2175–2187. [Google Scholar] [CrossRef]
  6. Dong, F.; Wang, H.; Shen, D.; Huang, Z.; He, Q.; Zhang, J.; Wen, L.; Zhang, T. Multi-Exit DNN Inference Acceleration Based on Multi-Dimensional Optimization for Edge Intelligence. IEEE Trans. Mob. Comput. 2022, 22, 5389–5405. [Google Scholar] [CrossRef]
  7. Zeng, L.; Chen, X.; Zhou, Z.; Yang, L.; Zhang, J. CoEdge: Cooperative DNN Inference with Adaptive Workload Partitioning over Heterogeneous Edge Devices. IEEE/ACM Trans. Netw. 2020, 29, 595–608. [Google Scholar] [CrossRef]
  8. Mohammed, T.; Joe-Wong, C.; Babbar, R.; Di Francesco, M. Distributed Inference Acceleration with Adaptive DNN Partitioning and Offloading. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; IEEE: New York, NY, USA, 2020; pp. 854–863. [Google Scholar]
  9. Eshratifar, A.E.; Abrishami, M.S.; Pedram, M. JointDNN: An Efficient Training and Inference Engine for Intelligent Mobile Cloud Computing Services. IEEE Trans. Mob. Comput. 2019, 20, 565–576. [Google Scholar] [CrossRef]
  10. Zhou, M.; Zhou, B.; Wang, H.; Dong, F.; Zhao, W. Dynamic Path Based DNN Synergistic Inference Acceleration in Edge Computing Environment. In Proceedings of the 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), Beijing, China, 14–16 December 2021; IEEE: New York, NY, USA, 2021; pp. 567–574. [Google Scholar]
  11. Liang, H.; Sang, Q.; Hu, C.; Cheng, D.; Zhou, X.; Wang, D.; Bao, W.; Wang, Y. DNN Surgery: Accelerating DNN Inference on the Edge through Layer Partitioning. IEEE Trans. Cloud Comput. 2023, 11, 3111–3125. [Google Scholar] [CrossRef]
  12. Tang, X.; Chen, X.; Zeng, L.; Yu, S.; Chen, L. Joint Multiuser DNN Partitioning and Computational Resource Allocation for Collaborative Edge Intelligence. IEEE Internet Things J. 2020, 8, 9511–9522. [Google Scholar] [CrossRef]
  13. Li, E.; Zeng, L.; Zhou, Z.; Chen, X. Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing. IEEE Trans. Wirel. Commun. 2019, 19, 447–457. [Google Scholar] [CrossRef]
  14. Mao, J.; Yang, Z.; Wen, W.; Wu, C.; Song, L.; Nixon, K.W.; Chen, X.; Li, H.; Chen, Y. MeDNN: A Distributed Mobile System with Enhanced Partition and Deployment for Large-Scale DNNs. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, 13–16 November 2017; IEEE: New York, NY, USA, 2017; pp. 751–756. [Google Scholar]
  15. Zhao, Z.; Barijough, K.M.; Gerstlauer, A. DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 37, 2348–2359. [Google Scholar] [CrossRef]
  16. Laskaridis, S.; Venieris, S.I.; Almeida, M.; Leontiadis, I.; Lane, N.D. SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud. In Proceedings of the MobiCom ’20: The 26th Annual International Conference on Mobile Computing and Networking, London, UK, 21–25 September 2020; ACM: New York, NY, USA, 2020; pp. 1–15. [Google Scholar]
  17. Ngo, D.; Park, H.-C.; Kang, B. Edge Intelligence: A Review of Deep Neural Network Inference in Resource-Limited Environments. Electronics 2025, 14, 2495. [Google Scholar] [CrossRef]
  18. Wang, Y.; Xue, H.; Zhou, M. A Digital Twin-Assisted VEC Intelligent Task Offloading Approach. Electronics 2025, 14, 3444. [Google Scholar] [CrossRef]
  19. Zhang, J.; Ma, S.; Yan, Z.; Huang, J. Joint DNN Partitioning and Task Offloading in Mobile Edge Computing via Deep Reinforcement Learning. J. Cloud Comput. 2023, 12, 116. [Google Scholar] [CrossRef]
  20. Zhao, Z.; Wang, K.; Ling, N.; Xing, G. EdgeML: An AutoML Framework for Real-Time Deep Learning on the Edge. In Proceedings of the IoTDI ’21: International Conference on Internet-of-Things Design and Implementation, Charlottesvle, VA, USA, 18–21 May 2021; ACM: New York, NY, USA, 2021; pp. 133–144. [Google Scholar]
  21. Chen, Y.; Li, R.; Yu, X.; Zhao, Z.; Zhang, H. Adaptive Layer Splitting for Wireless Large Language Model Inference in Edge Computing: A Model-Based Reinforcement Learning Approach. Front. Inf. Technol. Electron. Eng. 2025, 26, 278–292. [Google Scholar] [CrossRef]
  22. Yuan, X.; Li, N.; Wei, K.; Xu, W.; Chen, Q.; Chen, H.; Guo, S. Mobility and Cost Aware Inference Accelerating Algorithm for Edge Intelligence. IEEE Trans. Mob. Comput. 2024, 24, 1530–1549. [Google Scholar] [CrossRef]
  23. Ordóñez, S.A.C.; Samanta, J.; Suárez-Cetrulo, A.L.; Carbajo, R.S. Intelligent Edge Computing and Machine Learning: A Survey of Optimization and Applications. Future Internet 2025, 17, 417. [Google Scholar] [CrossRef]
  24. Li, B.; Liu, Y.; Tan, L.; Pan, H.; Zhang, Y. Digital Twin Assisted Task Offloading for Aerial Edge Computing and Networks. IEEE Trans. Veh. Technol. 2022, 71, 10863–10877. [Google Scholar] [CrossRef]
  25. Saleem, U.; Liu, Y.; Jangsher, S.; Li, Y.; Jiang, T. Mobility-Aware Joint Task Scheduling and Resource Allocation for Cooperative Mobile Edge Computing. IEEE Trans. Wirel. Commun. 2020, 20, 360–374. [Google Scholar] [CrossRef]
  26. Deng, X.; Yin, J.; Guan, P.; Xiong, N.N.; Zhang, L.; Mumtaz, S. Intelligent Delay-Aware Partial Computing Task Offloading for Multiuser Industrial Internet of Things through Edge Computing. IEEE Internet Things J. 2021, 10, 2954–2966. [Google Scholar] [CrossRef]
  27. Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer Sentinel Mixture Models. arXiv 2016, arXiv:1609.07843. [Google Scholar] [CrossRef]
Figure 1. Architecture for dynamic collaborative LLM inference in the MEI system. The RL agent adaptively partitions the LLM into device-side and edge-side components for execution.
Figure 2. The DAPO framework for collaborative inference offloading in MEI, based on DDPG.
Figure 3. The mobility trajectories of UDs.
Figure 4. Total delay of different numbers of MD.
Figure 5. Total energy of different numbers of MD.
Figure 6. Total delay of different bandwidth of MEC server.
Figure 7. Convergence performance of PPO and DAPO.
Table 1. Notation definition.

Notation | Description
$\mathcal{N}$ | The set of mobile devices (MDs)
$\mathcal{M}$ | The set of edge servers (ESs)
$B$ | Bandwidth of the wireless channel
$N_0$ | Power spectral density of Gaussian noise
$F_n$ | Total computing capability of MD n
$F_m$ | Total computing capability of ES m
$L$ | Total number of layers in the LLM
$W_j$ | Computational workload of layer j
$s_n$ | Model split point for MD n
$a_n$ | Offloading decision (target server) for MD n
$D_n$ | Size of the intermediate data generated at split point $s_n$
$p_n$ | Transmission power of MD n
$P_n$ | Operating power consumption of MD n
$P_m$ | Operating power consumption of ES m
$F_{m,n}$ | Computing resources allocated by ES m to MD n
$R_{n,m}$ | Uplink transmission rate between MD n and ES m
$T_n^{\mathrm{all\_local}}$ | Delay of full local execution of MD n
$E_n^{\mathrm{all\_local}}$ | Energy of full local execution of MD n
$T_n^{\mathrm{loc}}$ | Local computation delay for the task of MD n
$T_n^{\mathrm{remote}}$ | Remote computation delay for the task of MD n
$T_{n,m}^{\mathrm{trans}}$ | Transmission delay for offloading from MD n to ES m
$T_n^{\mathrm{tot}}$ | Total end-to-end inference delay for MD n
$E_n^{\mathrm{tot}}$ | Total energy consumption for MD n
Table 2. Parameter setting in the DAPO algorithm.

Parameter | Value | Description
Episode | 200 | The size of the main loop
T | 1000 | The size of the secondary loop
Buffer size | 100,000 | The size of the replay buffer
Batch size | 64 | The size of the sample drawn from the replay buffer
$\gamma$ | 0.99 | Reward discount factor
Actor learning rate | 0.001 | The learning rate of the Adam optimizer in the actor network
Critic learning rate | 0.001 | The learning rate of the Adam optimizer in the critic network
$\tau$ | 0.005 | The soft update rate of the target networks
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
