1. Introduction
With the rapid development of the Internet of Things (IoT) and mobile communication technologies, the network edge is facing the dual challenges of massive terminal access and explosive data growth [
1]. Traditional mobile cloud computing supports resource-constrained terminals via cloud servers, but faces a critical bottleneck: its centralized model requires long-distance data transmission to remote clouds, making it unable to meet the real-time demands of applications such as autonomous driving and remote surgery [
2]. To address this issue, the European Telecommunications Standards Institute (ETSI) proposed the Mobile Edge Computing (MEC) architecture [
3]. By sinking cloud computing capabilities to network edge nodes, MEC enables nearby services for terminal devices, effectively alleviating the latency bottleneck of traditional architectures. Recently, the MEC architecture has demonstrated significant application value in fields such as smart healthcare, intelligent driving, and environmental monitoring [
4].
However, with the surging number of IoT devices and the scarcity of spectrum resources, conventional orthogonal multiple access (OMA) techniques are increasingly unable to satisfy the dual requirements of low latency and low energy consumption in MEC networks. Fortunately, non-orthogonal multiple access (NOMA) has emerged as a promising technology: by enabling multiple users to share the same time-frequency resources, NOMA can significantly enhance the system spectral efficiency [
5]. Accordingly, the synergy between NOMA and MEC (referred to as NOMA-MEC) offers significant advantages, such as enabling massive connectivity, achieving low latency, enhancing energy efficiency, and providing flexibility [
6,
7,
8].
While the NOMA-MEC technology enhances edge device processing capabilities, rapid growth in sensors and service demands creates critical resource scheduling challenges [
9,
10,
11]. In particular, vision-based applications strain wireless networks due to the frequent transmission of high-resolution images and videos [
12]. Compression-based offloading strategies address this issue by compressing data before transmission to edge servers. This reduces data size, saving energy and minimizing latency [
13]. Therefore, optimal compression and resource allocation strategies are essential for balancing communication efficiency with computational performance in edge computing networks.
1.1. Related Works
1.1.1. NOMA-Enabled MEC
Recently, numerous studies have concentrated on optimizing the latency-energy performance of MEC systems with OMA/NOMA. For example, in heterogeneous cellular MEC networks, ref. [
14] considered both task accuracy requirements and the parallel computing capabilities of MEC servers, aiming to maximize the system efficiency by jointly optimizing computation offloading strategies and resource allocation. The authors of [
15] designed a collaborative optimization framework for UAV trajectory, task offloading, and communication resource allocation, to minimize the computational latency and energy consumption of tasks in MEC systems. Moreover, considering the uncertainties in real-world MEC networks, ref. [
16] investigated energy-efficient computation offloading strategies. For NOMA-MEC networks, the work [
17] proposed a joint resource management method based on DRL frameworks to reduce task computation latency. Furthermore, ref. [
18] achieved the minimization of computational energy consumption in NOMA-MEC networks, by jointly optimizing power allocation and the time slot duration. For multi-cell NOMA-MEC networks, the authors of [
19] introduced a game theory-based resource optimization approach that minimizes the weighted sum of delay and energy consumption. However, these works have not exploited data compression for computation offloading.
1.1.2. Compression-Assisted MEC
While a high compression rate can reduce the channel resource requirement in task offloading, it frequently results in significant accuracy degradation for task services. On the other hand, offloading with a low compression rate consumes more network resources and might even lead to transmission failures caused by excessive latency. Therefore, in order to strike a balance between communication efficiency and computation performance in edge computing networks, it is vital to develop an appropriate compression offloading and resource allocation strategy. Existing research has made some progress in tackling these challenges. For instance, to minimize the task energy consumption, ref. [
20] proposed a collaborative optimization scheme for data compression ratio and resource configuration in MEC networks. The work [
21] investigated an efficient and secure multi-user multi-task computation offloading model for MEC. Moreover, for hierarchical MEC systems, ref. [
22] designed a three-step framework algorithm for data compression and resource allocation, aiming to reduce the weighted system cost. For low earth orbit satellite networks, the work [
23] jointly optimized the data compression and task scheduling strategy to maximize the system utilization. Furthermore, for industrial internet of things assisted by UAV, the authors of [
24] proposed an energy-efficient trajectory and scheduling optimization based on data compression. However, the above-mentioned works often treated the compression ratio as a fixed value, considered only its one-sided benefits, or applied compression indiscriminately. They lack a mechanism to adaptively decide whether to compress, and to what degree, based on real-time system states.
1.1.3. DRL for Continuous Control
For Deep Reinforcement Learning (DRL) algorithms, the work [
25] proposed an optimization framework based on Deep Deterministic Policy Gradient (DDPG), which dynamically coordinates the offloading decisions and resource allocation strategies in MEC systems. Ref. [
26] developed a hybrid DRL algorithm to minimize the weighted cost of energy consumption and service latency, by optimizing the offloading strategy in a wireless power transfer-aided MEC network [
27]. Additionally, ref. [
28] minimized the total computational latency for a multi-access MEC network by integrating a DRL framework with a coordination mechanism. However, these existing DRL algorithms are prone to accumulated bias in value function estimation. Moreover, in traditional DRL algorithms, overestimation can lead to oscillations in policy updates, while underestimation may reduce the convergence speed, both of which degrade the learning efficiency of the algorithms in complex and dynamic environments.
1.2. Motivations and Contributions
To cope with the above challenges, this paper constructs a NOMA-MEC network architecture based on data compression. By establishing a joint optimization model of computational resources, offloading strategies, and compression ratios in a dynamic environment, the total task processing latency is minimized under constraints such as the user’s battery capacity and the MEC server’s computing power, by exploiting a DRL framework based on Softmax Deep Double Deterministic Policy Gradient (SD3). Specifically, a composite state space integrating multi-user task specifications, real-time battery levels, and time-varying channel gains is first designed, and then a joint action space capable of simultaneously outputting continuous-valued compression ratios, offloading ratios, and server computational resource allocations is constructed. Finally, a reward function incorporating penalty terms for constraint violations is designed. The main contributions of this paper are summarized as follows.
To improve the offloading efficiency in multi-user MEC scenarios, a NOMA-MEC network architecture based on data compression is proposed. This architecture comprehensively considers practical factors such as users’ battery capacity, task computation latency, and computational resource limitations. An optimization problem is formulated to minimize the total task processing latency by jointly optimizing computational resources, offloading strategies, and data compression ratios.
A DRL framework based on SD3 is proposed to tackle the joint optimization of offloading proportion, compression ratio, and computation resource. Unlike methods that update optimization variables alternately, our approach outputs all variables simultaneously, enabling coordinated decisions that minimize the total task processing latency in NOMA-MEC networks. The framework exhibits strong adaptability to dynamic network environments and user mobility, providing valuable practical insights for real-world deployment.
We compare and discuss the effects of computing frequency of the edge server, task data size, number of users, and bandwidth on the total long-term task latency. Furthermore, extensive simulation results show that our algorithm outperforms other benchmark algorithms concerning the total long-term task latency and the convergence speed. The findings offer important implications for the design of future wireless communication systems, particularly in MEC scenarios requiring ultra-low latency and high reliability services.
The structure of the paper is outlined as follows. In
Section 2, the system model is presented, and the associated optimization problem is formulated in Section 3. Then, in
Section 4, the optimization problem is reformulated as an MDP, and more details about the design of the proposed SD3 algorithm are introduced. In
Section 5, numerical simulations are given, as well as a comparative analysis of the results. Finally,
Section 6 is the conclusion of this paper.
2. System Model
As shown in
Figure 1, we consider a NOMA-MEC network consisting of
N users and a BS equipped with an edge server. It is assumed that all users and the base station are equipped with a single antenna. The set of users is denoted as
. Each user is equipped with a rechargeable battery and can harvest Radio Frequency (RF) energy from the surrounding environment to sustain device operation. In the system model, the total time is divided into
T equal-length time slots, each with a duration of
, and each time slot is identified by the index
. It is assumed that Channel State Information (CSI) remains quasi-static within each time slot.
2.1. Communication Model
At each time slot
t,
N users utilize NOMA to offload their computation tasks to the BS concurrently [
29]. The BS receives and computes the offloaded tasks within each time slot, then returns the computation results to the served users. It is assumed that the channel gains between all users and the base station are sorted in descending order, i.e.,
. Accordingly, the achievable offloading rate for user
n is given by
where
B is the channel bandwidth,
is the transmit power of user
n,
is the noise power. When transmission rates are limited, the system may either increase the compression ratio to reduce data transmission latency or allocate additional computing resources to minimize overall computational latency.
2.2. Computation Model
Due to limited computational capabilities, each user adopts a partial offloading scheme, whereby a portion of its task can be offloaded to the edge server for computation, with the remaining portion executed locally. The computational task of user n can be represented as , where denotes the task size (in bits) of user n’s task, and represents the number of CPU cycles required to compute each bit of task data (cycles per bit). Let be the task offloading ratio of user n in time slot t, meaning that a task amount of is offloaded to the BS, and the remaining task amount of is executed locally at user n.
2.2.1. Local Computing
In the local computing mode, the execution latency of user
n at time slot
t is given as
where
is the local CPU computing frequency.
Similarly, the energy consumption by user
n to handle the task locally is represented as
where
indicates the energy consumption of the CPU per cycle for computing.
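Since the display equations are omitted from this extract, a plausible form consistent with the surrounding definitions (with assumed symbols $\alpha_n(t)$ for the offloading ratio, $D_n$ for the task size, $C_n$ for the CPU cycles per bit, $f_n^{\mathrm{loc}}$ for the local CPU frequency, and $e_n^{\mathrm{loc}}$ for the energy per cycle) is
$$ T_n^{\mathrm{loc}}(t) = \frac{\big(1-\alpha_n(t)\big) D_n C_n}{f_n^{\mathrm{loc}}}, \qquad E_n^{\mathrm{loc}}(t) = e_n^{\mathrm{loc}} \big(1-\alpha_n(t)\big) D_n C_n. $$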
2.2.2. Edge Computing
The task offloading process consists of three key phases: the task transmission phase, the task execution phase, and the result return phase. It is worth noting that when the MEC server returns the computation results to the user, the computation results are usually far smaller in magnitude than the input task data, so the delay and energy consumption of this phase can be neglected [
30].
Task Transmission Stage
When tasks are offloaded to the MEC server, the user first performs local compression on the task data to be offloaded. Then, the compressed task data is transmitted to the base station. Upon receiving the compressed data, the base station decompresses it before performing subsequent computation and processing. In this paper, taking into account factors such as data integrity, complexity and real-time performance trade-offs, as well as compatibility, the Huffman compression model is adopted [
31]. Let
denote the compression ratio, meaning that one bit of raw data is compressed to
bits;
is a fixed parameter determined solely by the chosen compression method. Additionally, we adopt a partial compression scheme [
21], where only a fraction
of the raw data for user
n is compressed, while the rest is transmitted uncompressed,
and
represent no compression and complete compression, respectively. Specifically, a tractable model can be used [
20], where the required CPU cycles for compressing 1-bit of raw task for user
n can be approximated with an exponential function of the task compression ratio
as [
22]
where
is a positive constant depending on the data compression method. In particular, when
, the required number of CPU cycles is zero, which indicates that no data compression operation is performed. Correspondingly, the delay and energy consumption caused by the compression of user
n in time slot
t can be expressed as
respectively, where
is the energy consumption per cycle for data compression. It is assumed that the MEC server applies the same compression technique for decompression, so the corresponding latency for decompression is given as
where
represents the CPU computing frequency assigned by the MEC server.
Moreover, at time slot
t, the transmission latency and energy consumption for user
n is given by
respectively.
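With the partial-compression scheme described above, one plausible form of the offloaded payload and the resulting transmission latency and energy (the symbols $\beta_n(t)$ for the compressed fraction, $\rho$ for the compression ratio, and $R_n(t)$ for the offloading rate are assumptions consistent with the text) is
$$ D_n^{\mathrm{tr}}(t) = \alpha_n(t) D_n \big[ \beta_n(t)\,\rho + 1 - \beta_n(t) \big], \qquad T_n^{\mathrm{tr}}(t) = \frac{D_n^{\mathrm{tr}}(t)}{R_n(t)}, \qquad E_n^{\mathrm{tr}}(t) = p_n\, T_n^{\mathrm{tr}}(t). $$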
Task Execution Stage
After decompression, the MEC server executes the offloaded computation task. For user
n, the latency of the MEC server executing task at time slot
t is written as
Therefore, in time slot
t, the latency incurred for completing user
n’s offloaded computing task can be expressed as
In summary, the total delay for user
n to complete the computing task within time slot
t can be expressed as
Similarly, the total energy consumed by user
n to complete the computing task in time slot
t can be expressed as
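As the display equations are not reproduced here, one plausible combination, assuming local execution proceeds in parallel with the compression-offloading-execution pipeline and that the user's energy covers local computing, compression, and transmission, is
$$ T_n(t) = \max\Big\{ T_n^{\mathrm{loc}}(t),\; T_n^{\mathrm{comp}}(t) + T_n^{\mathrm{tr}}(t) + T_n^{\mathrm{dec}}(t) + T_n^{\mathrm{exe}}(t) \Big\}, \qquad E_n(t) = E_n^{\mathrm{loc}}(t) + E_n^{\mathrm{comp}}(t) + E_n^{\mathrm{tr}}(t). $$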
2.3. Energy Harvesting Model
The system employs a linear energy harvesting model for wireless charging of each user device. This model comprises an RF energy transmitter and the user’s rechargeable battery [
32]. Due to the varying spatial distribution of users around the transmitter across different time slots, the amount of energy harvested by each user also fluctuates. The distance
between a specific user
n and the RF energy transmitter is modeled as a time series following a Markov chain, characterized by its transition probability matrix [
33]. To account for practical energy conversion losses during the storage process in the battery, an energy conversion efficiency factor (denoted as
) is introduced. Consequently, the energy collected by user
n from the RF energy transmitter during a specific time slot
t is given by
where
is the transmit power of the RF transmitter,
indicates the path loss factor,
is the antenna gain between the transmitting and receiving antennas, and
is the energy conversion factor.
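A linear harvesting expression consistent with the quantities listed above (the symbols $\eta$, $P_{\mathrm{RF}}$, $G$, $d_n(t)$, $\theta$, and the slot duration $\tau$ are assumptions, since the original notation is not reproduced here) would be
$$ E_n^{\mathrm{EH}}(t) = \eta\, P_{\mathrm{RF}}\, G\, d_n(t)^{-\theta}\, \tau. $$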
It is presumed that at the beginning of time slot
t, the battery energy is denoted as
. To prevent over-discharging of the battery, the entire process must satisfy the energy causality constraint, and the stored energy cannot exceed the maximum battery capacity
. Consequently, the battery energy of user
n must change with
where
is the maximal battery capacity of each user, and
.
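A battery-evolution rule consistent with the stated energy causality and capacity constraints (notation assumed) is
$$ B_n(t+1) = \min\Big\{ B_n(t) - E_n(t) + E_n^{\mathrm{EH}}(t),\; B_{\max} \Big\}, \qquad E_n(t) \le B_n(t). $$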
3. Problem Formulation
In this paper, the goal is to minimize the total long-term task processing latency through the joint optimization of offloading proportion
, compression ratio
, and computation resource allocation at the MEC server
. Accordingly, the optimization problem is formulated as
where
represents the maximal computing frequency of the MEC server, and C1 indicates the energy causality constraint for each user. To avoid task queue congestion, C2 requires that task processing be accomplished within each time slot. In addition, C3–C5 are the constraints on computing resources, offloading proportion, and compression ratio, respectively.
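The problem statement itself is omitted from this extract; a sketch of its structure, consistent with the objective and constraints C1–C5 described above (decision-variable symbols are assumptions), is
$$
\begin{aligned}
\mathbf{P1}: \quad & \min_{\{\alpha_n(t),\, \beta_n(t),\, f_n^{\mathrm{mec}}(t)\}} \; \sum_{t=1}^{T} \sum_{n=1}^{N} T_n(t) \\
\text{s.t.} \quad & \mathrm{C1}: E_n(t) \le B_n(t), \quad \forall n, t, \\
& \mathrm{C2}: T_n(t) \le \tau, \quad \forall n, t, \\
& \mathrm{C3}: \textstyle\sum_{n=1}^{N} f_n^{\mathrm{mec}}(t) \le f_{\max}^{\mathrm{mec}}, \quad \forall t, \\
& \mathrm{C4}: 0 \le \alpha_n(t) \le 1, \quad \forall n, t, \\
& \mathrm{C5}: 0 \le \beta_n(t) \le 1, \quad \forall n, t.
\end{aligned}
$$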
It is evident that Problem P1 is a non-convex optimization problem. The non-convexity stems from the complex coupling relationships among the optimization variables and the presence of nonlinear terms in the objective function. Traditional convex optimization methods cannot handle such non-convex structures and thus fail. While heuristic algorithms can obtain feasible solutions, they are easily trapped in local optima. Decomposition–coordination methods, due to the strong coupling between variables, result in cyclic dependencies among subproblems, making the problem difficult to solve directly with traditional algorithms. Furthermore, to solve the formulated optimization problem in a dynamic network environment, this paper proposes a DRL framework based on the SD3 algorithm. This framework directly addresses the non-convex problem through adaptive learning to obtain optimal resource allocation decisions. The process of solving the optimization problem with this algorithm is introduced next.
4. Deep Reinforcement Learning Framework Based SD3 Algorithm
To address these challenges, this paper employs a DRL framework to find the optimal resource allocation decision strategies. Specifically, the problem is first modeled as an MDP, with relevant definitions provided for the model. Subsequently, an SD3 algorithm is proposed to solve the optimization problem.
4.1. Markov Decision Process Modeling
In the optimization problem P1, the decisions regarding the offloading proportion of computing tasks, compression ratio, and computational resource allocation depend only on the current state rather than on past states, which makes this a typical MDP consisting of three main elements as follows.
4.1.1. State Space
At the beginning of time slot
t, the neural network receives as input the current system state, including the task and energy status of all users and the considered network status. Specifically,
where
represents the task parameters of all users;
is the remaining battery energy at each user;
denotes the channel gains between all users and the BS.
4.1.2. Action Space
At time slot
t, the action space consists of the offloading proportion, compression ratio, and computation resource allocation of the MEC server. Specifically, the action space is defined as
where
is the task offloading strategy;
is the set of all users’ task data compression ratio;
represents the computation resource allocation strategy at BS.
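As an illustration of how a single continuous action vector can encode all three decision groups, a minimal sketch is given below; the 3N-dimensional layout, the clipping to [0, 1], and the normalization of the frequency shares are assumptions made for illustration rather than the exact encoding used in the paper.

```python
import numpy as np

def unpack_action(action, num_users, f_mec_max):
    """Split a continuous action vector into offloading ratios, compression
    ratios, and per-user MEC frequency allocations (illustrative layout)."""
    a = np.clip(np.asarray(action, dtype=float), 0.0, 1.0).reshape(3, num_users)
    offload_ratio = a[0]                   # fraction of each user's task offloaded to the BS
    compress_ratio = a[1]                  # per-user data compression decision in [0, 1]
    shares = a[2] / max(a[2].sum(), 1e-8)  # normalize so allocations sum to the server budget
    f_alloc = shares * f_mec_max           # CPU frequency assigned to each user's offloaded task
    return offload_ratio, compress_ratio, f_alloc
```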
4.1.3. Reward Function
The design of the reward function should align with the system’s optimization objective, ensuring that the system evolves toward the minimization of the total long-term task latency. Specifically,
where
is a penalty term: actions that violate the constraints are penalized through the reward value, which helps the agent make better decisions. It is defined as
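The exact penalty expression is not reproduced in this extract. A minimal sketch of one plausible realization, assuming the reward is the negative total latency reduced by a fixed penalty for each violated constraint (the penalty weight and the constraint checks are illustrative), is given below.

```python
import numpy as np

def compute_reward(per_user_latency, per_user_energy, battery, slot_len,
                   penalty_weight=10.0):
    """Negative total latency minus a penalty for violating the energy-causality
    (C1) or per-slot completion (C2) constraints (assumed penalty form)."""
    total_latency = float(np.sum(per_user_latency))
    violations = int(np.sum(per_user_energy > battery))     # C1 violated per user
    violations += int(np.sum(per_user_latency > slot_len))  # C2 violated per user
    return -total_latency - penalty_weight * violations
```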
4.2. The Framework of SD3 Algorithm
DRL algorithms for continuous control, notably DDPG and Twin Delayed DDPG (TD3), exhibit inherent limitations in value estimation accuracy [
34]. For example, DDPG suffers from systematic overestimation bias due to maximization operations in its Q-function updates, leading to inflated action-value approximations and suboptimal policies. Although TD3 mitigates this overestimation through clipped double Q-learning and delayed policy updates, it inadvertently induces persistent underestimation bias. This secondary bias constrains exploration efficacy and ultimately degrades asymptotic performance. In contrast, the SD3 algorithm employs a dual-mechanism framework to counteract the inherent biases in value estimation. Its key components include (i) a Boltzmann softmax operator applied to double Q-estimators, which replaces conventional max/min operators and alleviates overestimation bias; and (ii) bias-corrected targets that dynamically balance the variances between the two estimators. This design harmonizes opposing biases while preserving approximation accuracy during temporal-difference updates. Compared to TD3 and DDPG, SD3 achieves superior sample efficiency and asymptotic performance by simultaneously constraining both overestimation and underestimation biases, and the integration of softmax-based regularization with variance-aware double estimation yields strong bias-resilient continuous control [
35]. In
Figure 2, the framework of SD3 algorithm is given to clarify the SD3 training and execution process.
The SD3 algorithm is founded upon the actor-critic framework with double actor networks
and double critic networks
, whose corresponding network parameters are
and
, respectively. When the agent starts learning, the SD3 algorithm randomly samples transitions
from the replay buffer to form a mini-batch of
N training samples. The target actor networks predict the next actions
, and the
Q value can be given as
In the subsequent phase, to smooth the
Q value estimate and reduce the estimation error, the algorithm introduces the softmax operator, which is defined as
where
is the softmax operator parameter, and
represents the probability density function of a Gaussian distribution. The softmax operator serves to mitigate the impact of actions with estimation error on the results by taking a weighted average of
Q values, thereby enhancing the stability of the policy. The target value
can be calculated in conjunction with the discount factor
as
, which governs the agent’s trade-off between immediate and future latency penalties.
The target value is employed to adjust the parameters of the critic network, allowing it to better approximate the value of the state–action pair. The loss function is designed to measure the difference between the target value and the current
Q value, which can be represented as
The parameters of the critic network, denoted as , are iteratively adjusted using gradient descent to reduce the loss function.
Next, the goal of updating the actor network is to determine the optimal policy parameters that maximize the objective function
. These parameters maximize the expected return, which can be given by
Subsequently, the policy gradient can be calculated as
Finally, the target networks are subject to a soft update, which implements a delayed update strategy. This method improves convergence stability by reducing the fluctuations of the target networks. The specific procedures of the SD3 algorithm are presented in Algorithm 1.
| Algorithm 1: SD3 algorithm for joint data compression and resource allocation |
1: Input: Environment states (channel, task, energy), neural networks
2: Output: Optimal policy for compression, offloading, resource allocation
3: Initialize SD3 agent (Actor & Critic networks), replay buffer
4: for each episode do
5:   Observe initial state
6:   for each time slot t do
7:     Agent selects action using policy
8:     Execute action:
9:       Users: Compress data [Equation (4)], offload
10:      MEC: Decompress & compute [Equations (6) and (9)]
11:     Update battery with harvested RF energy [Energy Harvesting Model]
12:     Calculate reward [Equation (11)]
13:     Store transition in replay buffer
14:     Sample mini-batch; update Critic via softmax-smoothed Q-values
15:     Update Actor using deterministic policy gradient
16:   end for
17: end for
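A minimal PyTorch-style sketch of the softmax-smoothed target used in step 14 of Algorithm 1 is given below. The number of sampled actions K, the temperature beta, the exploration noise, the clipping of actions to [0, 1], and the use of the element-wise minimum of the two target critics are illustrative assumptions rather than the exact configuration of the SD3 algorithm.

```python
import torch

def softmax_target(critic1_t, critic2_t, actor_t, next_state, reward,
                   gamma=0.95, beta=5.0, K=50, noise_std=0.2):
    """Boltzmann-softmax target value for a mini-batch of transitions."""
    batch = next_state.shape[0]
    a_mu = actor_t(next_state)                             # [batch, act_dim]
    act_dim = a_mu.shape[-1]
    # Sample K Gaussian-perturbed actions around the target policy output.
    noise = noise_std * torch.randn(batch, K, act_dim)
    actions = (a_mu.unsqueeze(1) + noise).clamp(0.0, 1.0)  # [batch, K, act_dim]
    s_rep = next_state.unsqueeze(1).expand(-1, K, -1)
    sa = torch.cat([s_rep, actions], dim=-1).reshape(batch * K, -1)
    # Conservative double estimate: element-wise minimum of the two target critics.
    q = torch.min(critic1_t(sa), critic2_t(sa)).reshape(batch, K)
    # Importance-weighted softmax average over the sampled actions.
    log_p = -0.5 * (noise / noise_std).pow(2).sum(dim=-1)  # Gaussian log-density up to a constant
    weights = torch.softmax(beta * q - log_p, dim=1)
    soft_q = (weights * q).sum(dim=1, keepdim=True)        # [batch, 1]
    return reward + gamma * soft_q
```

In a full training loop, this target would feed the mean-squared-error loss of each critic, followed by the deterministic policy-gradient update of the actors (step 15) and a soft update of the target networks.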
4.3. Computational Complexity Analysis
In DRL frameworks, parameter updates are dominated by matrix multiplication operations. Considering that the dimensions of the state space and the action space are and , respectively, the SD3 algorithm adopts an L-layer fully connected network, where denotes the number of neurons in the l-th layer. Compared to TD3 and DDPG, which share the same network architecture, all three algorithms have per-iteration complexity scaling as for batch size . While SD3’s novel components (the softmax operator and variance-corrected targets) introduce additional steps, they contribute only overhead. The per-iteration computational complexity of all algorithms grows linearly with the number of users N, since the dimensions of both the state and action spaces are proportional to N. Thus, SD3 maintains the same asymptotic complexity as its counterparts: over T time frames. Empirically, this translates to a minimal runtime increase of only 5–7% per training step, a negligible cost that enables SD3’s superior bias correction and sample efficiency.
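Under the stated assumptions, the per-iteration cost of each fully connected network can be written as
$$ \mathcal{O}\!\Big( B_s \sum_{l=1}^{L} n_{l-1}\, n_l \Big), \qquad n_0 = |\mathcal{S}| + |\mathcal{A}| \propto N, $$
where $B_s$ denotes the mini-batch size and $n_l$ the width of the $l$-th layer (these symbol names are assumptions, since the original notation is not reproduced in this extract); the total training cost over $T$ time frames therefore grows linearly in both $N$ and $T$.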
In contrast, Actor-Critic (AC) methods typically maintain separate policy and value networks with
and
layers respectively, resulting in higher computational complexity of
due to dual-network updates. Dueling Double Deep Q-Network (D3QN) introduces dueling architecture where the final hidden layer splits into parallel advantage and value streams, adding moderate overhead with complexity
. To sum up,
Table 1 compares the computational complexity of the proposed algorithm and the traditional ones. It is clear from
Table 1 that, while both AC and D3QN incur higher computational costs than SD3, they remain efficient within their respective learning paradigms. Overall, SD3 achieves an advanced bias-variance trade-off without a significant increase in complexity, maintaining competitive efficiency among state-of-the-art DRL algorithms.
5. Simulation Results
The system model postulates a BS accommodating
N = 10 users uniformly dispersed within a circular coverage area with a radius of
d. The number of CPU cycles required per bit for computing ranges between [
] cycles/bit, and the number of CPU cycles required per bit of data for compression and decompression is 200 cycles/bit [
36]. In the SD3 algorithm, the actor and critic networks consist of two hidden layers with sizes [
], and the activation function applied to each hidden layer is the rectified linear unit. The learning rates for the actor and critic networks are adjusted to
and
, respectively. Unless otherwise specified, the basic parameters are defined in
Table 2. Moreover, to assess the performance of the proposed scheme, it is compared with the following traditional baseline DRL algorithms.
Actor–Critic (AC): The Actor–Critic algorithm combines value-based methods and policy-based methods [
37]. The target
Q-values are calculated by the same network, and at each step the maximum of the current estimates of all action values from the Bellman equation is taken, so the neural network tends to accumulate positive errors, thereby causing the overestimation problem.
Dueling Double Deep Q Network (D3QN): A value-based improvement of the discrete-action Deep Q Network, in which the network used for selecting actions differs from the network used for evaluating them, and the lower of the two value estimates is chosen [
38]. This approach introduces the drawback of an overall underestimation problem.
Deep Deterministic Policy Gradient (DDPG): The algorithm integrates deep learning with deterministic policy gradient to address issues in continuous action space problems [
39]. Built upon the Actor–Critic framework, the DDPG algorithm achieves effective off-policy learning in complex high-dimensional environments, but it cannot avoid value estimation errors.
Figure 3 shows the convergence behavior under different learning rates. As illustrated in
Figure 3, both excessively large and excessively small learning rates can degrade the algorithm’s convergence performance. A too-small learning rate may lead to slow convergence toward local optima, while a too-large learning rate can cause fluctuations and prevent convergence. A learning rate of
achieves approximately the best average reward, which also shows that the algorithm is sensitive to the learning rate. Moreover, there is a clear trade-off between convergence speed and stability: higher learning rates offer faster convergence but larger reward fluctuations, while lower learning rates yield slower learning but more stable rewards.
Figure 4 illustrates the convergence performance of various algorithms over 2000 training episodes. The results indicate that all algorithms converge as the number of training episodes increases. Overall, the SD3 algorithm outperforms the other algorithms significantly, achieving higher average rewards during training. This is primarily attributable to the enhanced framework design of the SD3 algorithm and the incorporation of the softmax operation, which improves the precision of the value function. This enables the SD3 algorithm to learn and optimize strategies more efficiently in complex environments while avoiding the impact of overestimation and underestimation biases, thereby exhibiting superior performance during training. After 2000 episodes, the AC algorithm lags behind both the SD3 and DDPG algorithms in reward value, reflecting its limited capability in handling complex tasks and the fact that its on-policy framework makes poor use of the stored transition tuples, while the D3QN algorithm exhibits the poorest performance due to the lower accuracy of discrete actions compared with continuous actions.
Moreover, to validate the effectiveness of the proposed softmax operator in balancing value estimation biases of the Critic network, we conducted an ablation study on the SD3 algorithm: the convergence performance was compared under two configurations with softmax operator (SD3 with softmax) and without softmax operator (SD3 without softmax) in
Figure 5. It can be seen from
Figure 5 that the SD3 algorithm incorporating the softmax operator achieves higher average reward values. This result verifies the critical role of the softmax operator in suppressing value function estimation biases and enhancing policy optimization performance. This is mainly because the SD3 algorithm without the softmax operator resembles the basic dual-Critic architecture, which is more prone to accumulation of value estimation biases in dynamic and complex NOMA-MEC resource allocation environments, leading to larger training fluctuations and lower policy performance after convergence. In contrast, the softmax operator effectively balances biases between dual Q-estimators through a weighted smoothing mechanism based on the Boltzmann distribution, avoiding common overestimation or underestimation issues in traditional DDPG/TD3 algorithms, thereby improving the algorithm’s learning stability and final performance.
Figure 6 demonstrates the relationship between the total long-term task latency and the MEC server computing frequency. As shown in
Figure 6, the total long-term task latency diminishes as the maximum computing frequency rises. This occurs because, as the computing resources of the MEC server increase, task processing becomes faster and task execution latency decreases. Consequently, users are more inclined to reduce task execution latency through computation offloading. Compared with the benchmark algorithms, the SD3 algorithm maintains the lowest average latency, and its total long-term task latency is reduced by 26.9% when the computing capability increases from 5 GHz to 10 GHz, the range over which the latency reduction is most pronounced.
Figure 7 illustrates the trend of the total long-term task latency under different task data sizes. The curve relationship indicates that as the data size of user computing tasks expands, the overall latency in the system exhibits a synchronous growth trend. This phenomenon can be attributed to two key mechanisms: First, the local computing latency increases linearly with growing data size. Second, in MEC scenarios, the transmission latency during the data transfer phase rises significantly due to enlarged packet sizes. Simulation comparison results demonstrate that, under different task data sizes, the proposed SD3 algorithm in this paper shows a marked advantage in latency control compared to other DRL-based optimization schemes.
Figure 8 illustrates the connection between the total long-term task latency and the number of users. It is evident that the total long-term task latency increases progressively as the number of users grows, because more users bring a greater variety of processing tasks and larger aggregate data sizes. Meanwhile, the limited computing resources of the MEC server lead to higher computing latency, and uplink congestion further reduces the transmission rate. In addition, as the user count grows, both the state and action dimensions that the agent must handle expand. Moreover, we can also see from
Figure 8 that the SD3 algorithm is more effective than the DDPG, AC, and D3QN algorithms in terms of the total long-term task latency, which demonstrates its robustness in dynamic environments.
In
Figure 9, two benchmark schemes are introduced: full compression and no compression. From
Figure 9, we can observe that, when the available bandwidth is limited, the full compression scheme will perform better than the scheme without compression. However, when the bandwidth reaches a certain value, the performance of the scheme without compression will be better than that of the full compression scheme. This is because when the bandwidth resources are sufficient, the transmission rate is already high enough, and the compression processing will add an additional burden to the transmission latency. Moreover, we also can see that the proposed adaptive compression scheme can dynamically modify the compression ratio in the offloading task according to the different channel bandwidths, thereby exhibiting superior performance compared to the two benchmark schemes.
To further dissect the source of performance improvement of the SD3 algorithm,
Figure 10 compares the performance of the complete SD3 algorithm (SD3 + opt, i.e., equipped with both the softmax operator and the cooperative optimization mechanism) and the baseline configurations (DDPG + random, SD3 + random, DDPG + opt) under different numbers of users. As shown, with increasing user numbers, the system delays of all algorithms show an upward trend due to intensified competition for computing and communication resources. However, the complete SD3 algorithm (SD3 + opt) achieves the lowest task processing delay across all user scales, significantly outperforming the other configurations. Notably, compared to SD3 with a random optimization strategy (SD3 + random), SD3 with the complete optimization mechanism (SD3 + opt) shows a significant performance improvement. This clearly indicates that the joint effect of the softmax operator and the compression-offloading-resource cooperative optimization mechanism is the core of the algorithm’s superiority. The softmax operator ensures stable and efficient learning, while the cooperative optimization mechanism enables adaptive optimal decision-making in dynamic multi-user environments. Their combined effect significantly reduces the system’s long-term task processing delay.
6. Conclusions
This study tackles the problem of minimizing the total long-term latency of computational tasks within NOMA-MEC networks. Data compression techniques are introduced during task offloading to MEC servers to significantly reduce communication latency. A joint optimization framework is established, integrating computational resource allocation, offloading decisions, and data compression ratio selection. To efficiently solve this complex problem, we propose the SD3 algorithm with Softmax exploration. This approach substantially enhances convergence performance by effectively mitigating the overestimation and underestimation biases common in conventional reinforcement learning frameworks. Comprehensive experimental results demonstrate that the proposed SD3 algorithm outperforms benchmark methods, achieving significant improvements in convergence speed and reducing the total system latency. This work provides a robust solution for enhancing task processing efficiency in next-generation MEC systems.
While the SD3-based joint optimization framework delivers remarkable latency reduction in NOMA-MEC networks, it presents notable limitations. Algorithmic hurdles include the dual-actor dual-critic architecture, which elevates training complexity by requiring four parallel network updates per iteration, sensitivity to reward function design, and scalability bottlenecks arising from linearly expanding state/action spaces. Practically, deployment necessitates GPU/NPU-equipped edge servers, lightweight terminal codecs, NOMA modules, and millisecond-level CSI feedback. In future work, we will focus on three key issues: developing small-scale prototypes, exploring distributed DRL for enhanced scalability, and the long-term integration of semantic compression, cross-layer orchestration, and B5G/6G standards to enable operator edge-cloud deployment.