DRL-Assisted Resource Allocation for NOMA-MEC Offloading with Hybrid SIC

Multi-access edge computing (MEC) and non-orthogonal multiple access (NOMA) are regarded as promising technologies for improving the computation capability and offloading efficiency of mobile devices in the sixth-generation (6G) mobile system. This paper focuses on a hybrid NOMA-MEC system, where multiple users are first grouped into pairs, the users in each pair offload their tasks simultaneously via NOMA, and a dedicated time duration is then scheduled for the more delay-tolerant user to upload its remaining data via orthogonal multiple access (OMA). In conventional NOMA uplink transmission, successive interference cancellation (SIC) decodes the superposed signals successively in an order determined by the channel state information (CSI) or the quality-of-service (QoS) requirements. In this work, we integrate the hybrid SIC scheme, which dynamically adapts the SIC decoding order in each NOMA group. To solve the user grouping problem, a deep reinforcement learning (DRL)-based algorithm is proposed to obtain a close-to-optimal user grouping policy. Moreover, we minimize the offloading energy consumption by deriving the closed-form solution to the resource allocation problem. Simulation results show that the proposed algorithm converges quickly and that the NOMA-MEC scheme outperforms the existing OMA scheme.


I. INTRODUCTION
With fifth-generation (5G) networks now available, the sixth-generation (6G) wireless network is currently under research and is expected to provide superior performance to satisfy the growing demands of mobile equipment, such as latency-sensitive, energy-hungry and computationally intensive services and applications [1], [2]. For example, Internet of Things (IoT) networks are developing rapidly, where massive numbers of nodes are connected together, and IoT nodes can not only communicate with each other but also process acquired data [3]-[5]. However, such IoT and many other terminal devices are constrained by battery life and computational capability, and thereby cannot support computationally intensive tasks. A conventional approach to improve the computation capability of mobile devices is mobile cloud computing (MCC), where computation-intensive tasks are offloaded to central cloud servers for data processing [6], [7]. However, MCC causes significant delays due to the long propagation distances. To address the offloading delay issue, especially for delay-sensitive applications in future 6G networks, multi-access edge computing (MEC) has emerged as a decentralized architecture that provides computation capability close to the terminal devices; MEC servers are generally deployed at the base stations to provide cloud-like task processing services [7]-[10].
From the communication perspective, non-orthogonal multiple access (NOMA) has been recognized as a promising technology to improve spectral efficiency and support massive connectivity, as it enables multiple users to utilize the same resource block, such as time and frequency, for transmission [11], [12]. Taking power-domain NOMA as an example, the signals of multiple users are multiplexed in the power domain by superposition coding, and at the receiver side, successive interference cancellation (SIC) is adopted to remove the multiple access interference successively [13]. Hence, integrating NOMA with MEC can potentially improve the service quality of MEC, including lower transmission latency and more massive connectivity, compared to conventional orthogonal multiple access (OMA).

A. Related Works
The integration of NOMA and MEC has been well studied so far, and researchers have proposed various approaches to optimal resource allocation that minimize the offloading delay and energy consumption. In [14], the authors minimized the offloading latency for a multi-user scenario, in which the power allocation and task partition ratio were jointly optimized. The partial offloading policy determines the amount of data to be offloaded to the server, and the remainder is processed locally. The authors of [15] proposed an iterative two-user NOMA scheme to minimize the offloading latency, in which two users offload their tasks simultaneously by NOMA. Since one of the users suffers performance degradation introduced by NOMA, instead of forcing both users to complete offloading at the same time, the remaining data is offloaded together with the next user during the following time slot. Moreover, many existing works investigate the energy minimization of NOMA-MEC networks. For example, the joint optimization of central processing unit (CPU) frequency, task partition ratio and power allocation for a NOMA-MEC heterogeneous network was considered in [16], [17]. In [18], the authors considered a multi-antenna NOMA-MEC network and presented an approach to minimize the weighted sum energy consumption by jointly optimizing the computation and communication resources.
In addition to the existing works on the pure NOMA schemes mentioned above, a few works also combine NOMA and OMA, which is termed hybrid NOMA [19]. In [19], the authors proposed a two-user hybrid NOMA scenario, in which one user is less delay-tolerant than the other. The two users offload during the first time slot by NOMA, and the user with the longer deadline offloads its remaining data during an additional time duration by OMA. This configuration presents significant benefits, outperforming both OMA and pure NOMA in terms of energy consumption, since energy can be saved for the delay-tolerant user instead of forcing both users to finish offloading at the same time as in pure NOMA networks. In [20], [21], the hybrid NOMA scheme is extended to multi-user scenarios, in which a two-to-one matching algorithm is utilized to pair every two users into a group, and each group offloads through a sub-carrier.
For the resource allocation in NOMA-MEC networks, user grouping is a non-convex problem, which is typically solved by exhaustive search or by applying matching theory. Deep reinforcement learning (DRL) is recognized as a novel approach to this problem, being a powerful tool for real-time decision-making tasks, yet only a handful of papers have utilized it for user grouping and sub-channel assignment, such as [22], [23], which output the user grouping policy for uplink and downlink NOMA networks, respectively. Moreover, in most NOMA works, the SIC decoding order is fixed in advance, determined either by the channel state information (CSI) or by the quality-of-service (QoS) requirements of the users [24]-[26]. A recent work [27] proposed a hybrid SIC scheme that switches the SIC decoding order dynamically, which has shown significant performance improvement in uplink NOMA networks. The authors of [28] integrated the hybrid SIC scheme with an MEC network to serve two uplink users, and the results reveal that hybrid SIC outperforms the QoS-based decoding order.

B. Motivation and Contributions
Motivated by the existing research on NOMA-MEC, in this paper we investigate the energy minimization for the uplink transmission in multi-user hybrid NOMA-MEC networks with hybrid SIC. More specifically, a DRL-based framework is proposed to generate a user grouping policy, and the power allocation, time allocation and task partition assignment are jointly optimized for each group. The DRL framework collects experience data, including CSI, deadlines and energy consumption, as labeled data to train the neural networks (NNs). The main contributions of this paper are summarized as follows:
• A hybrid NOMA-MEC network is proposed, in which an MEC server is deployed at the base station to serve multiple users. All users are divided into pairs, and each pair is assigned to one sub-channel. The users in each group adopt NOMA transmission with the hybrid SIC scheme in the first time duration, and the user with the longer deadline transmits its remaining data by OMA in the following time duration. We propose a DRL-assisted user grouping framework with joint power allocation, time scheduling, and task partition assignment to minimize the offloading energy consumption under transmission latency and offloading data amount constraints.
• By assuming that the user grouping policy is given, the energy minimization problem for each group is non-convex due to the multiplication of variables and a 0-1 indicator function, which indicates the two possible decoding orders. The solution to the original problem can be obtained by solving each case separately.
A multilevel programming method is proposed, where the energy minimization problem is decomposed into three sub-problems: power allocation, time scheduling, and task partition assignment. By carefully analyzing the convexity and monotonicity of each sub-problem, the solutions to all three sub-problems are obtained optimally in closed form. The solution to the energy minimization problem for each case can then be determined optimally by adopting the decisions successively from the lower level to the higher level (i.e., from the optimal task partition assignment to the optimal power allocation). Therefore, the solution to the original problem is obtained by comparing the numerical results of the two cases and selecting the solution with the lower energy consumption.
• A DRL framework for user grouping is designed based on a deep Q-learning algorithm. We provide a training algorithm for the NN to learn from experience based on the channel conditions and delay tolerances of the users over a period of slotted time, and the user grouping policy is learned gradually at the base station by maximizing the negative of the total offloading energy consumption.
• Simulation results are provided to illustrate the convergence speed and the performance of the learned user grouping policy in comparison with a random user grouping policy. Moreover, compared with the OMA-MEC scheme, our proposed NOMA-MEC scheme achieves superior performance with much lower energy consumption.

C. Organization
The rest of the paper is structured as follows. The system model and the formulated energy minimization problem for our proposed NOMA-MEC scheme are described in Section II. Section III presents the optimal solution to the energy minimization problem. Following that, the DRL-based user grouping algorithm is introduced in Section IV. Finally, the simulation results on the convergence and average performance of the proposed scheme are shown in Section V, and Section VI concludes this paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model
In this paper, we consider a NOMA-MEC network, where a base station equipped with an MEC server serves K resource-constrained users. During one offloading cycle, each user offloads its task to the MEC server and then obtains the results, which are processed at the MEC server. Generally, the data size of the computation results is much smaller than that of the offloaded data in practice; thus, the time for downloading the results can be omitted [18]. Moreover, since the MEC server has much higher computation capability than the mobile devices, the data processing time at the MEC server can be ignored compared to the offloading time [14]. Therefore, in this work, the total offloading delay is approximated by the time consumed uploading the data to the base station.
We assume that all K users are divided into Φ groups transmitting on different sub-channels, and each group φ contains two users, such that K = 2Φ. In each group, we denote the user with the shorter deadline by U m,φ and the user with the relatively longer deadline by U n,φ, which indicates τ m,φ ≤ τ n,φ, where τ i,φ is the latency requirement of U i,φ, ∀i ∈ {m, n}, in group φ. Because U m,φ has the tighter deadline, it is assumed that the whole duration τ m,φ is used up, which means that the offloading time t m,φ = τ m,φ.
In this system model, we adopt the block channel model, which indicates that the channel condition remains static during each time slot. Accounting for small-scale fading, the channel gain of a user in group φ can be expressed as $\tilde{h}_{i,\phi} = h_{i,\phi}/\sqrt{1 + d_{i,\phi}^{\alpha}}$, where $h_{i,\phi} \sim \mathcal{CN}(0, 1)$ is the Rayleigh fading coefficient, d i,φ is the distance between U i,φ and the base station, and α is the path loss exponent. The channel gain is normalized by the power of the zero-mean additive white Gaussian noise (AWGN) with variance σ², which can be written as $|h_{i,\phi}|^2 = |\tilde{h}_{i,\phi}|^2/\sigma^2$. As shown in Fig. 1, since the two users have different delay tolerances, it is natural to note that U n,φ does not need to finish offloading within τ m,φ via NOMA transmission, and can potentially save energy by utilizing the spare time τ n,φ − τ m,φ. Hence, our proposed hybrid NOMA scheme enables U n,φ to offload part of its data while U m,φ offloads its task during τ m,φ, and an additional time duration t r,φ is scheduled within each time slot to transmit U n,φ's remaining data. The task transmission of U n,φ should be completed within τ n,φ, i.e., $t_{m,\phi} + t_{r,\phi} \le \tau_{n,\phi}$. As aforementioned, the users in each group occupy the same sub-channel to upload their data to the base station simultaneously via NOMA. In NOMA uplink transmission, SIC is adopted at the base station to decode the superposed signal. Conventionally, the SIC decoding order is based on either the users' CSI or their QoS requirements [27]. For the QoS-based case, to guarantee that U m,φ can offload its data by τ m,φ, U n,φ is set to be decoded first, and its data rate is $R_{n,\phi} = B\log_2\left(1 + \frac{P_{n,\phi}|h_{n,\phi}|^2}{P_{m,\phi}|h_{m,\phi}|^2 + 1}\right)$, where B is the bandwidth of each sub-channel, and P n,φ and P m,φ are the transmission powers of U n,φ and U m,φ during NOMA transmission, respectively. Based on the NOMA principle, the signal of U m,φ can then be decoded free of interference once the rate in (4) is achieved and U n,φ's signal is removed, and the data rate for U m,φ can be written as $R_{m,\phi} = B\log_2\left(1 + P_{m,\phi}|h_{m,\phi}|^2\right)$. If U n,φ is decoded first according to the CSI principle, the achievable rate is the same as (4), since U n,φ treats the signal of U m,φ as noise. In contrast, U m,φ can
be decoded first if the condition in (6) holds. Then the data rate of U n,φ can be obtained by removing the information of U m,φ, which is $R_{n,\phi} = B\log_2\left(1 + P_{n,\phi}|h_{n,\phi}|^2\right)$. If the same power is allocated to U n,φ in both the QoS-based and CSI-based schemes, the achievable rate in (7) is evidently higher than that in (4), and the decoding order in (7) is preferred in this case. However, since the condition (6) cannot always be satisfied, the system has to change the decoding order dynamically to achieve better performance, which motivates us to utilize the hybrid SIC scheme.
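The two decoding orders above differ only in which user's signal is treated as interference. As an illustrative sketch (not the paper's code), the achievable rates with noise-normalized channel gains can be computed as follows; the bandwidth, powers, and gains used below are example parameters:

```python
import math

def noma_rates(B, Pm, Pn, gm, gn, m_first):
    """Achievable rates (bits/s) for one NOMA pair on a sub-channel.

    gm and gn are the noise-normalized channel gains |h|^2, so the noise
    power equals 1. If m_first is True, U_m is decoded first and U_n is
    decoded interference-free after SIC; otherwise U_n is decoded first
    and sees U_m's signal as interference.
    """
    if m_first:
        Rm = B * math.log2(1 + Pm * gm / (Pn * gn + 1))  # decoded under U_n's interference
        Rn = B * math.log2(1 + Pn * gn)                  # decoded after SIC removes U_m
    else:
        Rn = B * math.log2(1 + Pn * gn / (Pm * gm + 1))  # decoded under U_m's interference
        Rm = B * math.log2(1 + Pm * gm)                  # decoded after SIC removes U_n
    return Rm, Rn
```

Note that the sum rate is the same under both orders; the hybrid SIC scheme matters because it changes how the rate is split between the two users, and hence the energy each user needs to meet its own deadline.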
In addition, during t r,φ, U n,φ adopts OMA transmission, and the data rate can be expressed as $R_{r,\phi} = B\log_2\left(1 + P_{r,\phi}|h_{n,\phi}|^2\right)$, where P r,φ represents the transmission power of U n,φ during the second time duration t r,φ.
In this work, the data length of each task is denoted by L, which is assumed to be bitwise independent, and we propose a partial offloading scheme in which each task can be processed locally and remotely in parallel. An offloading partition coefficient β φ ∈ [0, 1] is introduced, which indicates what fraction of the data is offloaded to the MEC server, with the rest executed by the local device in parallel. Thus, for each task, β φ L bits are offloaded to the server and (1 − β φ)L bits are processed locally.
U n,φ can take advantage of local computing by executing (1 − β φ)L bits locally during the scheduled NOMA and OMA time durations t m,φ + t r,φ. Therefore, assuming the CPU frequency is set to the minimum value that completes the local workload by the deadline, the energy consumption of U n,φ's local execution, denoted by E loc n,φ, can be expressed as $E^{loc}_{n,\phi} = \kappa_0 \frac{\left(C(1-\beta_\phi)L\right)^3}{(t_{m,\phi} + t_{r,\phi})^2}$, where κ 0 denotes the coefficient related to the mobile device's processor and C is the number of CPU cycles required for computing each bit.
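The cubic dependence of the local energy on the workload comes from the standard MEC model in which per-cycle energy scales with the square of the CPU frequency. A minimal sketch, assuming the CPU runs at exactly the frequency needed to finish within the available time (the parameter values below are illustrative, not from Table I):

```python
def local_energy(kappa0, C, L, beta, t_total):
    """Energy for computing (1 - beta) * L bits locally within t_total seconds.

    Assumes the CPU runs at the minimum frequency f = cycles / t_total and
    consumes kappa0 * f^2 joules per cycle, a standard model in the MEC
    literature; the paper's exact expression is assumed to take this form.
    """
    cycles = C * (1.0 - beta) * L     # total CPU cycles for the local part
    f = cycles / t_total              # minimum frequency meeting the deadline
    return kappa0 * cycles * f ** 2   # = kappa0 * cycles^3 / t_total^2
```

This makes explicit why scheduling the extra OMA duration helps even the local side: a larger t m,φ + t r,φ lowers the required frequency quadratically.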
The total energy consumed by U n,φ per task consists of three parts: the energy consumed by local computing, and the transmission energy during the NOMA and OMA offloading durations. The offloading power is scheduled separately over these two time durations according to the hybrid SIC scheme, and thereby the offloading energy consumption E off n,φ can be expressed as $E^{off}_{n,\phi} = P_{n,\phi}\tau_{m,\phi} + P_{r,\phi}t_{r,\phi}$. Hence, the total energy consumption can be expressed as $E^{tot}_{n,\phi} = E^{loc}_{n,\phi} + E^{off}_{n,\phi}$.

B. Problem Formulation
We assume that the resource allocation of U m,φ is given as a constant in each group, since U m,φ is treated as the primary user whose requirement needs to be guaranteed with priority, and we only focus on the energy minimization for U n,φ during both the NOMA and OMA durations. Given the user grouping policy, which will be solved in Section IV, the energy minimization problem for each pair can be formulated as (P1): minimize E tot n,φ over P n,φ ≥ 0, P r,φ ≥ 0, t r,φ and β φ, subject to constraints (12b)-(12f), where the indicator function 1 n,φ specifies the SIC decoding order.

III. ENERGY MINIMIZATION FOR NOMA-MEC WITH HYBRID SIC SCHEME
In this section, a multilevel programming method is introduced to decompose problem (P1) into three sub-problems, i.e., power allocation, time slot scheduling and task assignment, each of which can be solved optimally in closed form. The optimal solution to the original problem (P1) can thereby be found by solving these three sub-problems successively, as provided in the subsections below.

A. Power Allocation
With t r,φ and β φ fixed, problem (P1) reduces to a power allocation problem, which can be rewritten as (P2): minimize E tot n,φ over P n,φ ≥ 0, P r,φ ≥ 0. Since there exists an indicator function, (P2) is solved in two different cases, i.e., when 1 n,φ = 1 and when 1 n,φ = 0. The following theorem provides the optimal solution for both cases.
Theorem 1. The optimal power allocation to (P2) is given by the following two cases according to the indicator function:
1) For 1 n,φ = 1, U m,φ is decoded first, and the power allocation for this decoding order is presented as follows:
a) When P n,φ > 0 and P r,φ > 0, U n,φ offloads in both time durations, which is termed hybrid NOMA, and the power allocation is given by the corresponding closed-form expressions.
b) When P r,φ = 0, U n,φ only offloads during the first time duration τ m,φ; this scheme is termed pure NOMA.
c) When P n,φ = 0, U n,φ offloads solely during the second time duration t r,φ, which corresponds to OMA.
2) For 1 n,φ = 0, U n,φ is decoded first:
a) When P n,φ > 0 and P r,φ > 0, the hybrid NOMA power allocation applies.
b) When P r,φ = 0, the pure NOMA case applies.
c) When P n,φ = 0, the OMA case applies.
Proof. Refer to Appendix A.
Remark 1. Theorem 1 provides the optimal power allocation for both decoding sequences, i.e., U m,φ is decoded first when 1 n,φ = 1, and U n,φ is decoded first when 1 n,φ = 0. The optimal solution to (P1) is obtained by a numerical comparison of the energy consumption of these two cases. Each case can be further divided into three offloading scenarios, namely hybrid NOMA, pure NOMA and OMA, based on the power allocation. In the hybrid NOMA case, U n,φ transmits during both τ m,φ and t r,φ, which indicates P n,φ > 0, P r,φ > 0 and t r,φ > 0.
The pure NOMA scheme indicates that U n,φ only transmits simultaneously with U m,φ during τ m,φ, and therefore P r,φ = 0 and t r,φ = 0. In addition, the OMA case represents that U m,φ occupies τ m,φ solely, and U n,φ only transmits during t r,φ.
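The selection logic in Remark 1 reduces to evaluating the candidate closed-form solutions and keeping the cheapest feasible one. A minimal sketch, in which the candidate energies are supplied externally as a hypothetical dictionary (the closed-form evaluators themselves are those of Theorem 1):

```python
import math

def select_offloading_scheme(candidates):
    """Return the label and energy of the minimum-energy candidate.

    `candidates` is a hypothetical dict mapping a (decoding order, scheme)
    label to the energy of the corresponding closed-form solution from
    Theorem 1, with math.inf marking infeasible cases.
    """
    best = min(candidates, key=candidates.get)
    return best, candidates[best]

# Example energies (illustrative numbers only).
example = {
    ("m_first", "hybrid NOMA"): 0.5,
    ("m_first", "pure NOMA"): 0.7,
    ("m_first", "OMA"): 0.9,
    ("n_first", "hybrid NOMA"): 0.6,
    ("n_first", "pure NOMA"): math.inf,  # infeasible decoding-order case
    ("n_first", "OMA"): 0.8,
}
```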
Remark 2. Appendix A provides the proof for the case 1 n,φ = 1. The proof for the case 1 n,φ = 0 follows similarly and can be found in the previous work [21]. Thus, the proof for the case 1 n,φ = 0 is omitted for this and the following two sub-problems.
In this subsection, the optimal power allocation for the hybrid NOMA scheme is obtained when t r,φ is fixed, and then the optimization of t r,φ is further studied to minimize E tot n,φ in the following subsection.

B. Time Scheduling
The aim of this subsection is to find the optimal allocation of the second time duration t r,φ, which is solely utilized by U n,φ for OMA transmission. As stated in Theorem 1, the optimal power allocation for the hybrid NOMA scheme is given as a function of t r,φ and β φ. Hence, by fixing β φ, (P1) is rewritten as the time scheduling problem (P3).

Proposition 1. The offloading energy consumption in (21a) is monotonically decreasing with respect to t r,φ for both the 1 n,φ = 1 and 1 n,φ = 0 cases. To minimize the energy consumption, the optimal time allocation is to schedule the entire available time before the deadline τ n,φ, i.e., t * r,φ = τ n,φ − τ m,φ.

Proof. Refer to Appendix B.
By assuming all the data is offloaded to the MEC server, the following lemma studies the uplink transmission energy efficiency of the two hybrid NOMA-MEC schemes for 1 n,φ = 0 and 1 n,φ = 1.
Lemma 1. Assume all data are offloaded to the MEC server, i.e., β φ = 1. Then the solution in (18) for the case 1 n,φ = 0 has higher energy consumption than the solution in (14) for the case 1 n,φ = 1.

Proof. Without considering local computing, the energy consumption for (14) can be written as E 1, and the energy consumption for the case (18) is given as E 2. To prove that E 2 ≥ E 1, the inequality can be rearranged in terms of an auxiliary function ζ(x). Since ζ(x) is monotonically decreasing and τ m,φ < τ n,φ, the required inequality holds.

C. Offloading Task Assignment
In this subsection, we focus on the optimization of the task assignment coefficient for U n,φ in each group φ.
Given the optimal power allocation and time arrangement, (P1) is reformulated as the task assignment problem (P4).

Proposition 2. Problem (P4) is convex, and the optimal task assignment coefficient is characterized by the three optimal power allocation schemes for the hybrid NOMA model in (14), (15), and (18), expressed in terms of the single-valued Lambert W function W(·), where the constants z 1,φ and z 2,φ are determined by the adopted power allocation scheme.

Proof. Refer to Appendix C.

Remark 3. Problem (P4) is the lowest level of the proposed multilevel programming method, which provides three task assignment solutions corresponding to the three power allocation schemes (14), (15), and (18), respectively. The final solution to the energy minimization problem (P1) can be obtained by substituting the optimal task assignment into the corresponding power allocation schemes. Then the most energy-efficient scheme is selected among (14), (15), and (18) by comparing the numerical energy consumption of each scheme.
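Evaluating the closed-form task assignment requires the principal branch of the Lambert W function, i.e., the solution of w e^w = x. Where no library routine is available, it can be computed by Newton's method, as in this self-contained sketch (the iteration itself is standard numerics, not taken from the paper):

```python
import math

def lambert_w(x, tol=1e-12):
    """Principal-branch Lambert W for x >= 0, solving w * exp(w) = x
    by Newton's method with log1p(x) as the initial guess."""
    w = math.log1p(x)
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))  # Newton step for f(w) = w e^w - x
        w -= step
        if abs(step) < tol:
            break
    return w
```

For example, W(e) = 1 and W(0) = 0; in SciPy the equivalent call would be `scipy.special.lambertw(x, 0).real`.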

IV. DEEP REINFORCEMENT LEARNING FRAMEWORK FOR USER GROUPING
In the previous section, it was assumed that the user grouping is given, and the optimal resource allocation was obtained in closed form. The optimal user grouping can be found by exploring all possible grouping combinations and selecting the one with the lowest energy consumption. Although this method obtains the optimal user pairing scheme, the complexity of the exhaustive search is high, and it cannot output real-time decisions.
Therefore, we propose a fast-converging DQN-based training algorithm to obtain the user grouping policy. In the following subsection, the state space, action space and reward function are defined; subsequently, the training algorithm for the user grouping policy is provided.

A. The DRL Framework
The optimization of user grouping is modeled as a DRL task, where the base station is treated as the agent interacting with the environment, which is defined as the MEC network. In each time slot t, the agent takes an action a t from the action space A to assign users into pairs according to the policy learned by the DNN.
The action taken under the current state s t results in an immediate reward r t, which is obtained at the beginning of the next time slot, and the environment then moves to the next state s t+1. In this problem, the aforementioned terms are defined as follows.
1) State Space: The state s t ∈ S is characterized by the current channel gains and offloading deadlines of all users, since the user grouping is mainly determined by these two factors.
2) Action Space: At each time slot t, the agent takes an action a t ∈ A, which contains all possible user grouping decisions j k,φ, where j k,φ = 1 indicates that U k is assigned to group φ. In our proposed scheme, each group can only be assigned two different users.
3) Rewards: The immediate reward r t is defined as the negative of the sum energy consumption over all groups after choosing the action a t under state s t, i.e., $r_t = -\sum_{\phi=1}^{\Phi} E^{tot}_{n,\phi}$, where the numerical result of the energy consumption in each group is obtained by solving problem (P1). The aim of the agent is to find an optimal policy that maximizes the long-term discounted reward $\sum_{t} \gamma^{t} r_t$, where γ ∈ [0, 1] is the discount factor, which balances the immediate reward against the long-term reward.
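The discounted objective can be evaluated from a finite reward trajectory by the usual backward recursion; a minimal sketch, where the rewards stand in for the negative per-slot group energies:

```python
def discounted_return(rewards, gamma):
    """Long-term discounted reward sum_t gamma^t * r_t, computed by
    the backward recursion g <- r_t + gamma * g."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For instance, the trajectory [1, 1, 1] with γ = 0.5 yields 1 + 0.5 + 0.25 = 1.75.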

B. DQN-based NOMA User Grouping Algorithm
To solve this reward maximization problem, a DQN-based user grouping algorithm is proposed in this paper, as illustrated in Fig. 2. In conventional Q-learning, a Q-table is built to describe the quality of each action for a given state, and the agent chooses actions according to the Q-values to maximize the reward. However, obtaining Q-values for all state-action pairs is slow if the state and action spaces are large. Therefore, to speed up the learning process, instead of generating and processing all possible Q-values, DNNs are introduced to estimate the Q-values based on the DNN weights. We utilize a DNN, denoted the Q-network, to estimate the Q-value, where the Q-estimate is represented as Q(s t, a t; θ), and an additional DNN with the same settings generates the target network with Q(s t, a t; θ −) for training, where θ and θ − are the weights of the two DNNs.
We adopt an ε-greedy policy with 0 < ε < 1 to balance the exploration of new actions and the exploitation of known actions: the agent either randomly chooses an action a t ∈ A with probability ε, to avoid sticking to non-optimal actions, or picks the best-known action with probability 1 − ε [29]. Generally, the threshold ε is fixed, which means the probability of choosing a random action remains the same throughout the whole learning period. However, this causes fluctuation after the algorithm converges and may even lead to divergence in extreme cases. In this paper, we adopt an ε-greedy decay scheme, in which a large initial value ε + is used at the beginning and then decays with each training step down to a small probability ε −. This policy encourages the agent to explore never-selected actions at the beginning, and then to take more reward-guaranteed actions once the network has converged.
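The ε-greedy decay policy described above can be sketched as follows; the multiplicative decay rate of 0.995 is an illustrative choice, not a value from the paper:

```python
import random

def epsilon_greedy(q_values, eps):
    """With probability eps pick a random action index, otherwise
    pick the argmax of the estimated Q-values."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(eps, eps_min, decay=0.995):
    """Multiplicatively decay epsilon toward the floor eps_min after
    each training step (decay rate is an assumed example value)."""
    return max(eps_min, eps * decay)
```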
The target network is updated only every fixed number of iterations, which provides a relatively stable label for the estimation network. The agent stores the tuples (s t, a t, r t, s t+1) as experiences in a memory buffer R, and a mini-batch of samples from the memory is fed into the target network to generate the Q-value labels.

Algorithm 1 DQN-based User Grouping Algorithm
1: Parameter initialization.
2: Initialize the Q-network Q(s i, a i; θ) and the target network Q(s i, a i; θ −).
3: Initialize the replay memory R with size |R|, and the memory counter.
4: for each episode do
5:   for time step = 1, 2, ..., N ts do
6:     Input state s t into the Q-network and obtain the Q-values of all actions.
7:     Take the user grouping decision as action a t based on the ε-greedy decay policy.
8:     Receive the reward r t based on (35) and observe the next state s t+1.
9:     Store the experience tuple (s t, a t, r t, s t+1) into the memory R.
10:    if memory counter > |R| then
11:      Remove the oldest experiences from the beginning.
12:    end if
13:    Randomly sample a mini-batch of experience tuples (s t, a t, r t, s t+1) with the given batch size and feed it into the DNNs.
14:    Update the Q-network weights θ by minimizing the loss function.
15:    Replace θ − by θ after every δ up steps.
16:  end for
17: end for

Hence, the loss function for the Q-network can be expressed as the mean squared error between the Q-estimates and the target labels. The Q-network is trained by minimizing the loss function to obtain the new θ, and the weights of the target network are updated every δ up steps by replacing θ − with θ. The whole DQN-based user grouping framework is summarized in Algorithm 1.
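The label generation and loss above follow the standard DQN recipe: the target network supplies y t = r t + γ max a′ Q(s t+1, a′; θ −), and the Q-network is trained on the mean squared TD error. A minimal scalar sketch (the Q-values below are placeholders, not network outputs):

```python
def td_target(reward, next_q_values, gamma, done=False):
    """DQN label y_t = r_t + gamma * max_a' Q_target(s_{t+1}, a'),
    or just r_t at episode termination. next_q_values come from the
    slow-moving target network."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(q_estimates, targets):
    """Mean squared TD error minimized when training the Q-network."""
    return sum((q - y) ** 2 for q, y in zip(q_estimates, targets)) / len(targets)
```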

V. SIMULATION RESULTS
In this section, several simulation results are presented to evaluate the convergence and effectiveness of the proposed joint resource allocation and user grouping scheme. Specifically, the impacts of the learning rate, the number of users, the offloading data length, and the delay tolerance are investigated. Moreover, the proposed hybrid SIC scheme is compared with several benchmarks, including the QoS-based SIC scheme and other NOMA and OMA schemes. The system parameters are set up as follows. All users are distributed uniformly at random in a disc-shaped cell with the base station located at the cell center. The total number of users is six, and each of them has a task containing 2 Mbit of data for offloading. As aforementioned, the delay-sensitive primary user U m,φ is allocated a predefined power P m,φ = 1 W for all groups in the simulation. The delay tolerance of each user is drawn randomly from [0.2, 0.3] seconds. In addition, the remaining system parameters are listed in Table I.
To implement the DQN algorithm, the two DNNs are configured with the same settings, where each of them consists of four fully connected layers, and two of which are hidden layers with 200 and 100 neurons respectively.
The activation function adopted for all hidden layers is the Rectified Linear Unit (ReLU), i.e., f(x) = max(0, x), and the final output layer is activated by Tanh, whose range is (−1, 1) [30]. The adaptive moment estimation (Adam) optimizer is used to learn the DNN weights θ with the given learning rate [31]. The rest of the hyperparameters are listed in Table II. All simulation results are obtained with PyTorch 1.7.0 and CUDA 11.1 on the Python 3.8 platform.
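Under the stated architecture (two hidden layers of 200 and 100 neurons, ReLU activations, Tanh-activated output), the Q-network might be defined as below. The state and action dimensions are illustrative assumptions, and the layer count follows our reading of "four fully connected layers" as input, two hidden, and output:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network sketch matching the described settings: hidden layers of
    200 and 100 neurons with ReLU, and a Tanh output bounded in (-1, 1).
    state_dim and n_actions are hypothetical, not values from the paper."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, n_actions), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)
```

The target network would simply be a second `QNetwork` instance whose weights θ − are periodically copied from θ.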

A. Convergence of Framework
In this part, we evaluate the convergence of the proposed DQN-based user pairing algorithm. Fig. 3 compares the convergence of the average reward per episode, described by the average energy consumption, under different learning rates. The learning rate controls how much the DNN weights are adjusted in response to the network loss, and we set the learning rate to [0.1, 0.01, 0.001] to observe its influence on the convergence.
The network with learning rate 0.1 converges slightly faster than the one with 0.01, and both converge much faster than the network with learning rate 0.001. However, although the large learning rate of 0.1 converges quickly, it overshoots the minimum and therefore yields higher energy consumption after convergence than the other two curves. Therefore, the most suitable learning rate for our proposed DQN algorithm is 0.01, which is adopted for the rest of the simulation results in this paper.
The hybrid SIC schemes have lower energy consumption than the OMA scheme. In Fig. 7, the energy consumption is presented as a function of the offloading data length. As the data length increases, the average energy consumption also grows. Our proposed hybrid SIC scheme reduces the energy consumption significantly, especially when the data length is large. Moreover, Fig. 8 reveals the energy consumption comparison versus the maximum delay tolerance of U n,φ. With tight deadlines, the energy consumption of the hybrid SIC scheme is much lower than that of the OMA scheme, and a larger portion of the data is processed locally to save energy compared to the full offloading curve.

VI. CONCLUSION
This paper studied the resource allocation problem for a NOMA-assisted MEC network to minimize the energy consumption of users' offloading activities. The hybrid NOMA scheme has two durations in each time slot: NOMA is adopted to serve both users simultaneously during the first duration, and a dedicated time duration is then scheduled for the delay-tolerant user to offload its remaining data solely by OMA. With the user grouping fixed, the non-convex problem was decomposed into three sub-problems, namely power allocation, time allocation and task assignment, all of which were solved optimally by studying their convexity and monotonicity.
The hybrid SIC scheme selects the SIC decoding order dynamically through a numerical comparison of the energy consumption of the different decoding sequences. Finally, building on the solutions of those sub-problems, we proposed a DQN-based user grouping algorithm to obtain the user grouping policy and minimize the long-term average offloading energy consumption. Comparisons with various benchmarks in the simulation results demonstrated the superiority of the proposed NOMA-MEC scheme in terms of energy consumption.

A. Proof of Theorem 1
By fixing t r,φ and β φ , the above problem in the case 1 n,φ = 1 can be rewritten as: It is evident that the problem is convex, and by rearranging (40d) as the Lagrangian function can be obtained as follows: where λ [λ 1 , λ The Karush-Kuhn-Tucker (KKT) conditions [32] can be obtained as The total energy consumption can be expressed as: x 3 ≥ 0, ∀x > 0.
Therefore, dE H1/dt r,φ ≤ 0, i.e., E H1 is monotonically decreasing. Hence, the larger t r,φ is, the less energy is consumed, and the optimum is attained at t * r,φ = τ n,φ − τ m,φ. For the power allocation scheme in (15), the energy consumption E H2 is given in (55). Since g 2 (x) is monotonically increasing for x > 0 and g(t r,φ) ≤ g(∞) = 0, it follows that dE H2/dt r,φ ≤ 0. Similar to the previous case, the energy function is monotonically decreasing with respect to t r,φ, and the optimal time allocation is t * r,φ = τ n,φ − τ m,φ.
In (P1), 1 n,φ = 1 indicates that U m,φ is decoded first, and vice versa. Constraints (12b) and (12c) ensure that all users complete offloading the designated amount of data within the given deadlines. Constraint (12e) limits the additionally scheduled time duration so that it does not exceed U n,φ's delay tolerance. Constraints (12d) and (12f) set the feasible ranges of the transmission powers and the offloading coefficient. The problem (P1) is non-convex due to the multiplication of several variables. Therefore, in the following section, we propose a multilevel programming algorithm to solve the energy minimization problem optimally by obtaining closed-form solutions.

Fig. 2 :
Fig. 2: A demonstration of the proposed DQN-based user grouping scheme in the NOMA-MEC network.

Fig. 4 illustrates the effectiveness of the DQN user grouping algorithm proposed in this paper. By setting the number of users to [6, 8, 10], the algorithm shows similar behavior in all three cases: the average energy consumption decreases over training and converges within the first 20 episodes. Moreover, more users in the network result in higher energy consumption, and the algorithm shows superior performance over the

Fig. 7: Average energy consumption versus the offloading data length.

Fig. 8: Average energy consumption versus the maximum delay tolerance of U n,φ.