Two Tier Slicing Resource Allocation Algorithm Based on Deep Reinforcement Learning and Joint Bidding in Wireless Access Networks

Network slicing (NS) is an emerging technology that enables network operators to slice network resources (e.g., bandwidth, power, and spectrum) into different types of slices, so that the network can adapt to the different application scenarios of 5G: enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable and low-latency communications (URLLC). To allocate these sliced resources effectively to users with different needs, the allocation of network resources must be managed carefully. In practical resource allocation problems, the resources of the base station (BS) are limited and each user's demand for mobile services is different. To better deal with the resource allocation problem, more effective methods and algorithms have emerged in recent years, such as the bidding method, deep learning (DL) algorithm, ant colony algorithm (AG), and wolf colony algorithm (WPA). This paper proposes a two tier slicing resource allocation algorithm based on deep reinforcement learning (DRL) and joint bidding in wireless access networks. Wireless virtualization technology divides mobile operators into infrastructure providers (InPs) and mobile virtual network operators (MVNOs). This paper considers a radio access network scenario in which multiple users share the aggregated bandwidth of a single base station, introduces MVNOs to fully utilize the base station resources, and divides the resource allocation process into two tiers. The proposed algorithm takes into account both the utilization of BS resources and the service demand of mobile users (MUs). In the upper tier, each MVNO is treated as an agent, and a combination of bidding and a deep Q network (DQN) allows the MVNO to get more resources from the base station. In the lower tier, each MVNO distributes the received resources to the users connected to it, again using the Dueling DQN method for iterative learning to find the optimal solution. The results show that, in the upper tier, the total system utility and revenue obtained by the proposed algorithm are about 5.4% higher than with Double DQN and about 2.6% higher than with Dueling DQN; in the lower tier, the user service quality obtained with the proposed algorithm is more stable, the system utility and SE are about 0.5–2.7% higher than with DQN and Double DQN, and convergence is faster.


Introduction
With the advent of the 5G era, the demand for and application of mobile traffic and wireless networks have increased dramatically. This huge demand has driven the convergence of multiple traditional and emerging communications technologies to form the 5G mobile communications system. 5G mobile communication systems employ new technologies and new network architectures that enable them to go beyond traditional communications as a resource game issue, which needs to consider the dynamic competing behaviors of users to maximize the overall satisfaction of users [22].
At present, a lot of research work has been done based on these two methods. Ref. [23] used an allocation strategy of orthogonal and multiplexed subchannels to ensure inter-slice isolation and solved the problem of minimizing system power in the bidirectional transmission link. Ref. [24] proposed a new auction-based shared resource and revenue optimization model. Ref. [25] proposed a stochastic game model to solve the dynamic resource allocation problem of multi-user virtual enterprise networks, together with a blind approximation-based maximum likelihood estimation algorithm to solve the model, thus overcoming the cost of information exchange and computation; however, the model does not consider user-specific demands. Ref. [26] mathematically analyzed the joint optimization of access control and bandwidth allocation for scenarios with multiple BSs and multiple network slices. However, the solution assumes that different users have the same fixed demand rate, which is unlikely in practice. Ref. [27] proposed an LSTM-based prediction scheme and used a DRL-based power allocation algorithm to solve this problem. In practical scenarios, however, different types of user demands need to be considered when solving the network resource allocation problem. Ref. [28] proposed an optimization framework based on a resource pricing strategy to maximize resource efficiency and customer profit by studying the relationship between profit maximization and resource efficiency. Ref. [29] proposed an AC priority algorithm that meets the demands of high-demand, high-priority slices to improve the overall resource demand satisfaction rate. Ref. [30] used game theory to analyze the relationship between InPs and users in order to optimize the allocation problem and solve the communication problem during peak hours. Ref. [31] used communication games and learning mechanisms to solve the distributed problem of wireless NS resources, but without considering the deployment of users with different types of demand. Ref. [32] proposed an online genetic slicing policy optimizer for inter-slice resource management, but it ignores the relationship between the SLA and the resources required on the different types of slices. Ref. [33] proposed a novel channel information absent Q-learning (CIAQ) algorithm to speed up training, but this algorithm is only an auxiliary method for solving the resource allocation problem. Ref. [34] considered the problem of allocating different types of resources (bandwidth, cache, backhaul capacity) to network service tenants based on user demand and proposed a mathematical solution, but when the simulation parameters are increased proportionally, the optimization problem becomes difficult to solve. Ref. [35] used a DRL method for energy control in a UAV scenario. Ref. [36] proposed a DNAF-based DQL merging method that improves the convergence speed of the algorithm. Ref. [37] proposed an HA-DRL algorithm that uses heuristic functions to optimize the exploration of the action space.
The literature above proposes bidding methods and DRL methods to solve the problem of allocating BS resources to different users. However, some methods do not consider that users have different specific needs, so that a proportion of users suffers from poor service quality or resources are wasted. Other solutions consider only the user's service satisfaction rate and ignore the total bandwidth utilization rate of the BS, which also wastes wireless network resources. To solve these problems, this paper proposes a two-tier resource allocation model that considers both the BS resource utilization and the user service satisfaction rate. Specifically, this paper decomposes a single-objective optimization problem into two-level sub-objective optimization problems, and uses DRL to solve the two-level resource allocation optimization problem while accounting for the inconsistency between the upper and lower value spaces. In the upper tier model, MVNOs request resources from the BS by bidding, and this paper uses a combination of bidding and Dueling DQN to solve the optimization problem of this tier. In the lower tier model, each MVNO allocates the resources received from the BS to its users, and the service satisfaction rate of the users is set; as in the upper tier, the lower tier model is optimized using Dueling DQN. The main contributions of this paper are as follows.
(1) First, a two tier resource allocation problem in wireless NS is proposed. The upper tier MVNOs will submit bid prices to the InP for wireless resources. The InP will further allocate physical resources to the MVNOs based on the bid values of the MVNOs. Each MVNO will then use the wireless resources allocated by the network to serve its mobile subscribers.
(2) Second, an algorithm based on Dueling DQN and joint bidding is used to solve the upper tier resource allocation optimization problem. In this paper, the utility of each MVNO is obtained by calculating the downlink transmission rate of its users after the bandwidth is obtained, and the utility function of the whole system is the weighted sum of the upper tier benefits and the lower tier utility function. This ensures that the BS resources are allocated to the maximum extent possible so that the service demand of the users is met more efficiently.
(3) Third, this paper presents the mathematical analysis of the proposed two-tier model and algorithm with the corresponding parameters, and shows how bidding can be used in conjunction with Dueling DQN. A penalty function is proposed to prevent the MVNOs from overbidding, and an evaluation function represents the revenue of each MVNO. This paper considers a radio access network scenario with multiple users sharing aggregated bandwidth under a single BS, where users are randomly located within the range of the BS and have different service demands, the BS does not have direct access to the channel information and service demand information of the users, and each MVNO manages the users in a sub-region. Future research can take changes in user location and service demand into consideration, in order to get closer to actual communication scenarios.
The rest of the paper is organized as follows. Section 2 presents the two-tier model proposed in this paper with its mathematical analysis. Section 3 presents the solution algorithm and the relevant mathematical background, and details the corresponding parameters when the Dueling DQN and DQN algorithms are used in the two-tier model. The simulation process and results are given in Section 4, together with a comparative analysis. Section 5 concludes the paper and gives an outlook.

System Model and Problem Formulation
In this part, we consider a downlink scenario with a single BS, as shown in Figure 1. This single BS is divided into a physical BS and a set of MVNOs, $M = \{M_1, M_2, \ldots, M_m\}$. Each MVNO has $j$ connected users, $U_m = \{u_1^m, u_2^m, \ldots, u_j^m\}$, and provides specific mobile services to them. The BS has resources (shared aggregated bandwidth) $C$. Each MVNO is required to bid for resources from the BS according to the demands of its connected users and to allocate the resources received from the BS to these users. In this paper, the SLA satisfaction rate (SSR) is used to represent the quality of experience (QoE) of the users. The core problem of this paper is how to schedule among the MVNOs so as to satisfy the demands of the connected users and maximize the total profit of the MVNOs. Moreover, the resources of the BS are virtualized and sliced to meet the demands of the users. The resource allocation problem after NS is divided into two tiers.

Upper Tier Model
In the upper tier model, each MVNO decides the required wireless bandwidth and estimates a bid value to submit to the InP, based on the number and the QoS requirements of the users connected to it. The InP allocates a proportion of its resources (bandwidth) to each MVNO based on the MVNO's bid value, which means that the InP allocates the largest part of the bandwidth to the MVNO that submits the highest bid [6]. The resources allocated by the BS to the $m$th MVNO are denoted as $c_m$, and the resources allocated by the $m$th MVNO to its users are denoted as $c_j^m$. Each MVNO collects the minimum rate demand $v_{j,0}^m$ and the maximum rate demand $v_{j,1}^m$ of its linked users and estimates its bid from these demands.

Each MVNO obtains the minimum rate demand $v_{j,0}^m$ and the maximum rate demand $v_{j,1}^m$ of its users, and from them its bid value $b_m$.
The BS allocates resources $c_m(b)$ to the MVNOs in proportion to their bids. To prevent MVNOs from excessively increasing their bids, an evaluation function $y_m(c_m(b))$ is established, together with a penalty function $q_m$ that reduces the profit of an MVNO if it increases its bid excessively, and α is represented by function (5).
The optimization problem of the upper tier model is to maximize the weighted sum of the benefits and utility of all MVNOs, i.e.,
$$\max F = \sum_{m \in M} f_m + \omega \sum_{m \in M} y_m(c_m(b))$$
subject to constraints (6)–(8). Constraint (6) ensures the segregation of the resources allocated to different MVNOs. Since the bandwidth of the BS is limited, constraint (7) ensures that the bandwidth allocated to all MVNOs does not exceed the total bandwidth of the BS, i.e., $\sum_{m \in M} c_m \le C$, and constraint (8) means that the sum of the bandwidth allocated by each MVNO to its connected users cannot be greater than the bandwidth it received from the BS, i.e., $\sum_{j} c_j^m \le c_m$. The problem of each MVNO obtaining resources by bidding can also be solved by DQN; the exact process is described later.
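The proportional allocation and the capacity constraint described above can be illustrated with a small sketch. It assumes the simplest reading of "in proportion to their bids", namely $c_m = C \cdot b_m / \sum_k b_k$; the function names and the example bids are hypothetical and are not taken from the paper's Equation (2).

```python
def allocate_by_bids(bids, total_bandwidth):
    """Split total_bandwidth among MVNOs in proportion to their bids.

    Assumption: c_m = C * b_m / sum(b); the paper's exact Equation (2)
    is not reproduced here.
    """
    total_bid = sum(bids)
    if total_bid <= 0:
        raise ValueError("at least one bid must be positive")
    return [total_bandwidth * b / total_bid for b in bids]


def satisfies_capacity(allocations, total_bandwidth, tol=1e-9):
    """Check that the shares together do not exceed the BS bandwidth."""
    return sum(allocations) <= total_bandwidth + tol


bids = [4.0, 3.0, 2.0, 1.0]             # hypothetical bids of four MVNOs
shares = allocate_by_bids(bids, 10.0)   # 10 MHz total, as in the simulation setup
```

With this rule the MVNO with the highest bid always receives the largest share, and the shares sum to the total bandwidth, so the capacity constraint holds by construction.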

Lower Tier Model
Each MVNO allocates the resources received from the InP in the upper tier to its connected users. The main task in the lower tier model is to find a suitable bandwidth allocation scheme that maximizes the utility function of each MVNO, $f_m$, which can be expressed as a weighted sum of $SE_m$ and $SSR_{u_j^m}$. The computation of $SE_m$ and $SSR_{u_j^m}$ is described below. The downlink rate $v_j^m$ from the BS to the $j$th user $u_j^m$ linked to the $m$th MVNO is calculated from Shannon's formula.
$u_j^m$ denotes the $j$th user of the $m$th MVNO, and $SNR_{u_j^m}$ is the signal-to-noise ratio between the BS and $u_j^m$.
The SNR depends on the fading gain of the channel between the BS and $u_j^m$, the transmitted power $P$, and the one-sided noise spectral density $N_0$.
$SSR_{u_j^m}$ denotes the SSR of the $j$th user connected to the $m$th MVNO. In this paper, the SSR is expressed as the ratio of the number of valid packets successfully accepted by the user to the total number of packets sent by the MVNO. $q_j^m$ denotes a packet successfully accepted by the user $u_j^m$, and the binary variable $\alpha_{q_j^m}$ denotes whether the accepted packet meets the downlink transmission rate that is preset for the user $u_j^m$ according to the SLA.
The optimization objective of the lower tier model is to maximize the total utility function $f_m$ of each MVNO, where $f_m$ is a weighted sum of SE and SSR. $\rho$ and $\phi = \{\phi_1, \phi_2, \ldots, \phi_s\}$ denote the importance weights of SE and SSR, respectively.
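As an illustration of the quantities above, the following sketch computes a Shannon rate, a packet-based SSR, and a weighted utility. It is a sketch under stated assumptions, not the paper's exact formulation: the form `rho * SE + sum(phi_s * SSR_s)` and the definition of SE as total rate divided by bandwidth are assumptions, and all function names are hypothetical.

```python
import math


def shannon_rate(bandwidth_hz, snr_linear):
    """Downlink rate in bit/s from Shannon's formula: B * log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def spectral_efficiency(rates_bps, total_bandwidth_hz):
    """Assumed SE definition: sum of user rates per unit of bandwidth."""
    return sum(rates_bps) / total_bandwidth_hz


def ssr(valid_packets_accepted, packets_sent):
    """SSR as the ratio of valid accepted packets to packets sent."""
    return valid_packets_accepted / packets_sent if packets_sent else 0.0


def mvno_utility(se, ssr_per_service, rho, phi):
    """Assumed weighted sum: rho * SE + sum_s phi_s * SSR_s."""
    return rho * se + sum(w * s for w, s in zip(phi, ssr_per_service))


rate = shannon_rate(1e6, 3.0)   # 1 MHz at linear SNR 3
utility = mvno_utility(se=300.0, ssr_per_service=[1.0, 1.0, 1.0],
                       rho=0.01, phi=[1.0, 1.0, 1.0])
```

The weights in the usage lines are the values quoted in the simulation section ($\rho = 0.01$, $\phi = [1, 1, 1]$); the SE and SSR inputs are illustrative only.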
Notably, this optimization process can be analyzed as a Markov decision process, but solving (15) directly is difficult, and neither traditional assignment methods nor the Q-learning algorithm yields a good solution quickly. Fortunately, DRL is well suited to such problems; the mapping to the Dueling DQN algorithm is described later.

Deep Reinforcement Learning
DQN is a typical DRL algorithm that is well suited to computationally demanding problems and decision problems. In DRL, an agent controls the learning process. The agent generates new data through constant trial-and-error interaction with the environment, and then learns from this data a policy that maximizes the cumulative expected reward while finding the best action for a given state. We can model the agent's interaction with the environment as a Markov decision process $(S, A, R, P, \gamma)$.
The parameters are explained as follows: $S$ is the state space containing the current state $s$ and the new state $s'$; $A$ is the action space containing the current action $a$ and the new action $a'$; the policy $\pi(\cdot|s)$ determines how state $s$ is mapped to an action; $R$ is the reward obtained by performing action $a$ in state $s$ according to the policy $\pi(\cdot|s)$; $P(\cdot|s, a)$ is the transition probability; and $\gamma$ is a discount factor.
Additionally, the state value function $V^{\pi}(s)$ can be obtained according to $\pi(\cdot|s)$ in state $s$.
Similarly, the action value function $Q^{\pi}(s, a)$ is obtained by executing action $a$ in state $s$ according to the policy $\pi(\cdot|s)$.
The interaction between the agent and the environment proceeds as follows: the agent gets an observation from the environment as state $s$ and feeds $s$ into the neural network to get all $Q^{\pi}(s, a)$; it then uses the $\varepsilon$-greedy strategy to select an action from $Q^{\pi}(s, a)$, and the environment returns a reward and the next observation based on this action. Finally, the agent is updated according to the reward given by the environment using Equation (17).
DQN is based on DL: neural networks with parameters $\theta$ are added for parameter updating and action selection. The Q-value function network is updated in real time, and the target Q-value function network is updated every fixed number of iterations. $Q(s, a; \theta)$ denotes the value function with parameters $\theta$; the optimal parameters $\theta$ are obtained by minimizing the squared TD error according to Equation (18), so that $Q(s, a; \theta) = Q^{*}(s, a)$.
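The target Q-value given by the target Q-value function network $\hat{Q}$ is, in the standard DQN form,
$$y = r + \gamma \max_{a'} \hat{Q}(s', a')$$
and the loss function $L(\theta)$ of DQN is
$$L(\theta) = \mathbb{E}\left[\left(y - Q(s, a; \theta)\right)^2\right]$$
Dueling DQN improves on the network structure of DQN by dividing the Q value into two parts, one for the state value function and one for the advantage function:
$$Q^{\pi}_{Dueling}(s, a) = V^{\pi}(s) + A^{\pi}(s, a)$$
$V^{\pi}(s)$ does not depend on the action $a$ and returns a single state value, while $A^{\pi}(s, a)$ depends on both the action and the state. $Q^{\pi}_{Dueling}(s, a)$ can be expressed in more detail as
$$Q^{\pi}_{Dueling}(s, a; \theta, \alpha, \beta) = V^{\pi}(s; \theta, \beta) + A^{\pi}(s, a; \theta, \alpha)$$
The parameters $\theta$ are shared by the two function networks, and $\alpha$ and $\beta$ are their exclusive parameters. To make the two functions identifiable, the advantage function is generally centered, that is:
$$Q^{\pi}_{Dueling}(s, a; \theta, \alpha, \beta) = V^{\pi}(s; \theta, \beta) + \left(A^{\pi}(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A^{\pi}(s, a'; \theta, \alpha)\right)$$
where $|A|$ is the number of actions in the action space $A$.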
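The centered dueling aggregation and the target Q-value can be sketched in a few lines of plain Python. This is a minimal illustration of the standard formulas, assuming the value and advantage estimates are already available as numbers; the function names are hypothetical and no neural network is involved.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) with the mean-centered dueling aggregation."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]


def td_target(reward, next_q_values, gamma, terminal):
    """Target y = r for terminal steps, else r + gamma * max_a' Q_hat(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(target, q_value):
    """Per-sample loss (y - Q(s, a))^2 that gradient descent minimizes."""
    return (target - q_value) ** 2


q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
y = td_target(1.0, q, gamma=0.95, terminal=False)
```

Because the advantages are centered, the mean of the resulting Q values equals the state value, which is what makes the split into $V$ and $A$ identifiable.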

Two Tier Slicing Resource Allocation Algorithm Based on Dueling DQN and Joint Bidding
In actual communication, due to various factors, the channel information and service demands of users are private. To better meet user demand and to maximize the utilization of physical resources at the BS, MVNOs are added between the BS and the users. The MVNOs collect the users' demand information and channel status, bid for and obtain resources from the BS, and finally allocate the resources to their connected users. This paper maps the above problem to a Markov decision process, uses a bidding framework for the upper tier of the allocation process, and applies DRL in both tiers to solve the optimization problem, obtaining the optimal solution by iterative training.
Algorithm 1 combines DQN with the bidding framework to solve the optimization problem of the upper tier model. First, the bidding pool $B$ and the parameters of the neural network within the DQN (such as $Q$, $\theta$, $\alpha$, $\beta$, $\hat{Q}$, and $N$) are initialized. In the simulation, each MVNO obtains its bidding range in order to establish the bidding pool $B$: the total maximum and minimum resource demands of the MVNO's users are first estimated, represented by the maximum and minimum of the sum of the expected rates (set by the SLA) of all users connected to it. The maximum indicates the rate requirement of the MVNO if every user requests the service type with the maximum rate. After the rate requirements are converted to maximum and minimum bid values according to a specific ratio, the bid pool $B$ can be established. The upper tier uses the bid pool $B$ as the action space, and the best lower tier action corresponding to each upper tier action is found in the lower tier and stored in table A.
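The construction of the bid pool can be sketched as follows. The paper does not give the conversion ratio or the granularity of the pool, so `rate_to_bid` and `levels` are hypothetical parameters, and the evenly spaced discretization is an assumption made only for illustration.

```python
def build_bid_pool(min_rates, max_rates, rate_to_bid=1.0, levels=5):
    """Build a discrete bid pool between the minimum and maximum bid.

    min_rates / max_rates: per-user minimum and maximum SLA rates.
    rate_to_bid: assumed ratio converting a rate demand into a bid value.
    levels: assumed number of evenly spaced bid values in the pool.
    """
    low = sum(min_rates) * rate_to_bid
    high = sum(max_rates) * rate_to_bid
    if levels < 2 or high <= low:
        return [low]
    step = (high - low) / (levels - 1)
    return [low + k * step for k in range(levels)]


# Two hypothetical users with minimum rates 1 and 2 and maximum rates 3 and 4.
pool = build_bid_pool([1.0, 2.0], [3.0, 4.0], rate_to_bid=1.0, levels=5)
```

Each entry of the pool is one possible upper tier action of the MVNO; the table A mentioned above would then map every entry to the best lower tier allocation found for it.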

Algorithm 1
2: Initialize the action-value function $Q$, the target action-value function $\hat{Q}$, and the replay memory $D$ with capacity $N$;
3: Each MVNO $m \in M$ estimates the maximum total needed rate and the minimum total needed rate of its linked users, then creates the bidding pool $B$;
4: For $b_m$ in $B$ do
5:     Find the lower tier optimal allocation action and store it in table A;
6: End for
7: Randomly choose an action $a_t$, i.e., a bidding value $b_m \in B$, and the BS distributes $c_m$ to each MVNO according to (2);
8: Repeat
9:     For $t = 1$ to $T$ do
10:        Calculate the ratio of the allocated bandwidth to its required minimum rate, and take it as the current state $S = s$ of the last iteration;
11:        For $m = 1$ to $M$ do
12:            Each MVNO $m$ allocates the optimal bandwidth $c_j^m$ to its users according to table A;
13:            Each MVNO $m$ calculates $v_m$ by (9) and (10);
14:            Each MVNO $m$ calculates the penalty $q_m$ by (4);
15:            Each MVNO $m$ calculates the profit $y_m$ by (3) and gets the reward $r_m$;
16:        End for
17:        Calculate the total system utility $F$ according to (5);
18:        Calculate the total reward $r$;
19:        Choose an action $a_t$, i.e., a bidding value $b_m \in B$, according to the policy of DQN;
20:        The InP distributes $c_m$ to each MVNO according to (2);
21:        Get the state $S = s'$ after the selected action of this iteration;
22:        #Train DQN
23:        The agent, i.e., each MVNO, inputs $(s, a, s', r)$ into the DQN;
24:        The agent stores the transition $(s, a, s', r)$ in $D$;
25:        The agent samples a random minibatch of transitions $(s_j, a_j, s'_j, r_j)$ from $D$;
26:        Set $y_j = r_j$ if the episode terminates at step $j + 1$; otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(s'_j, a')$;
27:        The agent performs a gradient descent step on $(y_j - Q(s_j, a_j; \theta))^2$ with respect to the network parameters $\theta$;
28:        Every fixed number of steps, reset $\hat{Q} = Q$;
29:    End for
30: Until the predefined maximum number of iterations has been completed.
Before the iteration starts, an upper tier action is randomly selected to generate the initial state. Each iteration consists of getting the current state $s$, selecting the action $a$ according to the policy $\pi(\cdot|s)$ in the current state $s$ and generating the next state $s'$, calculating the utility function $F$, and calculating the reward $r$. At the beginning of each iteration, the current state $s$ is available. The action in each iteration is selected according to the DQN policy, i.e., the $\varepsilon$-greedy policy, which either selects an action at random or selects the best action according to $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$. The action $a = a_t$ of each iteration contains the bids of all MVNOs in this iteration, $a = \{b_1, b_2, \ldots, b_m\}$. The InP receives the bids $b_m$ from the MVNOs and divides the bandwidth resources proportionally among them, $c = \{c_1, c_2, \ldots, c_m\}$, according to Equation (2). Each MVNO allocates its bandwidth $c_m$ to its users and sums the users' rates $v_m$; from this, each MVNO obtains the ratio of the allocated bandwidth to its required minimum rate and takes it as the next state $s'$. The MVNO also constructs an action space when allocating bandwidth to users, and the optimal lower tier action $a_l$ corresponding to each upper tier action can be found from table A. Then, the MVNO derives the penalty $q_m$ from Equation (4) and calculates the profit value $y_m$ of this iteration from Equation (3), based on $v_m$ and $q_m$. When all MVNOs have performed the above steps in this iteration, the total utility function $F$ and the total reward $r$ of the system in this iteration are computed.
Finally, the $s$, $a$, $s'$, and $r$ generated by this iteration are input into the DQN for training. In DQN, the agent stores the transition $(s, a, s', r)$ of each iteration in the experience pool $D$, then samples a small random batch of transitions $(s_j, a_j, s'_j, r_j)$ from $D$ to train the parameters of the Q-value network, and finally updates the parameters of the target Q-value network by the loss function $L(\theta)$.
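The experience pool and the $\varepsilon$-greedy selection described above can be sketched with the Python standard library alone. The class and function names are hypothetical, the Q values are passed in as a plain list rather than produced by a network, and the capacity and batch size are illustrative.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity experience pool D; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        """Return a random minibatch without replacement."""
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action index, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.store(step, 0, step + 1, 1.0)   # dummy transitions (s, a, s', r)
batch = memory.sample(2)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Sampling transitions at random rather than in order breaks the correlation between consecutive iterations, which is the usual motivation for an experience pool in DQN.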
Algorithm 2 uses the Dueling DQN algorithm to solve the optimization problem of the lower tier model. As in Algorithm 1, the parameters ($Q$, $\theta$, $\hat{Q}$, and $N$) of the Dueling DQN neural network are first initialized, and each MVNO creates its lower tier action space $A_l$ after receiving the resources $c_m$ allocated by the BS. Before the iterations start, each MVNO randomly selects an action $a \in A_l$ from its lower tier action space and executes it. The action first divides the MVNO's resources into resource blocks for the three services and then allocates resources $c_j^m$ to the connected users; the number of packets successfully received by each user, $q_j^m$, is counted and taken as the state $s$. Then the iteration starts: the agent, i.e., the MVNO, gets the current state $s$ and chooses an action $a$ according to the Dueling DQN policy, i.e., the $\varepsilon$-greedy policy, which either selects an action at random or selects the best action according to $a_t = \arg\max_a Q(\phi(s_t), a; \theta, \alpha, \beta)$. After the allocation, the MVNO computes the state $s'$, the utility function $f_m$, and the reward $r$, and finally inputs $(s, a, s', r)$ into the Dueling DQN and trains the neural network until the predefined maximum number of iterations has been completed.
Algorithm 2
5: The MVNO randomly chooses an action $a \in A_l$ and performs $a$;
6: The MVNO allocates the bandwidth $c_j^m$ to the users connected to it;
7: Calculate $q_j^m$ as state $s$;
8: For $t = 1$ to $T$ do
9:     The agent gets the current state $s$;
10:    Choose an action $a \in A_l$ according to the policy of Dueling DQN;
11:    Calculate the total system utility $f_m$ according to (15);
12:    Calculate the total reward;
13:    The agent allocates the bandwidth to users and calculates the state after the selected action of this iteration as $s'$;
14:    #Train Dueling DQN
15:    The agent, i.e., each MVNO, inputs $(s, a, s', r)$ into the Dueling DQN;
16:    The agent stores the transition $(s, a, s', r)$ in $D$;
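One way to picture the lower tier action space $A_l$ is as the set of all ways to divide an MVNO's resource blocks among the three services. The paper does not state how $A_l$ is enumerated, so the following sketch is an assumption; the function name and the block count in the example are hypothetical.

```python
def service_block_splits(n_blocks, n_services=3):
    """Enumerate all ways to split n_blocks whole blocks among the services.

    Each result is a tuple with one block count per service; the counts
    always add up to n_blocks.
    """
    if n_services == 1:
        return [(n_blocks,)]
    splits = []
    for first in range(n_blocks + 1):
        for rest in service_block_splits(n_blocks - first, n_services - 1):
            splits.append((first,) + rest)
    return splits


# Hypothetical MVNO holding 4 resource blocks for three services.
actions = service_block_splits(4)
```

The number of such splits grows quadratically with the block count for three services, which gives a sense of why a learned policy is used instead of exhaustive search once many blocks are available.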

Simulation Results and Discussion
Compared with recently published literature, as shown in Table 1, this paper treats the allocation of sliced bandwidth resources as a two tier process and ensures the service quality of users' multiple service requirements. The simulation shows that DRL with joint bidding achieves good results.
Table 1.
[3]: no, yes, yes, no
[12]: yes, no, no, no
[38]: no, yes, yes, BER
[39]: no, yes, yes, no

Simulation Parameters
In the scenario considered in this paper, the maximum aggregated bandwidth provided by a single BS is 10 MHz, and the minimum bandwidth resource block is set to $r_{block} = 0.2$ MHz. Three types of services (i.e., VoLTE, eMBB, and URLLC) and four MVNOs are provided to the subscribers, and 100 registered subscribers are randomly located within an approximate circle of 40 m radius around the BS. The transmit power of the users is 20 dBm, and the transmit power of the BS is 46 dBm. The noise spectral density of the channel is −174 dBm/Hz under the given channel model. The minimum rate constraint is 51 kb/s for the VoLTE service, 0.1 Gb/s for the eMBB service, and 0.01 Gb/s for the URLLC service. The detailed simulation parameters are shown in Table 2. The service demand type of each user is random. MVNOs are set up between the BS and the users to pre-allocate the BS resources, and the users are connected to different MVNOs according to their locations. To demonstrate the feasibility and advantages of the proposed resource allocation algorithm, the following work is carried out in this paper.
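Some quantities implied by these parameters can be checked with a few lines of arithmetic. The sketch below uses only standard unit conversions; the helper names are hypothetical, and treating the noise power as the spectral density integrated over one resource block is an illustrative assumption, not a statement about the paper's channel model.

```python
import math


def dbm_to_watts(dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((dbm - 30) / 10)


def noise_power_dbm(density_dbm_per_hz, bandwidth_hz):
    """Noise power over a bandwidth: density + 10 * log10(B)."""
    return density_dbm_per_hz + 10 * math.log10(bandwidth_hz)


total_bandwidth_hz = 10e6    # 10 MHz aggregated bandwidth
block_hz = 0.2e6             # 0.2 MHz resource block

n_blocks = round(total_bandwidth_hz / block_hz)   # number of resource blocks
bs_power_w = dbm_to_watts(46)                     # BS transmit power in watts
block_noise_dbm = noise_power_dbm(-174, block_hz) # noise over one block
```

With these numbers the BS bandwidth corresponds to 50 resource blocks, and the thermal noise over a single 0.2 MHz block is roughly −121 dBm.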
First, the proposed model based on bidding and a two tier Dueling DQN algorithm is simulated in Python and compared with the Double DQN, DQN, and Q-learning algorithms. After plotting and comparing the data of the four algorithms, we conclude that the algorithm proposed in this paper is feasible and has some advantages over the other three. The curves obtained from this simulation and their comparative analysis follow.
During the simulation, a reward is calculated in each iteration to train the network parameters. In particular, in the upper tier model, we evaluated the joint bidding method with Double DQN and Dueling DQN, and compared it with the results of traditional DQN, Double DQN, Dueling DQN, and Q-learning. In the experiments, the learning rate of every algorithm, including the Dueling DQN network, is set to 0.01. The importance weights of the optimization objectives in formula (6) and formula (15) are set to $\rho = 0.01$, $\phi = [1, 1, 1]$, and $\omega = 0.1$. The discount factor $\gamma$ was experimentally set to 0.95.
Throughout the simulation, the 100 user locations are randomly distributed with the BS location as the origin, and the 4 MVNOs each manage one of four areas and collect the service demands of its users. As shown in Table 3, the users connected to MVNO-0 request 11 eMBB services, 9 VoLTE services, and 7 URLLC services; the users connected to MVNO-1 request 11 eMBB services, 8 VoLTE services, and 7 URLLC services; the users connected to MVNO-2 request 8 eMBB services, 6 VoLTE services, and 13 URLLC services; and the users connected to MVNO-3 request 2 eMBB services, 8 VoLTE services, and 7 URLLC services.

Simulation Results and Discussion
The resource allocation algorithm based on bidding and two-tier DRL proposed in this paper is divided into two tiers. We can see from Figure 2 that the QoE of the VoLTE service reaches 1 without optimization, because its required rate is very small (51 kb/s); providing a small part of the bandwidth for this service meets its requirements. From Figures 3 and 4, the QoE of the URLLC and eMBB services fluctuates because the rate requirements of these two services are large (0.1 Gb/s and 1 Gb/s). Nevertheless, the QoE of these two services stays between 0.96 and 1.0. Some outliers in later iterations are trial-and-error attempts made by the Dueling DQN algorithm to prevent over-optimization.
It can be seen in Figures 5 and 6 that the SE curves differ significantly from the QoE curves of the three services, and that the SE curve is more strongly correlated with the system utility curve than the three service curves are. However, for each MVNO, the system utility and SE improve significantly with increasing iterations, which confirms that the Dueling DQN algorithm is a suitable choice for the model optimization problem proposed in this paper. In MVNO-1, for example, the SE curve fluctuates strongly before 400 iterations; after 400 iterations, it has converged to the maximum value of 300 and tends to be stable, with a few low values that do not affect the overall trend. The reason for this behavior is that the training neural network parameters were set to be replaced every 200 iterations during the simulation.
The neural network parameters were in a relatively poor state when training first started, and most of the allocation actions obtained from the initial parameters and the selection strategy were random actions from the action space, so the curve showed substantial fluctuations at the beginning. Once the number of iterations reaches 400 and the neural network parameters in the Dueling DQN algorithm have improved, the subsequent allocations all appear to be better choices.
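The two mechanisms described above, mostly random early actions and periodic replacement of the target parameters, can be illustrated with a minimal sketch. This is not the simulation code of this paper: the single-state toy problem, the linear epsilon schedule, the reward values, the discount factor, and the learning rate are all hypothetical, and plain lists stand in for the outputs of the online and target networks. Only the replacement interval of 200 iterations is taken from the text.

```python
import random

TARGET_SYNC_EVERY = 200  # from the text: parameters replaced every 200 iterations


def epsilon_at(step, start=1.0, end=0.05, decay_steps=400):
    """Linearly anneal the exploration rate (hypothetical schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def train(num_steps=600, rewards=(0.2, 0.5, 0.9), gamma=0.5, lr=0.1, seed=0):
    """Toy single-state problem; lists stand in for network outputs."""
    rng = random.Random(seed)
    online_q = [0.0] * len(rewards)   # stand-in for the online network
    target_q = list(online_q)         # stand-in for the target network
    sync_steps = []
    for step in range(1, num_steps + 1):
        action = epsilon_greedy(online_q, epsilon_at(step), rng)
        reward = rewards[action] + rng.gauss(0.0, 0.05)
        # Bootstrap from the frozen copy, not from the online estimate.
        td_target = reward + gamma * max(target_q)
        online_q[action] += lr * (td_target - online_q[action])
        if step % TARGET_SYNC_EVERY == 0:
            target_q = list(online_q)  # periodic parameter replacement
            sync_steps.append(step)
    return online_q, target_q, sync_steps
```

With the default arguments the target copy is refreshed at steps 200, 400, and 600. In this toy setting the exploration rate is close to 1 at the start, so early actions are mostly random, which matches the qualitative explanation given for the early fluctuations.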
The changes in system utility, QoE, and SE for MVNO-1 with an increasing number of iterations using different methods are shown in Figures 7-11. Analyzing the QoE curves for the three service types (Figures 7-9), it can be seen that for the VoLTE service, the QoE values of all four methods are stable at 100%. For the URLLC and eMBB services, the QoE values of the four algorithms show occasional low values, but all of them are basically stable at 100%. However, the QoE curves of the three services obtained by the Dueling DQN algorithm are more stable and less volatile than those of the other three algorithms. It can be observed from the curves of system utility and SE (Figures 10 and 11) that the DRL algorithms improve significantly on the QL algorithm. For both SE and system utility, the curves obtained with the Dueling DQN algorithm have higher values than those of the other methods, and they converge and stabilize at the highest values (SE > 300, utility > 6). After 2200 iterations, the simulation data show that the SE obtained by the Dueling DQN algorithm is about 1% higher than that of the DQN algorithm, about 2.7% higher than that of the Double DQN algorithm, and about 76% higher than that of the QL algorithm.
The utility has also improved slightly. All four algorithms clearly optimize the whole system, and the SE and utility curves show clear optimization trends. The comparison shows that the curve obtained by the Dueling DQN algorithm is more stable than those of the other centralized algorithms. Especially after 2200 iterations, the curve obtained by the Dueling DQN algorithm rarely fluctuates strongly, and its average value converges to a relatively high value, which shows that using the Dueling DQN algorithm to solve the optimization problem of the lower model is a very effective method. Figures 12 and 13 compare the profit and utility of the upper model under the different algorithms. It can be seen from the line graphs that the optimization effect of the upper model using the DQN algorithm (red curve) is the best. After the number of iterations reaches 3500, the MVNO profit and the system utility function gradually converge to about 200 and 9, respectively. The QL algorithm is second: its curve is significantly higher than those of the other two algorithms, but after 3500 iterations it is more volatile than that of the DQN algorithm. The two curves obtained by the Double DQN and Dueling DQN algorithms perform worse.
Because the advantage of the DQN algorithm over the other three algorithms cannot be seen clearly from the line graphs, violin plots are used to analyze and compare the data. In a violin plot, the wider the blue area at a given value, the higher the proportion of samples taking that value. The middle line represents the mean, and the upper and lower lines represent the maximum and minimum values.
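The quantities that such a violin plot encodes can be computed directly, as the following minimal sketch shows. The function name, the bin count, and the sample values are hypothetical. A real violin plot normally uses a smoothed kernel density estimate for the width, so the histogram shares here are only a coarse stand-in for it.

```python
import statistics


def violin_summary(samples, bins=5):
    """Return the mean, extremes, and per-bin share of the samples."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    counts = [0] * bins
    for x in samples:
        idx = min(int((x - lo) / width), bins - 1)  # clamp the maximum
        counts[idx] += 1
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(samples),
        "share": [c / len(samples) for c in counts],  # "width" per bin
    }


# Hypothetical samples, not data from this paper.
summary = violin_summary([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0])
```

The bin with the largest share corresponds to the widest part of the violin. Comparing the mean and the location of that bin across algorithms is the kind of reading applied to the violin plots in this section.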
It is clear from Figures 14 and 15 that the system utility and MVNO benefit obtained by the DQN algorithm are better than those of the other algorithms. The average values of SE and utility obtained by the DQN algorithm are the largest, and the values are concentrated in a relatively high range, which is about 5.4% higher than Double DQN and about 2.6% higher than Dueling DQN. A possible reason is that Double DQN and Dueling DQN, which are refinements of the DQN algorithm, place too much emphasis on trial and error, which reduces the optimization effect on the system.
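Percentage comparisons of this kind are relative gains between average values after convergence. The sketch below shows one way such a figure could be computed. It is an assumption about the procedure, not the evaluation script of this paper, and the helper names and sample numbers are hypothetical.

```python
def tail_mean(values, start):
    """Average of a curve from iteration index `start` onwards."""
    tail = values[start:]
    if not tail:
        raise ValueError("start is beyond the end of the curve")
    return sum(tail) / len(tail)


def relative_gain_percent(candidate, baseline):
    """How much higher `candidate` is than `baseline`, in percent."""
    return 100.0 * (candidate - baseline) / baseline


# Hypothetical converged curves, not data from this paper.
curve_a = [50.0, 80.0, 105.0, 106.0, 105.2]
curve_b = [40.0, 70.0, 100.0, 99.0, 101.0]
gain = relative_gain_percent(tail_mean(curve_a, 2), tail_mean(curve_b, 2))
```

Averaging only the tail of each curve keeps the early exploration phase from distorting the comparison, which is why statements such as "after 2200 iterations" or "after 3500 iterations" accompany the percentages.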
In general, it can be seen from Table 4 that the algorithm proposed in this paper outperforms the comparison methods in optimization performance, convergence speed, and convergence stability.

The Complexity Analysis
In terms of time complexity, the algorithm proposed in this paper must generate the state after the environment and the MVNO interact in each iteration, so the running time required per iteration is difficult to determine. The number of iterations, however, is preset to 6000 in this paper.
From the perspective of space complexity, the space complexity of the DRL algorithm is determined by the number of neural network parameters that need to be stored, the number of real-time additions C_a, and the number of real-time multiplications C_m. The DRL algorithm used in this paper has K hidden fully connected layers, and each hidden layer is set with o_K neural units.
The neural network set up in this paper uses the ReLU activation function, the number of hidden layers is K = 2, and the two hidden layers contain o_K neurons. The space complexity then follows from formula (22) and formulas (25)-(27), from which it can be deduced that the complexity of the proposed algorithm is low. In addition, the results show that the proposed algorithm converges quickly and reaches the optimization results.
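As a rough illustration of how the three quantities scale with layer widths, the sketch below counts stored parameters, multiplications, and additions for one forward pass of a plain fully connected network. It uses the standard counts for dense layers and is not a reproduction of formulas (22) and (25)-(27). The layer sizes in the example are hypothetical, and the separate value and advantage heads of a Dueling DQN are not modelled.

```python
def dense_cost(layer_sizes):
    """Count parameters, multiplications, and additions for one forward pass.

    layer_sizes lists the width of every layer from input to output,
    e.g. [inputs, hidden_1, hidden_2, outputs].
    """
    params = mults = adds = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params += n_in * n_out + n_out  # weights plus biases
        mults += n_in * n_out           # one product per weight
        # Each output sums n_in products (n_in - 1 additions) plus a bias.
        adds += n_in * n_out
    return params, mults, adds


# Hypothetical sizes: 4 inputs, K = 2 hidden layers of 8 units, 3 actions.
params, mults, adds = dense_cost([4, 8, 8, 3])
```

For fixed K, all three counts grow with the product of adjacent layer widths, so a network with two small hidden layers stays cheap to store and evaluate.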

Conclusions
In this paper, we propose a two-tier slicing resource allocation algorithm with Dueling DQN and joint bidding to solve the optimization problem of resource allocation for multiple users in RAN scenarios. We first combine Dueling DQN and bidding in the upper tier of the proposed model to maximize the utilization of the BS resources, using exhaustive enumeration to obtain the optimal lower-tier actions corresponding to the upper-tier actions and a penalty function to prevent the MVNOs from overbidding. Dueling DQN is used in the lower tier of the model to allocate resources to the users connected to each MVNO. For comparison, bidding is also combined with the Q-learning algorithm in the upper tier of the model, and the hard slicing approach is combined with bidding; the comparison shows that using the Dueling DQN algorithm in combination with bidding exhibits better performance. The use of the Dueling DQN algorithm in the lower tier also shows superiority over the Double DQN algorithm, the DQN algorithm, and the Q-learning algorithm. Future work can take into account changes in user location and in service demand, in order to get closer to actual communication scenarios, and can improve the proposed two-tier model by combining the bidding algorithm with more advanced DL algorithms to obtain a better allocation scheme.