Augmented Lagrangian-Based Reinforcement Learning for Network Slicing in IIoT

Abstract: Network slicing enables the multiplexing of independent logical networks on the same physical network infrastructure to provide different network services for different applications. The resource allocation problem involved in network slicing is typically a decision-making problem, falling within the scope of reinforcement learning. Its advantage in adapting to dynamic wireless environments makes reinforcement learning a good candidate for solving the problem. In this paper, to tackle the constrained mixed integer nonlinear programming problem arising in network slicing, we propose an augmented Lagrangian-based soft actor-critic (AL-SAC) algorithm. In this algorithm, a hierarchical action selection network is designed to handle the hybrid action space. More importantly, inspired by the augmented Lagrangian method, neural networks for the Lagrange multipliers and a penalty item are introduced to deal with the constraints. Experiment results show that the proposed AL-SAC algorithm strictly satisfies the constraints and achieves better performance than other benchmark algorithms.


Introduction
With the rapid development of the industrial internet of things (IIoT), more and more devices are connected and controlled via wireless networks. Providing precise services for these devices to fulfill their diverse requirements becomes a fundamental issue in IIoT. Facing this challenge, three application scenarios are defined by the International Telecommunication Union (ITU) and the Fifth Generation Public Private Partnership (5G-PPP) [1,2], that is, enhanced mobile broadband (eMBB), ultra-reliable low latency communications (URLLC), and massive machine type communication (mMTC). In more detail, the eMBB scenario serves devices with high transmission-rate requirements, such as high-definition surveillance video in factories, where the peak rate for each camera can exceed 10 Gbps [3]. mMTC refers to scenarios where a large number of devices connect simultaneously while the requirements on transmission rate and delay are not critical [4]. In contrast, URLLC serves applications with strict requirements on transmission reliability and latency, such as automatic operators and controllers [5].
To satisfy these disparate scenarios within one network infrastructure, the network slicing technique was proposed. It divides a physical network into multiple independent logical networks [6,7], where each network slice is isolated from the others and provides one kind of network service via dedicated resource allocation. To efficiently allocate resources and cope with the dynamics of wireless networks, many intelligent algorithms have been proposed. For instance, in [8], a genetic algorithm, ant colony optimization combined with a genetic algorithm, and a quantum genetic algorithm were used to jointly allocate radio and cloud resources to minimize the end-to-end response latency experienced by each user. In [9], two deep learning technologies, supervised and unsupervised learning, were introduced to jointly optimize the user association and power allocation problems, combining data-driven and model-driven learning. The study in [10] exploited a learning-assisted slicing and concurrent resource allocation process to jointly improve the users' service reliability and the resource utilization rate.
Following the 5G network architecture, edge servers with caching and computing capacities are deployed close to the base stations (BSs). Such a deployment enables intelligent cooperation among neighboring BSs, which is suitable for network slicing in an intelligent factory. Resource allocation is a dynamic programming problem, which can also be solved effectively by reinforcement learning (RL). In [11], RL was utilized to dynamically update the number of radio resource units allocated to each slice, where a utility-based reward function was adopted to achieve efficient resource allocation. Cooperative proximal policy optimization was adopted in RL to maximize resource efficiency by considering the different characteristics of the different network slices in [12]. A general framework was proposed in [13] that uses RL to achieve dynamic resource management for dynamic vehicle networks in realistic environments. Moreover, considering the high complexity and combinatorial nature of future heterogeneous networks consisting of multiple radio access technologies and edge devices, multi-agent DRL-based methods were utilized in [14,15]. Specifically, in [14], the deep Q-network (DQN) algorithm was used in each agent to assign radio access technologies, while the multi-agent deep deterministic policy gradient (DDPG) algorithm was used to allocate power. The authors in [15] investigated a multi-agent cooperative problem in resource allocation, aiming at improving the data processing ability of wireless sensor networks and eliminating the non-stationarity problem for channel allocation.
However, resource allocation problems in wireless networks always involve constraints, e.g., the devices' various requirements on average latency, cumulative throughput, or average packet loss rate, which cannot be handled well by traditional RL. To manage the constraints, constrained Markov decision processes (CMDPs) arose, whose solution methods mainly fall into four classes: (1) penalty function methods, which add penalty terms to the optimization objective to construct an unconstrained optimization problem; for example, in [16], a logarithmic barrier function is introduced as a penalty. (2) Primal-dual methods, which use the Lagrangian relaxation technique to transform the original problem into a dual problem, for instance [17,18]. (3) Direct policy optimization, which replaces the objective or constraint in the original problem with a more tractable function, such as [19-21]. (4) Safeguard methods, which use an extra step mechanism to guarantee the constraints at each training step [22].
In addition to the constraints, a discrete-continuous mixed action space is involved in our work. Inspired by the concept of the augmented Lagrangian, which is similar to the primal-dual method, we propose an augmented Lagrangian-based soft actor-critic (AL-SAC) algorithm to solve the network slicing problem with constraints and a hybrid action space. The main contributions of this paper are as follows.

• A two-stage action selection is designed by considering a hierarchical policy network to solve the hybrid action space problem in RL, which can significantly reduce the action space;
• A penalty-based piece-wise reward function and a constraint-handling part involving neural networks for the Lagrangian multipliers and cost functions are introduced to solve the constraint problem;
• Simulation results show that our proposed algorithm satisfies the constraints, and AL-SAC achieves a higher reward than the DDPG algorithm with a penalty item.

System Model and Problem Formulation
This section firstly presents the network model of network slicing and the transmission rate model. Then, considering the various requirements of the different network slices, the constraints of the different types of devices are developed. Finally, a constrained mixed integer nonlinear problem is formulated.

Network Model
As illustrated in Figure 1, we consider a wireless IIoT with multiple BSs and devices with different network requirements. We denote the sets of BSs and devices by M = {1, 2, . . . , M} and N = {1, 2, . . . , N}, respectively. Moreover, the devices are categorized into three typical scenarios, i.e., eMBB, URLLC, and mMTC, which are denoted by N_eM, N_uR and N_mM, respectively. Hence, N_eM ∪ N_uR ∪ N_mM = N. We further denote the bandwidth available at the m-th BS by B_m, m ∈ M, and its transmission power by P_m. Additionally, a binary variable x_mn ∈ {0, 1} is used to denote the association between BS-m and device-n: x_mn = 1 if device-n is associated with BS-m, and x_mn = 0 otherwise. When x_mn = 1, b_mn denotes the bandwidth allocated by BS-m to device-n.

SINR and Transmission Rate
Denote the distance between the m-th BS and the n-th device by d_nm, and the channel fading gain by h_nm. The SINR received at the n-th device from the m-th BS can then be expressed as [23]

SINR_nm = P_m A_0 d_nm^(−α) h_nm / ( Σ_{m'∈M, m'≠m} P_m' A_0 d_nm'^(−α) h_nm' + σ² ),   (1)

where A_0 denotes the path loss at the reference distance d_nm = 1, α denotes the path-loss exponent, and σ² denotes the noise power. Hence, the transmission rate that the n-th device can achieve when associated with the m-th BS can be calculated as

r_nm = b_mn log₂(1 + SINR_nm).   (2)

For the n-th device, the overall transmission rate achieved is then given by

r_n = Σ_{m∈M} x_mn r_nm.   (3)
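As a concrete illustration, the rate model above can be sketched in a few lines of numpy. The parameter values below (powers, distances, A_0) are illustrative only, and the interference term assumes all other BSs transmit on the same band, as in the SINR expression.

```python
import numpy as np

def sinr(m, n, P, A0, d, h, alpha, sigma2):
    """SINR at device n from BS m: desired power over interference plus noise."""
    rx = P * A0 * d[:, n] ** (-alpha) * h[:, n]   # received power from every BS
    interference = rx.sum() - rx[m]               # all other BSs interfere
    return rx[m] / (interference + sigma2)

def rate(m, n, b_mn, **kw):
    """Shannon rate (bit/s) over bandwidth b_mn allocated by BS m to device n."""
    return b_mn * np.log2(1.0 + sinr(m, n, **kw))

# illustrative scenario: 2 BSs, 4 devices
rng = np.random.default_rng(0)
P = np.array([2.0, 2.0])              # transmit power of each BS (W)
d = rng.uniform(10, 100, (2, 4))      # BS-device distances (m)
h = rng.exponential(1.0, (2, 4))      # Rayleigh fading power gains
kw = dict(P=P, A0=1e-3, d=d, h=h, alpha=3.09, sigma2=1e-9)
r01 = rate(0, 1, 0.18e6, **kw)        # 0.18 MHz allocated by BS-0 to device-1
```

Increasing the allocated bandwidth b_mn increases the rate linearly in the prefactor, which is the lever the continuous action controls later in the paper.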

Requirements of Different Network Slices
As mentioned before, each device belongs to one typical scenario. To provide the required transmission service, three corresponding network slices are defined. That is,
• eMBB slice: The devices served by this network slice require a high transmission rate, such as devices with real-time streaming of high-resolution 4K or 3D video [24]. That is, the transmission rate achieved by these devices has a minimum requirement:

r_n ≥ R_0, ∀n ∈ N_eM,   (4)

where R_0 denotes the rate threshold.
• URLLC slice: The devices served by this network slice have a strict requirement on delay, which includes the transmission delay, queuing delay, propagation delay, and routing delay [25]. Denoting them by T_1, T_2, T_3, T_4, respectively, the end-to-end delay can be calculated as T_1 + T_2 + T_3 + T_4. The requirement on the wireless transmission delay is as follows:

T_1 = L / r_n ≤ T_0, ∀n ∈ N_uR,   (5)

where L denotes the packet length, and T_0 denotes the delay threshold. As mentioned in [26], the achievable rate of a URLLC wireless link, i.e., Equation (4) in [26], can be approximated by the Shannon capacity when the block length is large. For this reason, in this work, we use the Shannon capacity to calculate the link rate and focus on the transmission time, which is independent of the other delay components [27].
• mMTC slice: The devices served by this network slice have no strict rate or latency requirements [27]. Hence, to ensure the basic wireless connection, a minimum bandwidth B_0 should be allocated to support the connection. That is,

Σ_{m∈M} x_mn b_mn ≥ B_0, ∀n ∈ N_mM.   (6)

Problem Formulation
The aim of network slicing design is to maximize the overall utility achieved by the devices in the system. Firstly, inside each network slice, to achieve fairness among the inner-slice devices, proportional fairness is utilized as the utility function of each device [28], that is,

U_n = log(r_n).   (7)

More importantly, considering the disparity of throughput in the three network slices, weight preferences are utilized to balance their contributions to the overall utility [29]. In this work, we use w_eM, w_uR and w_mM to denote the weights for the devices in the eMBB, URLLC and mMTC slices, respectively.
Hence, via the BS-device association and bandwidth allocation, we have the following optimization problem to maximize the overall system utility:

max_{x_mn, b_mn}  w_eM Σ_{n∈N_eM} U_n + w_uR Σ_{n∈N_uR} U_n + w_mM Σ_{n∈N_mM} U_n   (8)
s.t.  Σ_{n∈N} x_mn b_mn ≤ B_m, ∀m ∈ M,   (9)
      Σ_{m∈M} x_mn = 1, ∀n ∈ N,   (10)
      r_n ≥ R_0, ∀n ∈ N_eM,   (11)
      L / r_n ≤ T_0, ∀n ∈ N_uR,   (12)
      Σ_{m∈M} x_mn b_mn ≥ B_0, ∀n ∈ N_mM.   (13)

Here, (9) represents that the bandwidth allocated to the associated users cannot exceed the overall bandwidth of the BS; (10) indicates that a device can only be associated with one BS at one time instance; and (11), (12) and (13) represent the network requirements of each slice introduced above.
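A minimal sketch of the weighted objective, assuming the logarithmic proportional-fairness utility above; the rates, slice labels, and weights below are illustrative values, not those of the paper's experiments.

```python
import numpy as np

def system_utility(rates, slice_of, weights):
    """Weighted proportional-fairness utility: sum over devices of w_slice * log(rate)."""
    return sum(weights[slice_of[n]] * np.log(r) for n, r in enumerate(rates))

# illustrative example: 4 devices, equal slice weights
w = {"eMBB": 1 / 3, "URLLC": 1 / 3, "mMTC": 1 / 3}
slices = ["eMBB", "eMBB", "URLLC", "mMTC"]
u = system_utility([4e6, 5e6, 2e6, 1e6], slices, w)   # rates in bit/s
```

The log shape means raising an already-high rate helps less than raising a low one, which is what yields inner-slice fairness.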
In essence, the above problem is a constrained mixed integer nonlinear problem. In the following, we propose an augmented Lagrangian-based reinforcement learning (RL) algorithm with a soft actor-critic (SAC) framework to solve it. Then, with the optimal results on the device-BS association and the bandwidth allocated to each device, an extra but simple procedure is needed to complete the slicing details: (1) Calculate the number of radio resource blocks (RRBs) needed for each slice in one BS, and determine the corresponding collection of RRBs.
(2) Each BS allocates a subset of the RRBs belonging to the slice to each serving device.

Proposed Augmented Lagrangian-Based Reinforcement Learning
In the considered scenario, the agent deploying the proposed algorithm can be the edge server in the 5G architecture. Since the edge server is deployed close to the BSs and receives the channel information feedback from them, the proposed algorithm can provide the optimal device-BS association and each device's bandwidth.
In this section, the basics of the augmented Lagrangian method are firstly presented. Then, the hybrid action space, state space, and reward function are defined. Finally, the architecture and workflow of the proposed AL-SAC algorithm are elaborated.

Preliminary of Augmented Lagrangian Method
The augmented Lagrangian method not only replaces the constrained optimization problem with an unconstrained problem but also introduces a penalty term to accelerate convergence [30]. Given an objective function f(x) to maximize with parameter x and the constraint functions c_i(x) ≥ 0, this optimization problem can be solved by its dual problem as follows:

min_{λ≥0} max_x  L(x, λ, μ) = f(x) + Σ_i λ_i c_i(x) − (μ/2) Σ_i [min(c_i(x), 0)]²,   (14)

where λ denotes the Lagrange multiplier vector for the constraints, and μ denotes the parameter of the penalty term. Then, the typical solving process alternately optimizes x and λ during the iterations.
Specifically, each multiplier λ_i is updated according to the rule

λ_i ← max(0, λ_i − μ c_i(x)).   (15)

Additionally, when the constraint is not satisfied, μ is enlarged by a scalar factor.
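To make the alternating iteration concrete, the following toy example applies the method to a one-dimensional problem (written in minimization form, so the signs are flipped relative to the maximization above): minimize (x − 2)² subject to x ≤ 1, whose solution is x* = 1 with multiplier λ* = 2. The step sizes and iteration counts are illustrative.

```python
def augmented_lagrangian_min(dg, mu=10.0, outer=20, inner=500, lr=0.01):
    """Toy augmented Lagrangian loop for: minimize g(x) subject to x <= 1.

    The inequality constraint c(x) = 1 - x >= 0 enters through the standard
    AL term (mu/2) * max(0, lam/mu + (x - 1))^2 - lam^2 / (2*mu).
    """
    x, lam = 0.0, 0.0
    for _ in range(outer):
        for _ in range(inner):                    # inner unconstrained minimization
            slack = lam / mu + (x - 1.0)
            grad = dg(x) + (mu * slack if slack > 0 else 0.0)
            x -= lr * grad
        lam = max(0.0, lam + mu * (x - 1.0))      # multiplier update, rule (15)
    return x, lam

# g(x) = (x - 2)^2, dg(x) = 2 (x - 2); unconstrained optimum x = 2 is infeasible
x, lam = augmented_lagrangian_min(lambda x: 2.0 * (x - 2.0))
```

Each outer iteration tightens the multiplier toward its KKT value, so the inner minimizer is pushed from the infeasible unconstrained optimum x = 2 onto the boundary x = 1.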

Definition of State, Action and Reward in RL
To solve this problem within the framework of RL, in the following, we further define the corresponding state space, action space, and reward function for this problem.

State Space
In reinforcement learning, the state space represents the environment observed by an agent. Hence, in our scenario, the wireless environment, i.e., the channel condition between the BSs and the devices, is defined as the state. Since the locations of the devices in the factory are fixed, the channel condition is only related to the channel fading gain; that is, the state can be expressed as s = {h_nm | n ∈ N, m ∈ M}. Based on the state, we can calculate the SINR by (1), and then the transmission rate and the other parameters involved in the optimization problem.

Hybrid Action Space
Considering the discrete and continuous variables involved in problem (8), a hybrid action space is used in this problem. That is, the discrete action represents the association between devices and BSs, while the continuous action represents the bandwidth assigned by the BS.
Since a BS only allocates bandwidth resources to its associated devices, we have b_mn > 0 only when x_mn = 1. Hence, a hierarchical policy network is designed, where we divide the action selection into two stages to significantly reduce the action space. As illustrated in Figure 2, the association action a_1 is selected at Stage-1 based on the state s, and then, at Stage-2, the bandwidth allocation action a_2 is chosen based on the set {s, a_1}.
In the t-th time episode, we denote the action for the whole system as a^(t) = {a_1^(t), a_2^(t)}.
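The two-stage selection can be sketched as follows. This is a simplified numpy illustration that samples from softmax scores rather than trained networks, and it conditions Stage-2 on a_1 only by masking to the chosen association; it is not the paper's exact network architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_action(assoc_logits, bw_logits, B):
    """Two-stage hybrid action: discrete association, then continuous bandwidth.

    assoc_logits: (M, N) scores for associating each device n with each BS m.
    bw_logits:    (M, N) scores turned into per-BS bandwidth shares.
    B:            (M,) total bandwidth of each BS.
    """
    M, N = assoc_logits.shape
    # Stage 1: each device samples one BS from a categorical distribution.
    probs = np.exp(assoc_logits - assoc_logits.max(axis=0))
    probs /= probs.sum(axis=0)
    assoc = np.array([rng.choice(M, p=probs[:, n]) for n in range(N)])
    # Stage 2: each BS splits its bandwidth among its associated devices
    # via a softmax over the continuous-action scores.
    b = np.zeros((M, N))
    for m in range(M):
        mine = np.where(assoc == m)[0]
        if mine.size:
            share = np.exp(bw_logits[m, mine] - bw_logits[m, mine].max())
            b[m, mine] = B[m] * share / share.sum()
    return assoc, b

assoc, b = select_action(np.zeros((2, 4)), np.zeros((2, 4)), np.array([10.0, 12.5]))
```

Because Stage-2 only allocates to devices chosen in Stage-1, the joint space shrinks from all (association, allocation) pairs to a per-BS allocation over its own devices.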

Reward Function
In (8), we have multiple constraints. For the association constraint, i.e., (10), and the requirement constraints of the devices, i.e., (11)-(13), we directly map the actions to the corresponding feasible ranges. As for the bandwidth constraint (9), we consider an augmented Lagrangian method to constrain the total bandwidth, since it cannot be handled at the time of action selection.
Motivated by the augmented Lagrangian method, we introduce the penalty into the design of the reward function. In more detail, when the bandwidth constraints of all BSs are satisfied, the reward function equals the original weighted overall utility, i.e., the objective in (8). Otherwise, the reward equals a penalty item similar to that of the augmented Lagrangian method, that is,

r^(t) = − Σ_{m∈M̃} [ λ_m (G_m − B_m) + (μ/2) (G_m − B_m)² ],   (21)

where the set M̃ only involves the BSs that do not satisfy the bandwidth constraint, and G_m = Σ_{n∈N} x_mn b_mn represents the overall bandwidth used at the m-th BS. In (21), when the total bandwidth constraint is not satisfied, the reward is a negative value related to the exceeded bandwidth.

Proposed AL-SAC Algorithm
The framework of SAC includes the actor and critic parts, which are responsible for policy improvement and policy evaluation, respectively. A policy is the function that returns a feasible action for a state, denoted by π; that is, a ∼ π(·|s).
In SAC, the algorithm aims to find the optimal policy which maximizes both the expected reward and the entropy of the policy, that is,

π* = argmax_π E[ Σ_t γ^t ( r^(t) + H(π(·|s^(t))) ) ],

where γ (0 < γ ≤ 1) is the discount factor, and H(π(a|s)) denotes the Shannon entropy of the policy π. Considering the hybrid action space in our problem, the entropy of policy π involves the entropy of its two parts: H(π(a|s)) = β_1 H(π(a_1|s)) + β_2 H(π(a_2|s, a_1)), where β_1 and β_2 are the entropy temperatures. As illustrated in Figure 3, the architecture of the proposed AL-SAC algorithm incorporates a constraint part in addition to the original actor part, critic part, and replay buffer. Specifically,

• Actor part: it deploys a policy network, denoted by π, which generates the policy of device association and bandwidth allocation;
• Critic part: it deploys a value network and a Q-value network, denoted by V and Q, estimating the value of each state and each state-action pair, respectively;
• Constraint part: it deploys Lagrangian multiplier networks and cost networks, denoted by L and C, which estimate the cost values of the constraints and adjust the Lagrangian multipliers accordingly;
• Replay buffer: it is used in DRL to store the tuples, i.e., {s^(t), a^(t), r^(t), s^(t+1), G_m^(t)}, from which sampled tuples are used in the neural network training.
In the following, we describe the updating process of each neural network in these three parts.

Figure 3. The architecture of the proposed AL-SAC algorithm.

Value Network V
This network is utilized for estimating the state value and the target state value, i.e., V_φ(s) and V_φ̄(s), where φ and φ̄ are the parameters, and φ̄ is updated by an exponentially moving average of the value network weights [31]. In the learning process, this network is trained by minimizing the squared residual error

J_V(φ) = E[ (1/2) ( V_φ(s^(t)) − E_{a^(t)∼π_θ}[ Q_ψ(s^(t), a^(t)) − log π_θ(a^(t)|s^(t)) ] )² ].   (25)

Then, the gradient in (25) can be estimated by an unbiased estimator and used in the update of the neural network, that is,

∇_φ J_V(φ) = ∇_φ V_φ(s^(t)) ( V_φ(s^(t)) − Q_ψ(s^(t), a^(t)) + log π_θ(a^(t)|s^(t)) ).

Q-Value Network Q
To evaluate the reward function, a Q-value network is deployed to calculate the state-action value for each action, i.e., Q_ψ(s, a), where ψ denotes the parameters of the neural network. This network is trained by minimizing the soft Bellman residual

J_Q(ψ) = E[ (1/2) ( Q_ψ(s^(t), a^(t)) − ( r^(t) + γ V_φ̄(s^(t+1)) ) )² ],

where V_φ̄(s^(t+1)) is the target state value mentioned above, used to enhance the training stability. Then, this neural network is updated by stochastic gradient descent on this loss.
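For illustration, the soft Bellman residual can be computed on a batch as follows; this numpy sketch assumes the target values V_φ̄(s') are already given (in the real algorithm they come from the target value network).

```python
import numpy as np

def soft_bellman_loss(q_sa, r, v_target_next, gamma=0.99):
    """Mean squared soft Bellman residual over a batch:
    0.5 * (Q(s,a) - (r + gamma * V_target(s')))^2."""
    target = r + gamma * v_target_next   # bootstrapped target from the target value net
    return 0.5 * np.mean((q_sa - target) ** 2)
```

The target is treated as a constant during the update, so the gradient only flows through Q_ψ, which is what stabilizes bootstrapped training.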

Constraint Networks C
Multiple constraint networks are also deployed to estimate the expected constraint cost of the bandwidth allocation. Similar to double deep Q-learning, each constraint network has two separate Q-value networks with parameters ϕ_m and ϕ̄_m. They generate the continuous action-state value for the m-th BS, i.e., C_{ϕ_m}(s, a_2), which is utilized for estimating the value of the allocated bandwidth. This network is trained by minimizing the squared residual between C_{ϕ_m}(s^(t), a_2^(t)) and the cost target G_m^(t) + γ C_{ϕ̄_m}(s^(t+1), a_2^(t+1)), where C_{ϕ̄_m} provides the target action-state value for training stability, and ϕ̄_m is periodically updated by copying ϕ_m.

Lagrangian Multiplier Network L
To update the Lagrange multiplier λ in (15), we also deploy Lagrangian multiplier networks. As mentioned in Section 3.1, we can learn λ by minimizing the objective function according to the verification of the constraints; that is, the multiplier λ_m is increased when the bandwidth constraint of the m-th BS is violated, and driven toward zero otherwise.

Policy Network π
This network searches for the optimal policy according to the estimated values generated by the networks in the critic and constraint parts. It generates an action based on the policy π for each state, i.e., π_θ(·|s), where θ is the parameter of the policy network. As mentioned earlier, we consider a two-stage action selection. Specifically, the discrete action is selected first, i.e., the BS connection state is determined. Then, with the corresponding channel state between the device and the BS, the bandwidth allocation is determined.
Then, we can update π by maximizing the following function:

J_π(θ) = E[ Q_ψ(s^(t), a^(t)) − log π_θ(a^(t)|s^(t)) ].   (34)

The gradient of (34) can be approximated by the reparameterization trick, where a^(t) = f_θ(ε^(t); s^(t)), ε^(t) is an input noise vector sampled from a Gaussian distribution, and π_θ is defined implicitly in terms of f_θ [31].
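A minimal numpy illustration of such a reparameterized policy, using the common squashed-Gaussian form f_θ(ε; s) = tanh(μ + σ·ε) with its log-density correction; the squashing choice is an assumption for illustration, as the paper does not spell out the exact form of f_θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_action(mu, log_std, eps=None):
    """Reparameterized squashed-Gaussian action a = tanh(mu + std * eps),
    with the tanh-corrected log-probability (as in standard SAC)."""
    std = np.exp(log_std)
    if eps is None:
        eps = rng.standard_normal(mu.shape)
    u = mu + std * eps                       # differentiable in mu and log_std
    a = np.tanh(u)                           # squash into (-1, 1)
    # log N(u; mu, std) minus the tanh change-of-variables correction
    log_prob = (-0.5 * (eps ** 2 + np.log(2 * np.pi)) - log_std
                - np.log(1 - a ** 2 + 1e-6)).sum()
    return a, log_prob
```

Because the randomness enters only through ε, the action is a deterministic, differentiable function of the policy parameters, so the gradient of (34) can pass through the sampled action.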
The workflow of the proposed AL-SAC algorithm is summarized in Algorithm 1. Specifically, lines 2-6 illustrate the experience collection from the environment with the current policy: observe the environment state s^(t), select the hybrid action a^(t), calculate the total bandwidth allocated at each BS, i.e., G_m^(t), calculate the reward r^(t) depending on G_m^(t), and store the resulting tuple in the replay memory buffer D. The update of the networks is presented in lines 8-18: once t > K, a batch of samples (s^(t), a^(t), r^(t), s^(t+1)) is randomly selected from D, and the networks are updated as described above.

Simulation
In this section, we first test the performance of the proposed AL-SAC algorithm in various scenarios and verify that the required constraints are satisfied. Moreover, we compare it with other benchmark algorithms, including the original SAC algorithm and the DDPG algorithm with a penalty item for dealing with constraints. In the end, the network service performance for devices with different weight proportions is shown.

Parameter Setting
We consider a wireless scenario with multiple BSs and eMBB/URLLC/mMTC devices, where the locations of the devices are randomly distributed. The channel fading between a device and a BS follows a Rayleigh distribution, varying with time. Combining the distance-based path loss and the fading, the channel condition in an episode can be calculated.
In the simulation, the number of BSs is M = 2, and the numbers of devices in the eMBB, URLLC, and mMTC slices are (3, 3, 4) or (5, 5, 5). Three different weight designs are considered, that is, (w_eM, w_uR, w_mM) = (1/3, 1/3, 1/3), (2/3, 1/6, 1/6), or (1/10, 3/5, 3/10). Furthermore, the transmission power P_m, the path-loss exponent α and the noise power σ² are set as 2 W, 3.09 and 10⁻⁹ W, respectively [32]. For mMTC devices, B_0 = 0.18 MHz; for eMBB devices, R_0 = 4 Mbps; for URLLC devices, T_0 = 20 ms. The available bandwidth of each BS is set as B_m = 10, 12.5, or 15 MHz, the neural networks are trained by the Adam optimizer, and the batch size is set as K = 256. All these simulation parameters are also listed in Table 1 for clarity.

In Figure 4, we plot the average reward achieved by the proposed AL-SAC algorithm when the maximum bandwidth available at each BS is B_m = 10, 12.5, or 15 MHz with the number of devices N = 10, as well as when B_m = 10 MHz with N = 15. Firstly, it is observed that in all cases, the proposed AL-SAC algorithm converges, although a slight fluctuation exists when B_m = 15 MHz and N = 10. Secondly, we can see that with the growth of the available bandwidth, the curve of the achieved reward is a little less stable. The reason behind this is that the bandwidth allocation options for the multiple devices, i.e., the action space, also increase. Thirdly, it can be seen that with the growth of the available bandwidth resources, the reward achieved by the proposed algorithm increases noticeably, which implies increasing system utility as more radio resources become available. Lastly, compared with the case B_m = 15 MHz, N = 10, the proposed AL-SAC algorithm with B_m = 15 MHz, N = 15 reaches a similar reward with a faster convergence speed, because when more devices share the same limited bandwidth resources, the weighted overall utility is not higher. Moreover, the action space for each device reduces when the number of devices increases, resulting in more stable performance.
Meanwhile, to verify that the proposed algorithm provides an effective solution to the constrained RL problem, in Figure 5, we also show the bandwidth constraint in the same scenarios as Figure 4. It can clearly be seen that in all cases, the proposed AL-SAC meets the bandwidth requirements after 100 episodes. This also shows that the proposed algorithm provides a feasible and effective solution to the constrained optimization problem considered.

In Figures 6 and 7, we compare the performance of the proposed AL-SAC algorithm with two benchmark algorithms: the DDPG algorithm with the penalty involved in the reward function, termed Penalty DDPG [33], and the original soft actor-critic algorithm without constraint handling, termed SAC [31]. As observed from Figure 6, the SAC algorithm achieves the highest reward, which is much larger than those achieved by AL-SAC and Penalty DDPG. However, from Figure 7, we can see that its overall allocated bandwidth exceeds the maximum bandwidth constraint. Hence, it can be concluded that the high reward achieved by the SAC algorithm arises because it cannot handle the constraint on the sum of the actions, effectively making more radio resources available than actually exist, i.e., the BSs cannot satisfy the constraints in (9).
More importantly, comparing the rewards achieved by the proposed AL-SAC and Penalty DDPG, we can see from Figure 7 that both of them satisfy the maximum bandwidth constraint, while the proposed AL-SAC algorithm significantly outperforms the Penalty DDPG algorithm in terms of the reward achieved. The reason behind this is that although Penalty DDPG can meet the bandwidth constraints by relying on the penalty item in the rewards, the proposed AL-SAC handles much smaller discrete and continuous action spaces thanks to the two-stage design, and the introduction of the constraint part makes the optimization process more effective. Specifically, the proposed AL-SAC algorithm achieves an improvement of around 42.1% in reward compared to the Penalty DDPG algorithm after 5000 episodes. Moreover, from Figure 7, the overall bandwidth allocated by the proposed AL-SAC algorithm is a little larger than that of the Penalty DDPG, while both remain under the maximum bandwidth constraint B_m = 10 MHz. This shows that the policy of device association and bandwidth allocation trained by the proposed AL-SAC algorithm is more efficient.

Results and Analysis
Furthermore, in Figure 8, we verify the performance of each device in the different network slices. We consider the scenario with the maximum bandwidth B_m = 10 MHz; the number of devices N = 15; the minimum rate requirement for eMBB devices R_0 = 4 Mbps; the delay requirement for URLLC devices T_0 = 20 ms; and the minimum bandwidth requirement for mMTC devices B_0 = 0.18 MHz. It can be seen from the figure that the proposed AL-SAC algorithm makes all of the devices in the three network slices meet their corresponding constraints, while the Penalty DDPG algorithm cannot. As shown in Figure 8a,b, under Penalty DDPG, the device eMBB-3 cannot achieve the required rate, and the devices URLLC-3 and URLLC-5 exceed the maximum delay in (5). Moreover, comparing the transmission rate, delay and bandwidth constraints in the three sub-figures, it can be seen that the network performance achieved by the devices in one slice is more even when the proposed AL-SAC algorithm is adopted. This means a better fairness performance is obtained by our proposed AL-SAC algorithm.

To see how the results are affected by different weights, in Figure 9, we also plot the average reward achieved by the proposed AL-SAC algorithm for the weights (w_eM, w_uR, w_mM) = (1/3, 1/3, 1/3), (2/3, 1/6, 1/6), and (1/10, 3/5, 3/10), with the maximum bandwidth B_m = 15 MHz and N = 10. It can be seen that (1) comparing the blue and orange bars in Figure 9a, the total bandwidth allocated to the eMBB slice grows when the weight w_eM increases; (2) comparing the blue and yellow bars in Figure 9b, the overall delay in the URLLC slice reduces when the weight w_uR increases; and (3) comparing the yellow bars in Figure 9a,c, the overall bandwidth allocated to the devices in the mMTC slice grows proportionally when the weight w_mM increases. This is because the agent pays more attention to the slice with the higher weight.

Conclusions
In this paper, we investigated the network slicing problem in IIoT, where the device association and bandwidth allocation for devices in different slices are jointly optimized. Formulating it as a constrained mixed integer nonlinear programming problem with continuous and discrete variables, we proposed an augmented Lagrangian-based SAC algorithm to solve it using DRL. Aiming to maximize the total weighted utility under limited bandwidth resources, cost neural networks and Lagrangian multiplier networks are introduced to update the Lagrangian multipliers, and the penalty term is also introduced into the reward function. Moreover, a novel two-stage action selection network is presented based on DRL to handle the hybrid actions and decrease the action space simultaneously. Our results verify that the proposed AL-SAC algorithm can effectively meet the constraints and achieve better performance than other benchmark algorithms in terms of average reward and fairness.

Conflicts of Interest:
The authors declare no conflict of interest.