AQMDRL: Automatic Quality of Service Architecture Based on Multistep Deep Reinforcement Learning in Software-Defined Networking

Software-defined networking (SDN) has become one of the critical technologies for data center networks, as it can improve network performance from a global perspective using artificial intelligence algorithms. Due to its strong decision-making and generalization abilities, deep reinforcement learning (DRL) has been used in SDN intelligent routing and scheduling mechanisms. However, traditional deep reinforcement learning algorithms suffer from slow convergence and instability, resulting in poor network quality of service (QoS) for an extended period before convergence. To address these problems, we propose an automatic QoS architecture based on multistep DRL (AQMDRL) to optimize the QoS performance of SDN. AQMDRL uses a multistep approach to solve the overestimation and underestimation problems of the deep deterministic policy gradient (DDPG) algorithm. The multistep approach replaces the one-step Q-value function with the maximum value of the n-step actions currently estimated by the neural network, which reduces the possibility of positive error in the Q-value function and effectively improves convergence stability. In addition, we adopt a prioritized experience sampling mechanism based on SumTree binary trees to improve the convergence rate of the multistep DDPG algorithm. Our experiments show that AQMDRL significantly improves convergence performance and effectively reduces the network transmission delay of SDN compared with existing DRL algorithms.


Introduction
In the era of the Internet of Things (IoT), sensors can connect any physical device to the Internet, creating a digital and intelligent world. Individuals, businesses and industries can work around the clock with vast amounts of data collected and transmitted by IoT sensors, leading to the exponential growth of Internet traffic. Thus, owing to the unprecedented growth of Internet traffic and the stringent quality of service (QoS) requirements of various applications, traditional data center networks are under tremendous pressure.
Software-defined networking (SDN) has become one of the critical technologies for data center networks, with features such as the decoupling of the data plane and control plane, centralized management of the global network by the controller, and programmability. SDN controllers can improve network performance from a global perspective using artificial intelligence algorithms, in terms of network throughput, packet loss, transmission delay and load balancing, which is of great significance for ensuring network QoS [1]. However, most traditional QoS optimization schemes in SDN are based on traditional routing or heuristic algorithms [2][3][4][5][6]. Heuristic algorithms need to build models based on specific real-world problems; when faced with dynamically changing complex networks, they are often powerless to guarantee network QoS.
In recent years, many researchers have applied deep learning (DL) to the QoS optimization of SDN due to its powerful learning capability and excellent performance [7][8][9][10][11]. However, deep learning requires large datasets for model training and has poor generalization ability, making dynamic network performance optimization challenging. Compared to deep learning, reinforcement learning (RL) employs online learning to train a model by continuously exploring, learning and changing its behavior to obtain the best return. Reinforcement learning can generate action decisions based on the state of the environment and reward feedback, and has strong decision-making and generalization capabilities. Therefore, some researchers have applied reinforcement learning to the QoS optimization of networks, allowing the agent to interact with the network environment in order to achieve adaptive optimization of network performance [12][13][14][15][16][17][18]. However, the perceptual capability of RL algorithms is weak, which makes the decisions of RL algorithms in dynamic network scenarios highly variable and extends the learning time. Deep reinforcement learning (DRL) incorporates the perceptual capability of DL and can effectively improve the convergence performance of RL algorithms. Thus, it has attracted the attention of many researchers and has been applied to SDN performance optimization, for example through the deep Q-network (DQN) [19][20][21][22], deep deterministic policy gradient (DDPG) [23][24][25][26][27][28][29] and twin delayed deep deterministic policy gradient (TD3) [30,31] algorithms. However, the DQN algorithm can only make decisions in a discrete action space and therefore has limited capability. The slow and unstable convergence of the DDPG and TD3 algorithms leads to extended trial-and-error periods, making the network jittery for a long time; this results in poor QoS performance and seriously disrupts normal network operation.
To address the above challenges, we propose an automatic QoS architecture based on multistep DRL (AQMDRL) to optimize the QoS performance of SDN in terms of load balancing and network latency. The AQMDRL agent first receives the network link status and the evaluated reward value; we use the link delay and the standard deviation of link bandwidth as the agent's reward function. It then generates an action, namely the interconnection weight matrix. Finally, it uses the Dijkstra algorithm to select the optimal routing path through the interconnection weight matrix and creates the forwarding flow table. In each episode, the agent evaluates the quality of the action and determines the next action through the reward value, so as to minimize the link delay and achieve link load balance. To avoid long-term network jitter before convergence, the AQMDRL algorithm aims to solve the slow convergence rate and instability of traditional DRL algorithms applied to QoS optimization of SDN. To make the algorithm more stable after convergence, it uses a multistep strategy to optimize the DDPG algorithm, which effectively alleviates the overestimation and underestimation problems of DDPG. In addition, it adopts a SumTree-based prioritized experience sampling mechanism for the experience sampling of DDPG to improve the convergence rate. Our experiments verify that the proposed optimization scheme effectively improves the convergence rate and stability, and reduces the impact on network performance caused by the reinforcement learning model in the trial-and-error phase.
The main contributions of this paper are as follows:
1. We propose an automatic QoS architecture, AQMDRL, to effectively address the poor service quality caused by unstable SDN performance.
2. We design a DRL-based QoS optimization model to evaluate SDN quality of service.
3. We use a multistep strategy to optimize the DDPG algorithm. The multistep approach replaces the one-step Q-value function with the maximum value of the n-step actions currently estimated by the neural network, which reduces the possibility of positive error in the Q-value function and effectively improves convergence stability. In addition, we adopt prioritized experience sampling based on SumTree binary trees to improve the convergence rate of the multistep DDPG algorithm.
4. We verify the effectiveness of the AQMDRL algorithm by designing an experimental system.
The rest of this paper is organized as follows: Section 2 reviews related research; Section 3 describes the network architecture and the AQMDRL algorithm; Section 4 presents the experiments and analysis; Section 5 concludes the paper.

Related Research
Traditional network-based QoS policies usually rely on open shortest path first (OSPF) [2], equal-cost multipath routing (ECMP) [3] or heuristic optimization algorithms [4][5][6] for forwarding traffic. However, OSPF and ECMP use fixed forwarding methods, which easily result in link congestion when faced with complex traffic tasks. Heuristic algorithms need to build models based on specific real-world problems; when faced with dynamically changing complex networks, they are often powerless to guarantee network service quality and have limited scalability.
Traditional network-based QoS policies cannot handle increasingly complex tasks and no longer meet the needs of SDN environments. In recent years, DL has been increasingly applied to SDN networks due to its powerful learning algorithms and excellent performance. Zou et al. [7] proposed a deep learning approach to perform time-aware service QoS prediction tasks through feature integration. In other studies [8][9][10], a long short-term memory (LSTM) neural network algorithm was used to optimize SDN network performance. Chen et al. [11] employed the LSTM traffic prediction method in the SDN application plane to solve the SDN load balancing problem. Nonetheless, deep learning requires large datasets for training and has poor generalization ability, arising from its inability to interact with the environment. These limitations make it difficult to optimize the performance of dynamic networks. Compared with DL, RL uses online learning for model training, changing the agent's behavior through continuous exploration and experimentation to obtain the best reward; therefore, it does not need to train a model in advance, but can change its actions according to the environment state and reward feedback. Reinforcement learning has strong decision-making and generalization ability. Several authors [12][13][14] have combined RL with SDN to optimize QoS. Younus et al. [15] used the Q-learning algorithm to optimize network performance and effectively improve the network convergence speed, and in [16] incorporated QoS into the reward function. Casas-Velasco et al. [17] proposed the Reinforcement Learning and Software-Defined Networking Intelligent Routing (RSIR) algorithm, which avoids links with high latency and packet loss by minimizing the reward value of the Q-learning agent; RSIR outperforms Dijkstra's algorithm regarding link throughput, packet loss and latency. Al-Jawad et al. [18] used the Q-learning algorithm to implement a trade-off between QoS and users' quality of experience (QoE). However, the Q-learning algorithm learns the optimal action-value function as a Q-table and does not access previous experience, so it lacks the data representation capability of deep learning.
Deep reinforcement learning utilizes the data representation capabilities of deep neural networks to implement reinforcement learning decisions, which opens up a new avenue for solving complex network QoS problems. Baek et al. [19] used a DQN-based optimization method to approximate the optimal value function, in order to maximize and stably converge the common objective of the nodes. Chen et al. [20] combined a graph neural network and ensemble learning with the DQN algorithm to address the lack of generalization ability in untrained SDN topologies. Other scholars [21,22] combined DQN and LSTM to achieve network routing policy optimization. However, the action space of DQN is discrete, so it cannot handle high-dimensional continuous action spaces. DDPG solves the problem that DQN can only make decisions in discrete action spaces. In further studies [23][24][25], DDPG was applied to SDN routing optimization, and the schemes achieved intelligent optimization of the network and effectively reduced network latency. Lan et al. [26] proposed the QoS optimization algorithm R-DRL, which combines the DDPG algorithm with LSTM to generate a dynamic traffic scheduling policy that satisfies the QoS optimization objective. Guo et al. [27] proposed a DDPG-based DQSP protocol that guarantees security and QoS in the network. Mai et al. [28] proposed a DDPG-based slicing method to provide different QoS guarantees. Chen et al. [29] updated the sampling method of the experience pool using the SumTree structure, which improves the random sampling strategy of the experience replay mechanism in DDPG. Although the DDPG algorithm can learn more efficiently from subsequent actions, it blindly selects the action with the largest Q-value, causing DDPG to suffer from an overestimation problem.
In order to solve the overestimation problem of the DDPG algorithm, Fujimoto et al. [30] proposed the TD3 algorithm, which refers to the clipped double Q-learning algorithm in the value network and uses delayed policy update and target policy smoothing techniques. TD3 improves the high estimation and high variance present in DDPG. Sun et al. [31] proposed a TD3-based intelligent routing technique that can self-adapt to dynamic network environments. Other scholars [32][33][34] used pinning control to address the poor scalability and low robustness problems that commonly exist in the current DRL algorithm applied to network routing policy generation.
Although the TD3 algorithm alleviates the overestimation problem, its use of the minimum for value clipping may introduce significant underestimation bias and affect convergence performance. TD3 still suffers from slow convergence and instability, seriously affecting network QoS. Pan et al. [35] used the Boltzmann softmax operator to estimate the value function to solve the underestimation problem of TD3; however, this method increases the complexity of the TD3 algorithm and requires more computing resources. Huang et al. [36] used a graph convolutional network (GCN) to optimize DRL networks. Wang et al. [37] proposed a reinforcement learning-based relay graph neural network (GNN). Models combining GNN and RL perform better in specific network scenarios, but must optimize the parameters of both networks simultaneously, requiring a longer training time. Meng et al. [38] proposed a mixed multistep DDPG approach (MMDDPG) and empirically demonstrated that the multistep approach can effectively mitigate the overestimation problem in DRL. Inspired by this strategy, we propose an improved DDPG algorithm that uses a multistep method and prioritized experience sampling to optimize the QoS performance of SDN. It enhances the algorithm's stability by estimating the target Q-value after n steps when updating the network, and improves the convergence rate by preferentially sampling experiences with a larger temporal-difference error (TD error), thus solving the problems of unstable network performance and poor quality of service.

System Architecture and Components
The DRL-based QoS optimization architecture includes a data plane, a control plane and an application plane, as shown in Figure 1. The specific functions of each plane are as follows:
1. Data plane. The data plane consists of SDN-enabled switches responsible for network data transmission. It senses the network state in real time and passes it to the control plane through the southbound interface. Meanwhile, it receives policies issued by the control plane. When a new flow arrives, the switch queries the flow table entries according to priority and matches the data one by one. Then, it performs the corresponding processing step according to the matching rules and results. If the flow does not match any flow table entry, it is discarded or reported to the controller.
2. Control plane. The control plane consists of the SDN controllers responsible for flow table configuration. It passes the network state collected from the switches to the application plane. Next, it receives the interlink weight matrix generated by the application plane. Finally, it uses the Dijkstra algorithm to select the optimal routing path and sends the forwarding flow table to the switches.
3. Application plane. The application plane runs the AQMDRL agent responsible for generating routing optimization policies. The AQMDRL agent uses the network state as the state input and communicates the generated interlink weight matrix to the control plane. This process is repeated until the agent learns the action strategy that maximizes the reward value.
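The path computation performed by the control plane can be sketched in plain Python: given the interlink weight matrix produced by the agent, Dijkstra's algorithm selects the lowest-weight path for each source-destination pair. This is an illustrative sketch only; the switch names and the fallback behavior are assumptions, not from the paper.

```python
import heapq

def dijkstra_path(weights, nodes, src, dst):
    """Shortest path under the agent's link weights.
    weights: {(u, v): w} for symmetric (undirected) links."""
    adj = {n: [] for n in nodes}
    for (u, v), w in weights.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {n: float("inf") for n in nodes}
    prev = {}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    # Walk back from dst to src to recover the path.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# With low agent weights on s1-s2 and s2-s3, the two-hop path beats the direct link.
path = dijkstra_path({("s1", "s2"): 0.2, ("s2", "s3"): 0.5, ("s1", "s3"): 1.0},
                     ["s1", "s2", "s3"], "s1", "s3")
```

In the actual architecture, the controller would install the resulting paths as flow table entries on the switches.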

Problem Definition and System Model
Based on the system architecture described above, the role of the DRL agent is to assign routes for incoming traffic demands to maximize network utility. In this section, we describe the DRL-based QoS optimization model of the network performance evaluation.
We use G to refer to a network in the data plane, as shown in Equation (1): where V represents the set of forwarding nodes, as shown in Equation (2); E represents the set of links, as shown in Equation (3).
We use c i to denote the capacity of link e i and define C as the set of network link capacities, as shown in Equation (4), where |C| = |E| = m.
We use b t ij to denote the traffic demand from node v i to node v j at time t and define B t as the set of all traffic demands, as shown in Equation (5).
We define ϕ t as the link-delay coefficient at time t, which is calculated by Equation (6), where d t i represents the delay of link e i at time t. The higher the link-delay coefficient, the lower the transmission delay of the link, and the faster the transmission.
We define ρ t as the load-balancing coefficient at time t, which is calculated by Equation (7), where u t i represents the bandwidth usage of link e i at time t, and σ(·) represents the standard deviation. The load-balancing coefficient measures the load-balancing degree of the network: the smaller the coefficient, the better the load balancing.
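As a rough sketch of the two coefficients: the load-balancing coefficient ρ t of Equation (7) is the standard deviation of per-link bandwidth usage, as stated. Since Equation (6) is not reproduced here, the link-delay coefficient below uses an assumed reciprocal-of-mean-delay form that merely matches the stated monotonicity (higher coefficient, lower delay).

```python
import statistics

def load_balancing_coeff(link_usage):
    """rho_t (Eq. (7)): standard deviation of per-link bandwidth usage."""
    return statistics.pstdev(link_usage)

def link_delay_coeff(link_delays):
    """phi_t sketch: reciprocal of the mean link delay (assumed form)."""
    return len(link_delays) / sum(link_delays)
```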
Reinforcement learning is a process in which an agent learns to achieve an objective by interacting with the environment. The agent generates an action in the environment, and the environment evaluates the action against the objective, stimulating the agent to generate the next action. In this paper, the optimization objective is to maximize the link-delay coefficient ϕ t and minimize the load-balancing coefficient ρ t , thereby optimizing the QoS performance of SDN in terms of load balancing and network latency.
Given the above optimization objective, we design the state, action, and reward of reinforcement learning as follows:

1. State. We set the state as the network state information obtained from the environment. In this paper, the state corresponds to the traffic demand B t , as shown in Equation (8).
2. Action. We design the action as the set of link weights generated by the DRL agent at time t, as shown in Equation (9), where w i denotes the weight value of link e i .
3. Reward. The reward function is the return for the previous action and is used to evaluate the service quality. According to the optimization objective, we design the reward function as Equation (10), where β 1 and β 2 denote the weight coefficients of the reward. To ensure that the link-delay coefficient ϕ t and the load-balancing coefficient ρ t are of the same order of magnitude, we use log 2 (·) to process the link-delay coefficient.
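A hedged sketch of the reward of Equation (10): the link-delay coefficient enters through log 2 (·) as stated, and the load-balancing coefficient is subtracted since it is to be minimized; the exact combination and sign convention are assumptions. The defaults β 1 = 10 and β 2 = 1 follow the settings reported in the experiments section.

```python
import math

def reward(phi_t, rho_t, beta1=10.0, beta2=1.0):
    """Eq. (10) sketch: scale phi_t with log2, penalize rho_t."""
    return beta1 * math.log2(phi_t) - beta2 * rho_t
```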

AQMDRL Algorithm
In this section, we describe in detail the AQMDRL algorithm proposed in this paper and the working process of the AQMDRL agent.

Deep Deterministic Policy Gradients
DDPG is an off-policy deep reinforcement learning algorithm. It is essentially an actor-critic framework that combines the deterministic policy gradient with the value-based DQN. It constructs a deterministic policy that maximizes the Q-value using gradient ascent.
The DDPG algorithm contains an experience pool and four neural networks, as depicted in Figure 1. Their functions are described in detail as follows:
1. Experience pool. The experience pool stores the tuples (s t , a t , r t , s t+1 ) generated during training, and the target networks randomly select mini-batch samples from it to train the model.
2. Actor online network. This network generates action a t according to state s t , noise N and policy function µ θ (·), as shown in Equation (11), where θ represents the actor online network parameters, and interacts with the environment to obtain the new state s t+1 and reward r t .
3. Actor target network. This network generates the optimal action â t for the critic target network according to the policy function µ θ' (·) and the state s t sampled from the experience pool, as shown in Equation (12), where θ' represents the actor target network parameters. The parameter θ' is periodically copied from θ.
4. Critic online network. This network generates the online Q-value Q ω (s t , a t ) according to the online Q-value function Q ω (·), where ω represents the critic online network parameters. The policy gradient function uses the online Q-value to update the actor online network parameter θ, as shown in Equation (13), where N represents the number of mini-batch samples.
5. Critic target network. This network calculates the target Q-value y i according to the target Q-value function Q ω' (·), as shown in Equation (14), where ω' represents the critic target network parameters and γ ∈ [0, 1] is the reward-discount factor. State s t and reward r t are obtained from the experience pool, and the network parameter ω' is periodically copied from ω. The loss function uses the target Q-value to update the critic online network parameter ω, as shown in Equation (15), where N represents the number of mini-batch samples. The parameters θ' and ω' of the target networks are updated using the soft update method, as shown in Equations (16) and (17), where the update rate τ is a hyperparameter.
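The one-step target of Equation (14) and the soft updates of Equations (16) and (17) can be sketched as follows; parameters are shown as flat lists of floats purely for illustration, and the default values of γ and τ are conventional assumptions.

```python
def target_q(reward_t, next_q, gamma=0.99):
    """Eq. (14): y_i = r_t + gamma * Q'(s_{t+1}, a_hat_{t+1})."""
    return reward_t + gamma * next_q

def soft_update(target_params, online_params, tau=0.005):
    """Eqs. (16)-(17): theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```

With a small τ, the target networks track the online networks slowly, which is what stabilizes the bootstrapped target values.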

Multistep DDPG
Although the DDPG algorithm can learn more efficiently from subsequent actions, it blindly selects the action with the largest Q-value, causing DDPG to suffer from an overestimation problem. Because the critic network of DDPG employs the DQN algorithm, the DDPG algorithm obtains the target Q-value y i from the current reward r i and the target Q-value Q ω (s t+1 , â t+1 ) at each step. However, Q ω (s t+1 , â t+1 ), calculated by the critic target network, may itself contain forward or reverse errors. In the update mode of the DQN algorithm, the neural network accumulates positive errors; therefore, when the DDPG algorithm updates its parameters, Q ω (s t+1 , â t+1 ) may be overestimated. Similarly, if Q ω (s t+1 , â t+1 ) is used as the update target for the online Q-value Q ω (s t , a t ) of the previous step, Q ω (s t , a t ) will also be overestimated, and such errors gradually accumulate. The larger the action space, the more serious the overestimation problem of the DDPG algorithm. The inaccurate estimation of the target Q-value may make the learned strategy deviate strongly from the optimal direction, resulting in poor convergence performance.
The TD3 algorithm proposed in [30] uses the clipped double Q-learning algorithm, delayed policy update and target policy smoothing in the value network to solve the overestimation problem of the DDPG algorithm. However, it may lead to significant underestimation bias and affect the convergence performance when using the minimum approach for value interception.
This paper uses the multistep DDPG (MDDPG) method to alleviate the overestimation and underestimation of the target Q-value function of DDPG. In the MDDPG, the corresponding optimality equation for the target Q-value can be defined as the sum of the n-step discounted accumulated reward and target Q-value function Q ω (·) through the n-step forward operation, as shown in Equation (18), where n represents the number of steps, and M is the size of the experience pool.
As analyzed above, the main cause of the overestimation of DDPG is the Q-value function Q ω (·). In MDDPG, we use the maximum value of the n-step actions currently estimated by the neural network instead of the one-step Q-value function Q ω (·), as this reduces the possibility of positive error generated by Q ω (·). In addition, we use the n-step discounted accumulated reward instead of the current reward r i to alleviate the underestimation problem.
According to the above target Q-value in Equation (18), we update the loss function of DDPG (Equation (15)), as shown in Equation (19):
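Under the assumption that Equation (18) sums n discounted rewards and bootstraps from the maximum over the n-step target Q-value estimates, the multistep target can be sketched as:

```python
def multistep_target_q(rewards, q_estimates, gamma=0.99):
    """Eq. (18) sketch: n-step discounted return plus the discounted
    maximum over the n-step target Q-value estimates. Taking the max
    over n estimates (rather than a single one-step bootstrap) is the
    paper's mitigation for over/underestimation; the exact indexing
    here is an assumption."""
    n = len(rewards)
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    return n_step_return + gamma ** n * max(q_estimates)
```

For n = 1 this reduces to the ordinary one-step DDPG target of Equation (14).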

Prioritized Experience Replay Mechanism Based on SumTree
Convergence rate is one of the essential indicators of reinforcement learning performance. The DDPG algorithm implements its experience replay mechanism through uniform random sampling, so every sample in the experience pool has the same sampling probability. This leads to low utilization of valuable samples and reduces the convergence rate of the algorithm.
In order to improve the convergence rate of the MDDPG algorithm, we use a prioritized mechanism based on the SumTree structure to optimize the experience pool sampling of MDDPG. The SumTree structure contains leaf nodes, parent nodes and a root node. We calculate the corresponding priority p i for each tuple in the experience pool and store it in a leaf node. Each parent node stores the sum of the values of its two child nodes, while the root node holds the sum of all leaf node values. The RL agent uniformly samples a value in the interval bounded by the root node value and extracts data from the experience pool through the SumTree data extraction method, as shown in Algorithm 1. In the SumTree architecture, the higher the priority p i , the greater the probability of being sampled, thus improving the utilization of significant samples in the experience pool.
In the DDPG algorithm, the TD error, i.e., the difference between the target Q-value and the current Q-value estimate, is a critical indicator for judging the importance of samples in the experience pool. The larger the TD error of a sample, the less the sample has been exploited and the greater its potential to improve the performance of the neural network. Therefore, we select the TD error as the sample priority value p i , as shown in Equation (20), where y (n) i represents the n-step target Q-value.
Algorithm 1: SumTree-based data extraction algorithm
1: Uniformly sample a value p in the interval bounded by the root node value
2: Take the root node as the parent node and traverse its child nodes
3: repeat
4:   Define p L as the value stored on the left child node;
5:   if p L > p then
6:     Take the left child node as the parent node and traverse its child nodes;
7:   else
8:     p = p − p L ;
9:     Take the right child node as the parent node and traverse its child nodes;
10:  end if
11: until the selected child node is a leaf node;
12: Extract the tuple corresponding to the selected leaf node from the experience pool.
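Algorithm 1's descent can be realized with a minimal array-backed SumTree. The class below is an illustrative sketch, not the paper's implementation.

```python
class SumTree:
    """Binary tree stored in a flat array: leaves hold sample priorities,
    each parent holds the sum of its children, the root the total mass."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes + leaves

    def update(self, leaf, priority):
        """Set a leaf's priority and propagate the change up to the root."""
        i = leaf + self.capacity - 1
        delta = priority - self.tree[i]
        while i >= 0:
            self.tree[i] += delta
            i = (i - 1) // 2 if i > 0 else -1

    def retrieve(self, p):
        """Algorithm 1: go left if p fits in the left subtree, otherwise
        subtract the left sum and go right, until a leaf is reached."""
        i = 0
        while 2 * i + 1 < len(self.tree):
            left = 2 * i + 1
            if self.tree[left] >= p:
                i = left
            else:
                p -= self.tree[left]
                i = left + 1
        return i - (self.capacity - 1)  # leaf index in the experience pool
```

Sampling then amounts to drawing p uniformly from [0, root value] and calling retrieve(p), so higher-priority leaves are hit proportionally more often.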
The sampling probability P(i) of each sample and the importance sampling weight w i are shown in Equations (21) and (22), respectively, where N is the number of samples. We then add the importance sampling weights to the policy gradient function (Equation (13)) and the loss function (Equation (19)), yielding Equations (23) and (24), respectively.
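The standard prioritized-replay forms for Equations (21) and (22) can be sketched as below; the prioritization exponent α and correction exponent β are conventional hyperparameters assumed here, since the text does not spell them out.

```python
def sampling_probabilities(priorities, alpha=0.6):
    """Eq. (21) sketch: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weight(prob, num_samples, beta=0.4):
    """Eq. (22) sketch: w_i = (N * P(i))^(-beta)."""
    return (num_samples * prob) ** (-beta)
```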

Working Process of the AQMDRL Agent
The working process of the AQMDRL agent at each iteration is described in Algorithm 2 below.

Algorithm 2: Working process of the AQMDRL agent
Input: reward discount factor γ, target update rate τ, number of mini-batch samples N, number of episodes M.
Initialize the parameters θ and ω;
Copy the parameters θ' ← θ and ω' ← ω;
Initialize the replay buffer and set the priority of all leaf nodes in SumTree to 0;
Initialize the action-exploration random noise N;
Initialize the state s 1 collected from the SDN controller;
1: for t = 1 to M do
2:   Select the action weight a t = µ(s t ) + N according to the state s t in the actor online network;
3:   Deploy a t on the SDN controller;
4:   Obtain the new state s t+1 from the SDN controller;
5:   Calculate the reward r t using Equation (10);
6:   Store the tuple (s t , a t , r t , s t+1 ) into the replay buffer;
7:   Calculate the sample priority value p i ;
8:   Update all node values of the SumTree;
9:   Extract n-step samples from the replay buffer through Algorithm 1;
10:  Calculate the importance-sampling weight using Equation (22);
11:  Calculate the target Q-value using Equation (18);
12:  Use the loss function (Equation (24)) to update the parameter ω;
13:  Use the policy gradient function (Equation (23)) to update the parameter θ;
14:  Update the target network parameters using Equations (16) and (17);
15: end for

Experiments and Analysis
This section presents the evaluation of AQMDRL. We first describe the experimental setup and parameter settings, then compare algorithm performance, and finally present the ablation experiment.

Experimental Setup
In the experiments, we used Ubuntu 18.04 as the operating system, with the Mininet emulation platform and the RYU controller as the simulation environment. The computer was equipped with an Intel Core i7-7700HQ processor with a base frequency of 2.80 GHz, a 1 TB HDD and 16 GB of RAM. The network topology was based on NSFNET, a backbone network built by the National Science Foundation [39]. The network contained 14 switches and 21 links, all with a link bandwidth of 10 Mbps, as shown in Figure 2, where each switch was connected to a host for receiving and sending data. The reinforcement learning algorithms were implemented in SDN using the PyTorch deep learning framework.

Parameter Setting
The purpose of adjusting the parameters is for the model to obtain matching network scenarios and improve the model's convergence performance. Consequently, our experiments determine the model's hyper-parameters by comparing the effects of AQMDRL under various parameters. In the following experiments, to effectively reduce the link delay of the network, we set parameters β 1 and β 2 of the reward function as 10 and 1, respectively.
As a critical parameter in deep learning, the learning rate determines whether and when the objective function can converge to a local minimum. We observe the effect of the AQMDRL scheme by altering the learning rate and determine the optimal value for AQMDRL applied to the network by convergence speed and stability. In this experiment, we found that different traffic intensities have different optimal actor learning rates. We set the actor learning rate to 0.0001, 0.0003, 0.001, 0.01, and 0.1, and tested the performance at every training step. The training results are shown in Figure 3. Figure 3 shows that the optimal actor learning rate varies with traffic intensity. Under high traffic intensities (100% and 75%), the reward curve is best when the actor learning rate is 0.0001, as shown in Figure 3a,b. At low traffic intensities (50% and 25%), the reward curve is best when the actor learning rate is 0.1, as shown in Figure 3c,d.
In a high-traffic-intensity environment, the links easily reach saturation, so the routing algorithm needs to adjust the link weight values more frequently, and the reward is more stable after convergence at a lower actor learning rate. In a low-traffic-intensity environment, the remaining bandwidth of the network links is sufficient, and the agent can speed up exploration of the optimal reward value through a higher learning rate. Based on the above results, the actor learning rate of AQMDRL is set to 0.1, 0.1, 0.0001, and 0.0001 when the traffic intensity is 25%, 50%, 75%, and 100%, respectively.
In addition, the step number n, the critic learning rate and the size of the replay buffer are also crucial for the experiments. Since the optimal values of these three parameters are consistent under the four traffic intensities, we show our tuning results in an environment with 100% traffic intensity. The results are presented in Figure 4.
In Figure 4a, although the 8-step option reaches convergence first, its convergence value is lower and its computational cost is higher than the 4-step option; therefore, we choose 4 as the number of steps. In Figure 4b, the reward curve is best when the critic learning rate is 0.0003: when the learning rate is too small or too large, the learning of AQMDRL is relatively unstable, which prevents it from obtaining the optimal policy in a short time. In Figure 4c, the reward curve is best when the replay buffer size is 50.
Figure 3. Effect of learning parameters: (a) the traffic intensity is 100%; (b) the traffic intensity is 75%; (c) the traffic intensity is 50%; (d) the traffic intensity is 25%.
The hyperparameters set in the experiment are shown in Table 1.

Algorithm Performance Comparison
In this section, we compare the network performance of AQMDRL with the traditional routing algorithm OSPF [2] and the DRL algorithms DDPG [25] and TD3 [30]. Evaluation metrics are convergence performance, link latency, load balancing and throughput.
First, we test the convergence performance of the three DRL algorithms (AQMDRL, TD3, and DDPG) under different traffic intensities, as shown in Figure 5. We trained 400 episodes for each traffic intensity and tested the model's performance in the training environment at every training step.
The reward value is used to measure the outcome of an RL action; the larger the reward value, the better the network performance. As demonstrated in Figure 5, our AQMDRL algorithm has the fastest convergence rate and the most stable convergence performance compared to the other algorithms. DDPG performs poorly due to the overestimation problem caused by blindly selecting the action with the largest Q-value: its convergence rate is slow, and large fluctuations remain after convergence in the high-traffic-intensity scenario. TD3 takes the minimum of the two Q-values generated by its target networks, effectively solving the overestimation problem of DDPG. In the highest and lowest traffic intensity scenarios, TD3 converges faster and has a more stable convergence performance than DDPG, as shown in Figure 5a,c,d; however, it does not consider the problem of underestimation. Therefore, its curve jumps more significantly under average traffic intensity, and its convergence performance is lower than that of DDPG, as shown in Figure 5b. The proposed AQMDRL uses the multistep method to mitigate the overestimation and underestimation of DDPG, effectively improving the stability of convergence in all traffic intensity scenarios. In addition, the SumTree-based priority caching mechanism improves the convergence speed of DDPG. It is worth noting that when the traffic changes more drastically, the agent must re-explore the learning operation, so the reward will jump. AQMDRL shows occasional jumps but soon returns to convergence, while the other algorithms take longer to restore the convergence state. The experimental results show that AQMDRL outperforms DDPG and TD3 in all traffic intensity scenarios.
Link delay and load balancing are significant for the quality of service in the network. In order to explore the performance of AQMDRL, TD3, DDPG and OSPF, we further compare their link delay and load balancing differences.
First, we focus on the link delay coefficient, that is, the sum of the reciprocal delays of each link, as shown in Equation (6), to compare how the various algorithms optimize link delay in SDN. The higher the link delay coefficient, the lower the link delay and the faster the transmission. Table 2 shows the comparison of each episode's average link delay coefficient at different traffic intensities. Meanwhile, we also compare the average load-balancing coefficient of each episode at different traffic intensities. The load-balancing coefficient is the sum of the standard deviations of all link bandwidths, which is used to examine the link load-balancing degree of the network, as shown in Equation (7). The lower the load-balancing coefficient, the better the link load balancing and the higher the network link utilization. The results are shown in Table 3.
As shown in Tables 2 and 3, the average link-delay coefficient and average load-balancing coefficient of AQMDRL are optimal in most traffic intensity scenarios. When the traffic intensity is 25%, the average link-delay coefficient of AQMDRL is slightly higher than those of TD3 and DDPG, while the average load-balancing coefficient is 15% lower than TD3's and 7% lower than DDPG's. When the traffic intensity is 50%, AQMDRL performs best, with the average link-delay coefficient raised by 9% relative to TD3 and 6% relative to DDPG; the average load-balancing coefficient is 16% lower than TD3's and 27% lower than DDPG's. When the traffic intensity is 75%, the average load-balancing coefficient of AQMDRL is worse than those of TD3 and DDPG, but its average link-delay coefficient is the highest. When the traffic intensity is 100%, the average link-delay coefficient of AQMDRL is slightly lower than that of TD3 but 5% higher than that of DDPG, and the average load-balancing coefficient of AQMDRL is 24% lower than TD3's and 19% lower than DDPG's.
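The two coefficients just described can be computed directly from per-link measurements. The sketch below is our illustrative reading, with assumed function and variable names; Equations (6) and (7) in the paper give the exact forms:

```python
import statistics

def link_delay_coefficient(link_delays):
    """Sum of the reciprocal delays over all links (cf. Equation (6)):
    a larger value means lower overall delay and faster transmission."""
    return sum(1.0 / d for d in link_delays)

def load_balancing_coefficient(link_bandwidths):
    """Dispersion of the used link bandwidths (cf. Equation (7)), read
    here as a population standard deviation: a lower value means the
    load is spread more evenly across links."""
    return statistics.pstdev(link_bandwidths)
```

Note the opposite polarities, which the tables above rely on: the delay coefficient is better when higher, while the load-balancing coefficient is better when lower.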
Although the average load-balancing coefficient of OSPF is close to those of the three RL algorithms, its average link-delay coefficient is more than 10% lower than that of the RL algorithms. Our results indicate that AQMDRL is superior to the other reinforcement learning algorithms and to OSPF. The reason is that AQMDRL improves the convergence rate and stability of the algorithm through the prioritized experience pool mechanism and the multistep method, resulting in a short trial-and-error time during early exploration. When there are significant changes in traffic, the agent needs to re-explore the learning operation, and AQMDRL can reach convergence faster.
Additionally, when the traffic intensity is 25-75%, the confidence intervals of AQMDRL's average link-delay and load-balancing coefficients are close to those of TD3 and lower than those of DDPG. When the traffic intensity is 100%, the confidence intervals of AQMDRL are lower than those of TD3 and DDPG. The results show that AQMDRL's collected values are more concentrated and that it has better convergence performance than DDPG and TD3.
Next, we compare the average throughput of all hosts at different traffic intensities, as shown in Table 4. Throughput is an important indicator of network performance. The results show that AQMDRL is optimal in all traffic intensity scenarios. In the low-traffic-intensity scenarios, the network traffic does not reach the network forwarding bottleneck and the network links are less congested; therefore, the throughput of all algorithms differs little, and AQMDRL is slightly higher than the other algorithms. However, when the network traffic scale exceeds the network forwarding bottleneck, the throughput of AQMDRL is more than 5% higher than that of the other three algorithms due to its convergence rate and stability.
Finally, we analyze the calculation cost of the AQMDRL, TD3 and DDPG algorithms. In terms of neural network calculation cost, both AQMDRL and DDPG use one online network and one target network to train a mini-batch of samples, whereas TD3 requires two online networks and two target networks. Therefore, the calculation cost of TD3 is higher than that of AQMDRL and DDPG. In addition, compared with DDPG and TD3, AQMDRL adds the O(log N) calculation cost of SumTree-based priority sampling, where N is the size of the replay buffer. In summary, the calculation cost of AQMDRL is higher than that of DDPG and lower than that of TD3, while AQMDRL has a better convergence rate and performance than both.
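The O(log N) cost mentioned above comes from the SumTree's root-to-leaf paths: updating one priority refreshes only the ancestors of one leaf, and sampling descends a single path from the root. The sketch below is a generic textbook structure (assuming a power-of-two capacity), not the authors' code:

```python
class SumTree:
    """Generic SumTree sketch (capacity assumed to be a power of two);
    illustrates the O(log N) cost of prioritized sampling."""

    def __init__(self, capacity):
        self.capacity = capacity            # number of leaves, N
        self.tree = [0.0] * (2 * capacity)  # tree[1] is the root; leaves at [N, 2N)

    def update(self, index, priority):
        """Set leaf `index` to `priority`, refreshing every ancestor:
        one leaf-to-root path, so O(log N)."""
        i = index + self.capacity
        change = priority - self.tree[i]
        while i >= 1:
            self.tree[i] += change
            i //= 2

    def sample(self, value):
        """Return the leaf whose cumulative-priority interval contains
        `value` in [0, total): one root-to-leaf descent, so O(log N)."""
        i = 1
        while i < self.capacity:            # stop once i points at a leaf
            left = 2 * i
            if value < self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity

    @property
    def total(self):
        return self.tree[1]                 # root holds the sum of all priorities
```

Drawing `value` uniformly from [0, total) then selects each transition with probability proportional to its priority, which is how the prioritized replay buffer biases sampling toward high-error transitions.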

Ablation Experiment
In this section, we perform ablation experiments on the proposed algorithm to verify the effectiveness of each module, namely the multistep module and the SumTree prioritized experience replay module. We compare the AQMDRL (DDPG + multistep + SumTree) algorithm with the DDPG, SumTree-DDPG (DDPG + SumTree) and MDDPG (DDPG + multistep) algorithms in an environment with 100% traffic intensity. The results of the ablation experiments are shown in Figure 6.

Figure 6 shows that each module of AQMDRL is crucial. SumTree-DDPG eliminates the multistep module of AQMDRL, and the convergence performance of the agent worsens. The MDDPG algorithm eliminates the SumTree-based prioritized experience replay of AQMDRL; although it mitigates the overestimation problem of DDPG, its convergence rate is slow. The performance of DDPG is the worst among all algorithms, with a slow convergence rate in the early stage and more jumps in the late stage, because DDPG blindly selects the action with the largest Q-value, which gives the algorithm itself an overestimation problem. The experimental results show that our proposed AQMDRL algorithm uses a multistep method to mitigate the overestimation and underestimation of DDPG and improve the stability of agent convergence, while the added SumTree-based priority caching mechanism improves the convergence rate of DDPG.

Figure 6. Ablation experiment results.

Conclusions
This paper proposes AQMDRL, an automatic QoS architecture based on multistep DRL, which aims to solve the QoS problems encountered in SDN. The algorithm uses a multistep approach to solve the overestimation problem of the DDPG algorithm, which effectively improves its stability. In addition, AQMDRL adopts a SumTree-based prioritized experience replay mechanism to optimize the convergence rate of the multistep DDPG algorithm. We design a DRL-based QoS optimization model to evaluate the SDN network quality of service and conduct experiments to evaluate the performance of AQMDRL. The experiments show that AQMDRL significantly improves the load-balancing performance of SDN and effectively reduces the network transmission delay compared with existing DRL algorithms.
In future work, we will consider improving the reinforcement learning neural network to better capture the characteristics of the network state. Additionally, we will further expand the network scale and explore more methods, such as multi-agent, multi-objective, or multi-strategy reinforcement learning, to improve the convergence performance of DRL algorithms in large-scale networks and optimize the network QoS more effectively.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest: The authors declare that they have no conflicts of interest.