Dynamic Service Function Chain Deployment and Readjustment Method Based on Deep Reinforcement Learning

With the advent of Software Defined Networking (SDN) and Network Functions Virtualization (NFV), network operators can offer Service Function Chains (SFCs) flexibly to accommodate the diverse network function (NF) requirements of their users. However, efficiently deploying SFCs on the underlying network in response to dynamic SFC requests poses significant challenges. This paper proposes MQDR, a dynamic SFC deployment and readjustment method based on a deep Q network (DQN) and the M Shortest Path Algorithm (MSPA), to address this problem. We model the dynamic deployment and readjustment of SFCs on an NFV/SFC network with the objective of maximizing the request acceptance rate. We transform the problem into a Markov Decision Process (MDP) and apply Reinforcement Learning (RL) to achieve this goal. In MQDR, two agents collaboratively deploy and readjust SFCs dynamically to enhance the service request acceptance rate. We reduce the action space for dynamic deployment by applying MSPA and decrease the action space for readjustment from two dimensions to one. By reducing the action space, we decrease the training difficulty and improve the actual training effect of our proposed algorithm. Simulation experiments show that MQDR improves the request acceptance rate by approximately 25% compared with the original DQN algorithm and by 9.3% compared with the Load Balancing Shortest Path (LBSP) algorithm.


Introduction
Network operators often face significant challenges in delivering services efficiently. Traditional network architectures rely heavily on specialized hardware devices known as middleboxes to provide various network functions (NFs). However, this approach results in high capital and operating expenses. To address this issue, Network Function Virtualization (NFV) unbundles specific network functions (e.g., firewalls, Deep Packet Inspection) from dedicated hardware and allows them to run on general-purpose devices (e.g., x86 servers) [1]. Moreover, Software Defined Networking (SDN) enables flexible deployment and traffic scheduling for NFV by decoupling the control plane from the data plane [2]. A Service Function Chain (SFC) consists of a specific sequence of Virtual Network Functions (VNFs), each of which can be deployed on a virtual machine in the underlying network and flexibly scaled or migrated to other servers [3,4].
In domains such as 5G networks, cloud computing, and data centers, various services are needed to meet user demands. Therefore, it is worthwhile to study how to quickly and flexibly deploy SFCs on the underlying network according to users' needs [5]. This deployment process requires optimal utilization of computing resources while satisfying latency and bandwidth constraints and arranging the deployment locations of SFCs/VNFs. The main contributions of this paper are as follows:
1. We propose MQDR, a DRL-based framework for the dynamic deployment and readjustment of SFCs. The framework employs two trained agents, A* and B*: A* selects VNF deployment locations and B* readjusts the underlying network, so that dynamically arriving SFC requests can be accepted successfully. To the best of our knowledge, this is the first work to use these two agents together to address the dynamic SFC deployment problem.
2. We impose restrictions on the action space to simplify the training of the agents and enhance the deployment performance. Instead of allowing the agents to choose deployment locations among all nodes, MQDR incorporates the M Shortest Path Algorithm (MSPA) to reduce the range of actions available to the agents.
3. Finally, we compare MQDR with other methods and find that MQDR improves the request acceptance rate by approximately 25% compared with the original DQN algorithm and by 9.3% compared with the Load Balancing Shortest Path (LBSP) algorithm. Therefore, the proposed method is a feasible solution to the dynamic SFC deployment problem.
The remainder of this paper is organized as follows: Section 2 provides a summary of previous related work conducted by other researchers; Section 3 presents the network architecture, the system model, and the formulation description of the research problem; Section 4 describes our MQDR algorithm in detail; Section 5 shows simulation experiments and performance evaluation; and Section 6 concludes and discusses future work.

Related Works
Traditional network function equipment leads to long network service update cycles and high operating expenses. In response to the practical need to solve these problems, SDN and NFV have become research focuses in recent years [27][28][29]. On the basis of SDN/NFV technologies, an SFC defines a sequence of ordered VNFs. In earlier studies, the SFC deployment problem was usually modelled as an optimization problem and solved by integer linear programming (ILP) or its variants. Zhong et al. [7] proposed an ILP model to minimize the cost of deploying SFCs among data centers. Bari et al. [6] considered minimizing operational costs while maximizing resource utilization and proposed a corresponding ILP model. Savi et al. [8] considered the impact of network function location on deployment costs and proposed an ILP model. Addis et al. [9] developed an MILP model to minimize the CPU resources used to instantiate VNFs. However, most of these studies focused on the static deployment of SFCs and did not consider that SFC requests arrive continuously.
Compared with ILP, heuristic algorithms can reduce the computation time and obtain suboptimal solutions. Rankothge et al. [10] proposed an SFC resource allocation framework using genetic algorithms; ILP takes several hours to compute an SFC deployment in a network of 16 servers, while their heuristic algorithm takes only a few milliseconds. Jin et al. [11] proposed a depth-first search algorithm to select paths and a path-based greedy algorithm to allocate VNFs when applying the MILP problem to a large network scenario. Wu et al. [12] proposed a heuristic algorithm to jointly optimize end-to-end latency, resource consumption, and network load balancing on the basis of SRv6. Although heuristic algorithms can deploy SFCs sequentially, they may lack flexibility and are unsuitable for the dynamic deployment of SFCs. Additionally, these algorithms tend to fall quickly into local optima, which limits their effectiveness.
Machine Learning (ML) has emerged as a powerful tool in various domains in recent years [30][31][32][33]. Some studies have leveraged it to tackle the challenges of the SFC deployment problem. Tang et al. [14] solved the node overload problem by predicting the resources required for SFCs with a deep-belief-network-based prediction algorithm; for nodes that are predicted to be overloaded, each VNF on them is migrated to an underlying node that satisfies the resource threshold constraint by a greedy algorithm based on merit selection. Subramanya et al. [15] used a neural network to predict traffic, followed by deployment with heuristic algorithms. Among the available studies, deep learning was used mainly for traffic prediction, aiding other deployment methods.
In Reinforcement Learning (RL), the agent strives to learn strategies that maximize returns through interaction with the environment. RL has been applied to various research areas, including complex networking and communications problems [17]. Several studies have recently applied RL to the SFC deployment problem. Sun et al. [18] proposed a method that combines graph neural networks with RL for SFC deployment. Li et al. [19] proposed an adaptive SFC deployment method that chooses between two heuristic algorithms using DQN; however, choosing between two heuristic algorithms leads to a limited range of actions. Wang et al. [20] utilized RL to determine to which data center the SFCs and standby SFCs are deployed and proposed five backup-level schemes to improve fault tolerance; however, this study considered only where SFCs are deployed and not where VNFs are deployed. Gu et al. [21] proposed an intelligent VNF orchestration and flow scheduling model via Deep Deterministic Policy Gradient (DDPG), aided by a heuristic algorithm to accelerate the training progress; however, that model assumes that the underlying resources are unlimited and that traffic requests can be predicted in advance, which is inconsistent with reality. Pei et al. [22] formulated a Binary Integer Programming (BIP) model and obtained the placement of VNFs with a Double Deep Q-network (DDQN); when resources were insufficient, they considered horizontal scaling, which initiates new instances. However, this study also assumed that traffic can be predicted in advance. Qiu et al. [23] proposed a DQN-based online SFC deployment method that minimizes the total resource consumption overhead while satisfying the request delay constraint and improves the request acceptance rate of the operator's network. However, this algorithm does not limit the action space of the DQN, which makes it challenging to train effectively in practice.
There is already some research on the readjustment of SFCs. Fu et al. [24] applied DQN to adapt SFCs to changing traffic in IoT networks and proposed decomposing VNFs into components, but did not elaborate on the implementation details. Tang et al. [25] proposed a traffic prediction method and designed two dynamic VNF instance scaling algorithms. Liu et al. [26] proposed a Dyna-Q-based SFC readjustment algorithm to balance the load of the underlying network and demonstrated its advantages over the baseline algorithm in terms of overhead and running time. However, these studies focused mainly on adjusting existing SFCs to adapt to traffic changes and did not address how to accommodate new SFC requests.
Compared with existing studies, we constrain the action space when applying DRL and readjust the network for newly arrived SFC requests. These design choices pay off in an improved request acceptance rate.

System Model and Problem Description
This section provides a detailed description of the VNF/SFC-based network model, followed by a formulation of the dynamic deployment and readjustment problems of SFCs. The problem is then defined as a Markov decision process (MDP).

Network Model
The network architecture considered in this study is a three-layer NFV architecture [34,35], as depicted in Figure 1 (the Notations used in the model are shown in Table 1). The physical layer consists of physical nodes (servers) connected by physical links in the underlying network. In the VNF layer, VNFs are instantiated on physical nodes that provide the required resources. In the application layer, different NFs are deployed on the corresponding type of VNFs to form SFCs.

The underlying network is modeled as a connected undirected graph, denoted as G = (V^S, E^S), where V^S is the set of nodes and E^S is the set of links. Each underlying node v^s_i in the network has a CPU resource capacity denoted as C^S_i. The physical link between node v^s_i and node v^s_j is represented by e^s_{ij} ∈ E^S, with a bandwidth capacity of BW^S_{ij} and a latency of L^S_{ij}.

SFC and VNF
The set of VNF types is denoted as F, where each VNF has its own type f. coef^c_f represents the CPU resource factor, indicating the amount of CPU resources consumed per unit of traffic passing through a VNF of type f. θ_f represents the traffic scaling factor, indicating the multiple by which traffic is scaled after passing through a VNF of type f. The set of SFC requests is denoted by R. Each request r_j ∈ R specifies an ingress node v^s_{j,in} and an egress node v^s_{j,out}. VNF_j denotes the ordered set of NF requests, i.e., the set of VNF types through which r_j passes in order. The CPU resource requirement of the u-th NF v^{nf}_{j,u} ∈ VNF_j is c^{nf}_{j,u}. The traffic segment of r_j between consecutive NFs is denoted as the logical link e^{nf}_{j,uv} ∈ E^{NF}_j, where bw^{nf}_{j,uv} represents its bandwidth requirement, and l_j represents the maximum end-to-end delay allowed for the request. We assume that the traffic requested by the SFC does not change after arrival, and [z^s_j, z^e_j] represents the service period of r_j.
For each SFC request r_j, we use a Boolean variable p_j to show whether it has been successfully deployed (the notations of the decision variables are shown in Table 2). The variable t^f_{j,u} ∈ {0, 1} indicates whether the u-th NF of r_j is of type f. It is noteworthy that the ingress and egress nodes do not deploy VNFs, i.e., t^f_{j,u} equals 0 for them.
The decision variable η^{i,t}_{j,u} ∈ {0, 1} is introduced to show whether the u-th virtual node v^{nf}_{j,u} ∈ VNF_j of r_j is deployed on physical node v^s_i ∈ V^S during time slot t. We partition time into discrete time slots, with each slot representing a specific temporal interval; the variable t denotes one of these slots.
The decision variable τ^{pq}_{j,uv} ∈ {0, 1} is introduced to show whether the traffic segment e^{nf}_{j,uv} ∈ E^{NF}_j of r_j flows through the physical link e^s_{pq} ∈ E^S during time slot t. If an NF of type f maps to v^s_i ∈ V^S, then v^s_i must also instantiate a VNF of type f. The CPU resource requirement of v^{nf}_{j,v} ∈ VNF_j is determined by the bandwidth of its incoming traffic:

$$c^{nf}_{j,v} = coef^{c}_{f} \cdot bw^{nf}_{j,uv}.$$

The traffic may be scaled as it passes through each VNF, which changes the bandwidth required on the subsequent logical link:

$$bw^{nf}_{j,vw} = \theta_{f} \cdot bw^{nf}_{j,uv}.$$

We define the readjustment of an SFC as an NF migration process. For instance, the migration of the u-th NF of r_j from v^s_i to v^s_{i'} can be expressed as

$$\eta^{i,t}_{j,u}: 1 \rightarrow 0, \qquad \eta^{i',t}_{j,u}: 0 \rightarrow 1.$$

Simultaneously, the mapping between the logical links and the physical links changes. Let E^s_1 be the set of physical links on the original deployment route and E^s_2 be the set of physical links on the post-migration route. This change can be expressed as

$$\tau^{pq}_{j,uv}: 1 \rightarrow 0, \;\; \forall e^s_{pq} \in E^s_1, \qquad \tau^{pq}_{j,uv}: 0 \rightarrow 1, \;\; \forall e^s_{pq} \in E^s_2.$$

The above variables satisfy the following constraints. Each VNF of the SFC must be deployed on a single physical node, and each virtual link must be mapped onto a single physical link:

$$\sum_{v^s_i \in V^S} \eta^{i,t}_{j,u} = p_j, \quad \forall r_j \in R, \; \forall v^{nf}_{j,u} \in VNF_j.$$

The CPU resources consumed on each node must not exceed its capacity:

$$\sum_{r_j \in R} \sum_{v^{nf}_{j,u} \in VNF_j} \eta^{i,t}_{j,u} \cdot c^{nf}_{j,u} \leq C^S_i, \quad \forall v^s_i \in V^S.$$

The bandwidth used on each underlying link must not exceed its capacity:

$$\sum_{r_j \in R} \sum_{e^{nf}_{j,uv} \in E^{NF}_j} \tau^{pq}_{j,uv} \cdot bw^{nf}_{j,uv} \leq BW^S_{pq}, \quad \forall e^s_{pq} \in E^S.$$

Each SFC must satisfy its end-to-end latency constraint:

$$\sum_{e^{nf}_{j,uv} \in E^{NF}_j} \sum_{e^s_{pq} \in E^S} \tau^{pq}_{j,uv} \cdot L^S_{pq} \leq l_j, \quad \forall r_j \in R.$$

Every connected pair of VNFs must satisfy the principle of traffic conservation, where Ω^+(v^s_p) and Ω^-(v^s_p) represent the sets of upstream and downstream neighbors of v^s_p, respectively:

$$\sum_{v^s_q \in \Omega^{+}(v^s_p)} \tau^{qp}_{j,uv} - \sum_{v^s_q \in \Omega^{-}(v^s_p)} \tau^{pq}_{j,uv} = \eta^{p,t}_{j,v} - \eta^{p,t}_{j,u}, \quad \forall v^s_p \in V^S.$$

Our objective is to deploy as many SFCs as possible, which can be represented by maximizing the SFC request acceptance rate:

$$\max \; \frac{1}{|R|} \sum_{r_j \in R} p_j.$$

Markov Decision Process (MDP)
The dynamic deployment and readjustment of SFC is a complex and multi-faceted process that can be approached using RL, which is based on the MDP. RL is employed to identify the optimal policy, which is a mapping of states to actions aimed at maximizing the final cumulative return.
There are two main categories of methods in DRL: (1) value-based methods, such as the Deep Q Network (DQN), and (2) policy-based methods, such as Policy Gradient (PG). The differences between them are as follows [36]: (1) value-based methods are more appropriate for discrete, low-dimensional action spaces, while policy-based methods are better suited for continuous action spaces; (2) value-based methods output the value of each action, whereas policy-based methods output the best action directly; and (3) value-based methods update the agent after each step, whereas policy-based methods update it after each episode.
We then formally define dynamic SFC deployment and readjustment as a Markov decision process. Typically, an MDP is defined as a tuple ⟨S, A, P, R, γ⟩, where S represents the set of states, A represents the set of discrete actions, and P : S × A × S → [0, 1] represents the state transition probability function. R : S × A → ℝ represents the reward function. γ ∈ [0, 1] represents the discount factor that weights future rewards against the current one: the higher the discount factor, the more attentive the agent is to the impact of the current step on the future.
State: We define each state s_t ∈ S as a vector s_t = [C^t, BW^t, I^t], where C^t = [C^t_1, C^t_2, ..., C^t_{|V^S|}] represents the CPU resources of the underlying nodes and BW^t = [BW^t_1, BW^t_2, ..., BW^t_{|E^S|}] represents the available bandwidth resources of the underlying physical links. I^t = [r_j, v^l_j, l^t_j] represents the state of the currently deployed SFC: r_j is the currently arriving SFC request, v^l_j is the physical node on which the last NF was deployed, and l^t_j denotes the sum of the latencies of all paths used by the SFC so far.
In DRL, an agent selects an optimal action on the basis of the current state s_t. For deployment and readjustment, we define the actions chosen by the agents separately as follows. Deployment Action: Let k = 1, 2, ..., |V^S| be the indices of the nodes in the network. An action a ∈ A is an integer, where A = {0, 1, 2, ..., |V^S|}. If a = 0, no physical node is available for deployment; otherwise, a indicates the index of the node on which the NF is deployed.
Readjustment Action: Let k = 1, 2, ..., |V^S| be the indices of the nodes in the network. An action a ∈ A is an integer, where A = {0, 1, 2, ..., |V^S|}. If a = 0, no physical node needs to be adjusted; otherwise, a indicates the index of the node on which we carry out the readjustment.
Reward: Our objective is to maximize the service request acceptance rate. Therefore, the reward function is defined as

$$r_t = \begin{cases} 1, & \text{when the SFC is deployed successfully} \\ 0, & \text{otherwise} \end{cases} \quad (17)$$

State Transition: The state transition of the MDP is defined as (s_t, a_t, r_t, s_{t+1}), where s_t denotes the current state, a_t is the action taken by the agent in the current state, r_t is the reward received for taking a_t, and s_{t+1} is the resulting state.
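To make the state and reward concrete, the following minimal Python sketch assembles the state vector s_t = [C^t, BW^t, I^t] and computes the reward of Equation (17). The argument names and the encoding of the request features are illustrative assumptions, not the interface of the simulation platform used later in the paper.

```python
import numpy as np

def build_state(cpu_remaining, bw_remaining, request_features, last_node_idx, accumulated_delay):
    """Assemble s_t = [C^t, BW^t, I^t] as one flat vector.

    cpu_remaining:    length |V^S| array of node CPU resources (C^t)
    bw_remaining:     length |E^S| array of available link bandwidth (BW^t)
    request_features: fixed-length encoding of the arriving request r_j
    (names and the request encoding are illustrative assumptions)
    """
    info = np.concatenate([np.asarray(request_features, dtype=float),
                           [float(last_node_idx), float(accumulated_delay)]])  # I^t
    return np.concatenate([np.asarray(cpu_remaining, dtype=float),
                           np.asarray(bw_remaining, dtype=float),
                           info])

def reward(sfc_deployed: bool) -> float:
    """Equation (17): the agent is rewarded only when the whole SFC request is accepted."""
    return 1.0 if sfc_deployed else 0.0
```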
This section provides a detailed description of the VNF/SFC-based network model, which serves as the basis for formulating the dynamic problem of SFC deployment and readjustment. The problem is then framed as a Markov Decision Process, enabling the application of DRL to address it.

Algorithm Design
This section presents the framework of the DQN-based dynamic SFC deployment and readjustment algorithm, which uses trained RL agents to make decisions directly. Subsequently, a detailed description of the DQN-based dynamic SFC deployment and readjustment algorithm and its training process is provided.

DQN-Based Dynamic SFC Deployment and Readjustment Framework (MQDR)
In the previous section, we adopted the MDP to continually and automatically describe the transition of actions and states. Reinforcement learning can find the optimal policy given a Markov decision process, where the policy maps states to actions. As illustrated in Figure 2, the agent interacts with the environment by perceiving the current state and selecting an action from the available set of actions. Following the execution of the chosen action, the agent receives a reward signal from the environment, and the system transitions to the next state. The agent leverages these transitions to learn from the environment during the training phase.

Most existing studies on the dynamic SFC deployment problem neglect the readjustment of already deployed SFCs, which can be beneficial in some cases. Figure 3 shows an example of such a case, where NodeA and NodeB have equal CPU resources and the grey squares represent the CPU resources occupied by the VNFs. Suppose NodeA's CPU resources are 75% occupied and NodeB's CPU resources are 70% occupied. A new SFC request that requires 40% of the CPU resources cannot be deployed successfully. However, if we move a VNF that occupies 25% of the CPU resources from NodeA to NodeB, we can deploy the new SFC request after the readjustment [37]. Moreover, readjustment enables us to deploy the new SFC on a path with lower latency to meet the requested latency requirements.

We propose a dynamic SFC deployment and readjustment method based on a deep Q network and the M Shortest Path Algorithm (MQDR), which is illustrated in Figure 4. When a request arrives dynamically, the underlying network is first readjusted by the well-trained agent B*. Agent B* adjusts some of the serving SFCs in the underlying network on the basis of the current network conditions and the arriving SFC request; it migrates some VNFs away from nodes that are close to resource saturation and thereby creates conditions for the arriving SFC to be deployed. Following the adjustment, agent A* is responsible for the dynamic deployment of the SFC. The two agents work collaboratively towards the goal of maximizing the number of SFC requests that can be accepted.

As discussed in the previous section, the action spaces for both dynamic SFC deployment and dynamic readjustment are discrete. Therefore, we utilize the DQN algorithm, a value-based algorithm that supports discrete actions. DQN combines neural networks and Q-learning: it feeds the current state directly into the neural network, computes the value of all actions, and outputs the best one.

Figure 4 depicts our framework, which involves two agents, A* and B*. In multi-agent reinforcement learning, the following relationships may exist among agents: (1) a cooperative relationship, where multiple agents share the same goal; (2) a competitive relationship, where one agent's gain is another agent's loss; (3) a mixed relationship, where agents cooperate with some and compete with others; and (4) an egoistic relationship, where agents care only about their own gain. Since both A and B aim to deploy more SFCs, they have a cooperative relationship. However, due to the added complexity of multi-agent reinforcement learning, directly applying the same training method as in single-agent reinforcement learning may result in inadequate performance. For instance, if we train A and B simultaneously and B finds an optimal policy at some point while A does not, then B's policy may change in the next iteration due to A's influence, and B's original optimal policy may no longer be optimal. This mutual interference may prevent the agents from converging to stable strategies. To avoid this problem, we obtain the optimal deployment policy for A first and then train the readjustment policy for B without changing A.

In this study, the neural network parameters of agent A depicted in Figure 4 are generated via random initialization. Subsequently, these parameters are trained through dynamic deployment in the training environment to obtain an optimal policy, denoted by A*. Next, we randomly initialize the neural network parameters of agent B and perform the readjustment training. In each step of this process, agent B first readjusts the underlying network, and then the SFC is deployed by following agent A*'s policy. This yields agent B*, which holds the optimal readjustment policy.
The neural network of each agent consists of an input layer, an output layer, and three fully connected hidden layers for processing intrinsic information. The input to this neural network is the current environmental state vector, and the output is the Q-value of each action. As analyzed in the previous section, the input state is s_t = [C^t, BW^t, I^t], where |C^t| = |V^S|, |BW^t| = |E^S|, and |I^t| = 9, so the number of neurons in the input layer is |V^S| + |E^S| + 9. The output corresponds to the index of a node chosen from the current network for deployment or readjustment operations, so the number of output neurons is |V^S|.
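A possible realization of this Q-network in Keras is sketched below. The layer widths follow the description above and the optimizer settings reported in Section 5 (Adam, learning rate 0.0005, 512 units per hidden layer); the ReLU activations are an assumption, since the paper does not state the activation function.

```python
import tensorflow as tf

def build_q_network(num_nodes: int, num_links: int, info_len: int = 9,
                    hidden_units: int = 512) -> tf.keras.Model:
    """Q-network sketch: the input is the state vector [C^t, BW^t, I^t] of length
    |V^S| + |E^S| + 9, three fully connected hidden layers process it, and the
    output layer gives one Q-value per physical node index."""
    state_dim = num_nodes + num_links + info_len
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(num_nodes),  # one Q-value per candidate node
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4), loss="mse")
    return model
```

For the CERNET2 topology used in Section 5, build_q_network(21, 23) would give an input layer of 53 neurons and an output layer of 21 neurons.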

DQN-Based Dynamic SFC Deployment Algorithm
This section presents the DQN-based dynamic SFC deployment algorithm and its training process. Algorithm 1 shows the pseudocode of our algorithm. It takes as input the initial network state and the set of SFC requests r_1, r_2, ..., r_m. After training, it outputs the dynamic deployment policy Π_1 (agent A*).
The action-value function Q and the target action-value function Q̂ are initialized, and Q̂ is reset to Q every C steps. In the following C steps, we use Q̂ to generate the target value y and use it to update Q. With the assistance of Q̂, we can enhance the overall stability of the training process [38].
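A minimal sketch of this periodic synchronization, assuming Keras-style models and an arbitrary period C (the paper does not report the value of C):

```python
def maybe_sync_target(q_net, target_net, step: int, sync_period: int = 100):
    """Reset the target network Q-hat to the online network Q every C steps
    (C = sync_period is an assumed value). Keeping Q-hat fixed between syncs
    stabilizes the target value y used in the updates."""
    if step % sync_period == 0:
        target_net.set_weights(q_net.get_weights())
```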
During training, the environment is reset and a new set of SFC requests is initialized in every episode. It is worth noting that the agent is not informed of the SFCs in advance; it learns of each request only when it arrives.
For DQN, the more actions that can be selected, the more difficult the training becomes. We therefore apply some rules to narrow down the selectable actions. First, nodes that fail to satisfy the resource and delay conditions are excluded. Second, we prioritize shorter paths to minimize latency and link resource usage: nodes are sorted by the delay from the location of the last deployed Network Function (NF) (or the ingress node), and the nearest m nodes are selected as the set Φ. Φ is the truly selectable action space for agent A. The benefits of this approach are demonstrated in our experiments.
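The candidate set Φ can be built along the following lines; this is an illustrative sketch using networkx, with assumed node attribute 'cpu_free' and edge attribute 'delay', rather than the actual MQDR implementation.

```python
import networkx as nx

def candidate_actions(g: nx.Graph, last_node, nf_cpu_demand: float,
                      remaining_delay_budget: float, m: int = 5):
    """Build the reduced action set Φ for agent A.

    Step 1: keep only nodes with enough free CPU whose shortest-path delay from the
            last deployment location still fits the remaining delay budget.
    Step 2: sort the survivors by that delay and keep the m nearest ones."""
    delays = nx.single_source_dijkstra_path_length(g, last_node, weight="delay")
    feasible = [(d, v) for v, d in delays.items()
                if g.nodes[v].get("cpu_free", 0.0) >= nf_cpu_demand
                and d <= remaining_delay_budget]
    feasible.sort(key=lambda x: x[0])     # nearest candidates first
    return [v for _, v in feasible[:m]]   # Φ: at most m node indices
```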
At each step, we select actions by the ε-greedy policy to increase the "exploration" of the agent. The complete state transition e_t = (s_t, a_t, r_t, s_{t+1}) of the step is then stored in the experience replay pool. During training, we sample a small batch from the experience replay pool and use Q̂ to generate a target y_j, which is the sum of the current reward and the discounted future reward. The neural network parameters θ are then updated by gradient descent.

DQN-Based Algorithm for Dynamic SFC Readjustment
This section presents the DQN-based dynamic SFC readjustment algorithm and its training process. Algorithm 2 takes as input the network state, the SFC requests, and the dynamic SFC deployment policy Π_1 (agent A*), and outputs the dynamic readjustment policy. The basic training framework of Algorithm 2 is similar to that of Algorithm 1: it also uses a target network, experience replay, and gradient descent.

Algorithm 1: DQN-based dynamic SFC deployment algorithm
Input: the underlying network state s_t, the set of dynamically arriving SFC requests r_1, r_2, ..., r_m.
Output: the dynamic SFC deployment policy Π_1.
1: Initialize the action-value function Q(s, a; θ), where θ are randomly generated neural network weights.
2: Initialize the target action-value function Q̂(s, a; θ⁻), where θ⁻ = θ.
3: Initialize the experience pool D with memory N.
4: for episode in range(EPISODES):
5:   Generate a new collection of SFCs.
6:   Initialize state s.
7:   for step in range(STEPS):
8:     Select the nodes that satisfy the resource and delay requirements.
9:     Among these nodes, select the m nodes closest to the last deployed node and add them to the set Φ.
10:    With probability ε, select an action a_t at random.
11:    Otherwise, select the action a_t = argmax_a Q(s_t, a; θ), a ∈ Φ.
12:    Execute action a_t and observe reward r_t.
13:    Store transition e_t = (s_t, a_t, r_t, s_{t+1}) in D.
14:    Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D.
15:    Set y_j = r_j if the episode ends at step j + 1; otherwise set y_j = r_j + γ max_a' Q̂(s_{j+1}, a'; θ⁻).
16:    Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to the network parameters θ.
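Lines 14-16 of Algorithm 1 correspond to one standard DQN update. The sketch below shows what such a step could look like in TensorFlow; it is a hand-rolled illustration of the same computation (the paper itself uses the tf-agents library), and the batch layout is an assumption.

```python
import tensorflow as tf

def dqn_update(q_net, target_net, optimizer, batch, gamma: float = 0.9):
    """One gradient step over a replay minibatch (s_j, a_j, r_j, s_{j+1}, done)."""
    states, actions, rewards, next_states, done = [tf.convert_to_tensor(x) for x in batch]
    actions = tf.cast(actions, tf.int32)
    rewards = tf.cast(rewards, tf.float32)
    done = tf.cast(done, tf.float32)

    # y_j = r_j for terminal transitions, otherwise r_j + gamma * max_a' Q_hat(s_{j+1}, a')
    next_q = target_net(next_states)
    targets = rewards + gamma * (1.0 - done) * tf.reduce_max(next_q, axis=1)

    with tf.GradientTape() as tape:
        q_values = q_net(states)                                 # Q(s_j, .; theta)
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        chosen_q = tf.gather_nd(q_values, idx)                   # Q(s_j, a_j; theta)
        loss = tf.reduce_mean(tf.square(targets - chosen_q))     # (y_j - Q)^2
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)
```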
At each step, the agent identifies the nodes that need readjustment due to excessive CPU resource usage and adds them to the set Φ. The agent then selects one node from this set. The target node is selected deterministically by (1) filtering the nodes with sufficient remaining resources and (2) choosing the nearest one among them as the target. Compared with letting the agent select both the source and the target node for readjustment, the action space is reduced from |V^S|^2 to |V^S|, which significantly simplifies the neural network training.
Once the readjustment node is selected, the agent moves the NF with the highest resource share on that node to the target node. Readjustment can improve the load balancing of the network and increase its capacity to accommodate more SFCs. After the readjustment, the SFC is deployed by the dynamic deployment policy Π_1.
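A sketch of this source/target selection is shown below. The node attribute names ('cpu_capacity', 'cpu_used', 'delay') and the 80% overload threshold are assumptions for illustration; the paper only states that nodes with excessive CPU usage are readjustment candidates and that the nearest sufficiently resourced node is chosen as the target.

```python
import networkx as nx

def overloaded_nodes(g: nx.Graph, usage_threshold: float = 0.8):
    """Candidate source nodes Φ for agent B: nodes whose CPU usage exceeds a threshold
    (the 0.8 value is an assumption; the paper gives no number)."""
    return [v for v, a in g.nodes(data=True)
            if a["cpu_used"] / a["cpu_capacity"] > usage_threshold]

def select_migration_target(g: nx.Graph, source_node, nf_cpu_demand: float):
    """Deterministic target for the migrated NF: (1) keep nodes with enough spare CPU,
    (2) pick the nearest one by shortest-path delay."""
    delays = nx.single_source_dijkstra_path_length(g, source_node, weight="delay")
    candidates = [(d, v) for v, d in delays.items()
                  if v != source_node
                  and g.nodes[v]["cpu_capacity"] - g.nodes[v]["cpu_used"] >= nf_cpu_demand]
    if not candidates:
        return None                                  # no feasible target; skip the readjustment
    return min(candidates, key=lambda x: x[0])[1]    # nearest well-resourced node
```

Because the agent only outputs the source node and the target is derived deterministically, the action space stays at |V^S| instead of |V^S|^2.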

Summary
This section introduced the framework of the DQN-based approach to dynamic SFC deployment and readjustment (MQDR), as well as the specific details of its two main components. MQDR differs from other existing research work in two aspects: (1) it integrates dynamic readjustment of deployed SFCs with the deployment of newly arrived SFC requests, which can improve the request acceptance rate, whereas existing research has addressed only either readjustment or deployment; and (2) it reduces the training difficulty and increases the practical value by restricting the action space. The performance evaluation will demonstrate the benefits of these two features in detail.

Algorithm 2: DQN-based dynamic SFC readjustment algorithm
Input: the network state s_t, the set of SFC requests G_1, G_2, ..., G_m, the dynamic SFC deployment policy Π_1.
Output: the dynamic SFC readjustment policy Π_2.
1: Initialize the action-value function Q(s, a; θ), where θ are randomly generated neural network weights.
2: Initialize the target action-value function Q̂(s, a; θ⁻), where θ⁻ = θ.
3: Initialize the experience pool D with memory N.
4: for episode in range(EPISODES):
5:   Generate a new collection of SFCs.
6:   Initialize state s.
7:   for step in range(STEPS):
8:     Generate the set of nodes that need to be readjusted on the basis of the state of the underlying network.
9:     With probability ε, select an action a_t at random.
10:    Otherwise, select the action a_t = argmax_a Q(s_t, a; θ).
11:    Select the target node (filter the nodes with sufficient resources and take the nearest one) and migrate the NF with the highest resource share on node a_t to it.
12:    Deploy the arriving SFC request with the deployment policy Π_1.
13:    Observe reward r_t, s_t ⇒ s_{t+1}.
14:    Store transition e_t = (s_t, a_t, r_t, s_{t+1}) in D.
15:    Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D.
16:    Set y_j = r_j if the episode ends at step j + 1; otherwise set y_j = r_j + γ max_a' Q̂(s_{j+1}, a'; θ⁻).
17:    Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to the network parameters θ.

Simulation Setup
The algorithm proposed in this paper is evaluated on the CERNET2 network topology [39] using SFCSim [40], a Python-based SFC simulation platform. CERNET2 is the next-generation backbone network of the China Education and Research Network, which connects universities, research institutions, and other educational organizations across China. SFCSim employs a discrete-time event scheduling engine that supports the simulation of scenarios such as static and dynamic deployment of SFCs, service migration, and mobility management. The platform also implements various benchmark algorithms for experimental comparisons.
The underlying CERNET2 network is depicted in Figure 5, with 21 physical nodes and 23 physical links, which can be abstracted as an undirected connected graph with 21 nodes and 23 edges. The server CPU resources are randomly generated within [10, 30] (units), the link bandwidth capacity is set to 20 Mbps, and the transmission delay is randomly generated within [0.7, 1.5] (ms). The SFCs arrive at different times, with their arrival times and service times following uniform distributions in [1200, 10,000] (units) and [1000, 1800] (units), respectively. Each SFC request contains 3~5 NFs, and its maximum end-to-end delay follows a uniform distribution in [10, 18].

The experiments in this paper are conducted on a PC with an Intel i7-10750H CPU (6 cores and 12 threads), an Nvidia GTX 2060 GPU (6 GB of video memory), and 16 GB of RAM. The PC runs the Windows 10 operating system and uses the PyCharm IDE for simulation. The SFCSim simulation platform is integrated with the TensorFlow 2.8 library based on Python 3.9 and the tf-agents 0.12 library for the DQN simulation. The learning rate is set to 0.0005, the neural network parameters are learned with the Adam optimizer [41], and each hidden layer has 512 nodes.

Experimental Results
Figure 6 compares the request acceptance rates of the five different methods at different request quantities. We conducted 20 experiments for each request quantity, generating a different set of SFC requests from the same parameters each time, and averaged the results. The five methods were as follows: (1) LBSP, a load-balanced shortest path algorithm [40], which deployed SFCs with fewer resources and used the shortest-delay path routes, giving stable performance; (2) MSPA, a method that selected the set of the m nearest well-resourced nodes to the previous VNF, denoted as Φ_m, and then randomly chose a node from Φ_m to deploy the next NF; (3) DQN, an online SFC deployment method based on deep Q networks [23]; (4) MQD, which combined DQN with MSPA, where the agent selected one node from Φ_m at each step; and (5) MQDR, which extended MQD by dynamically readjusting already deployed SFCs, as described in Section 4. Figure 7a,b show the training processes of DQN, MQD and MQDR when the SFC request quantity is 1000, and Figure 8 depicts the request acceptance rate of each algorithm as the requests arrive.

In the experiments, we adopted the ε-greedy method in DQN, MQD and MQDR and reduced ε from 0.9 to 0.05. The discount factor γ was set to 0.9. The parameter m of MSPA was set to 5 in MQD and MQDR.
As depicted in Figure 6, the request acceptance rates of all five algorithms decreased to different degrees as the quantity of SFC requests increased: as more underlying resources were occupied, subsequent SFC requests were less likely to be accepted and deployed successfully. LBSP outperformed MSPA at low to medium volumes, whereas they had similar performance at high volumes. This indicates that LBSP was the stronger method, and the uncertainty MSPA introduced by randomly selecting from Φ_m did not achieve better results than LBSP.
MQD was the second most effective method overall. It improved the request acceptance rate by an average of 5.5% compared with MSPA by using DQN instead of randomly selecting actions from Φ_m, and it outperformed LBSP. When comparing MQD with DQN, as depicted in Figure 7a, MQD also achieved a significant improvement over DQN by limiting the action space, which saved bandwidth resources and reduced the training difficulty compared with selecting a node from all nodes.
MQDR outperformed MQD and achieved the best performance among all the algorithms. This is because MQDR dynamically adjusted the existing deployment on the basis of the current underlying status when each new request arrived, which led to a more balanced allocation of resources and reduced the negative factors that hindered the deployment of new requests. Figure 9 shows the training processes of Q-networks with different numbers of hidden layers, and Table 3 presents their performance in the test environment. The figure indicates that the three-layer and four-layer networks achieved similar results in training, while the three-layer network performed better in the test; too many layers may cause overfitting and gradient instability during training. The two-layer network performed the worst, as it was not deep enough to capture all the intrinsic features.

Table 3. Request acceptance rates of Q-networks with different numbers of hidden layers.

Number of Hidden Layers    Request Acceptance Rate
2                          0.56
3                          0.71
4                          0.68

Figure 10 shows the effect of different discount factors γ on the training of the agents. The discount factor γ is used to calculate the target value during training and regulates the trade-off between immediate and long-term effects in reinforcement learning, i.e., how far ahead the agent looks when making decisions, in the range (0, 1]. The larger the γ, the more steps the agent considers going forward, but the more difficult the training becomes; the smaller the γ, the more the agent focuses on immediate benefits and the easier it is to train. Although we want the agent to think in the long term, excessively high discount factors may impede convergence. As observed in Figure 10, training proceeded swiftly when γ = 0.4, and the best result was achieved when γ = 0.9. On the other hand, the algorithm trained most slowly when γ = 0.99, implying that, while unfinished services can influence dynamic deployment, it is not necessary to consider overly long-term impacts in a limited training cycle.

The training used the ε-greedy approach to enhance the agent's exploration ability, where ε represents the probability that the agent randomly chooses an action. The best states are often explored only with a good enough Q-network. If ε is too large, excessive exploration in the later stages of training may limit the utilization of acquired knowledge and hinder further performance improvement; if ε is too small, exploration in the early stages may be insufficient. Instead of setting ε to a fixed value, our dynamic decrement strategy starts with ε = 0.9, the strongest exploration, gradually reduces it linearly until it reaches 0.05, and then keeps it constant. This approach provides adequate exploration in the initial stages and maximizes the use of acquired knowledge in the later stages. The simulation results depicted in Figure 11 demonstrate that the dynamic decrement strategy prolonged the effective learning time and ultimately yielded better training outcomes.
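The linear decrement of ε can be implemented in a few lines, as in the sketch below; the number of decay steps is an assumed parameter, since the paper reports only the start value (0.9) and the end value (0.05).

```python
def epsilon_schedule(step: int, eps_start: float = 0.9, eps_end: float = 0.05,
                     decay_steps: int = 10_000) -> float:
    """Linear epsilon decay for epsilon-greedy exploration: start at eps_start,
    decrease linearly over decay_steps training steps, then hold at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```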
The impact of varying m on agent training is shown in Figure 12. The MQDR algorithm requires an appropriate value of m to balance exploration and exploitation. When m is too small, it approaches the LBSP algorithm (m = 1), which limits the selection space and may miss better actions. When m is too large, it approaches the original DQN algorithm (m = n), which enlarges the selection space and may make training difficult. Therefore, choosing a suitable value of m is crucial for the performance of MQDR. In our simulation experiments, we achieved the best results when m was 5.



Conclusions
This paper focuses on the problem of how to deploy dynamically arriving SFC requests more effectively on the SDN/NFV-enabled network. We propose a DRL-based method MQDR that reduces the action space of the intelligent agents to facilitate their training and enhance their performance. We also demonstrate that performing some readjustment on the basis of the SFC request information and the underlying network state before deploying the SFCs can improve the load balancing of the network and increase its capacity to accommodate more SFCs. Simulation experiments indicate that our approach is suitable for online deployment in dynamic networks; it increases the request acceptance rate by approximately 25% compared with the original DQN algorithm and by 9.3% compared with the Load Balancing Shortest Path (LBSP) algorithm. However, despite considering dynamic SFC request arrival, our study assumes that the SFC traffic remains constant following the request arrival. In future work, we will delve more deeply into deploying and readjusting service chains when both SFC request arrival and traffic are dynamic.