A Q-Learning-Based Approach for Deploying Dynamic Service Function Chains

Abstract: As the size and service requirements of today's networks gradually increase, large numbers of proprietary devices are deployed, which leads to network complexity and information security crises and makes network service and service provider management increasingly difficult. Network function virtualization (NFV) technology is one solution to this problem. NFV separates network functions from hardware and deploys them as software on common servers. NFV can be used to improve service flexibility and isolate the services provided to each user, thus guaranteeing the security of user data. The use of NFV technology therefore raises many problems worth studying. For example, when there is a free choice of network path, one problem is how to choose a service function chain (SFC) that both meets the requirements and offers the service provider maximum profit. Most existing solutions are heuristic algorithms with high time efficiency or integer linear programming (ILP) algorithms with high accuracy. It is necessary to design an algorithm that symmetrically considers both time efficiency and accuracy. In this paper, we propose the Q-learning Framework Hybrid Module algorithm (QLFHM), which uses reinforcement learning to solve this SFC deployment problem in dynamic networks. The reinforcement learning module in QLFHM is responsible for outputting alternative paths, while the load balancing module in QLFHM is responsible for picking the optimal solution from them. The results of a comparison simulation experiment on a dynamic network topology show that the proposed algorithm can output an approximately optimal solution in a relatively short time while also considering the network load balance. Thus, it achieves the goal of maximizing the benefit to the service provider.


Introduction
Currently, most networks use a large number of dedicated hardware devices that provide features such as firewalls and network address translation (NAT). The various services provided by service providers usually require specialized hardware devices. As networks grow in size and emerging industries such as big data [1,2] and cloud computing [3][4][5] expand rapidly, starting a new service requires deploying a variety of dedicated hardware devices, making it extremely difficult. [...] play to the advantages of the algorithm. Therefore, the use of the Q-learning algorithm can greatly reduce the training time and computational complexity.
To optimize the deployment of SFCs in a dynamic network, we integrate RL into the problem and design a new deployment decision algorithm. This paper studies the problem of deploying SFCs in a multi-server dynamic network. Unlike in a data center network [26,27], the nodes of a multi-server network have fewer resources, and SFC deployment is more difficult. Due to the characteristics of dynamic networks, new SFCs may need to be deployed at any time, and some existing services may need to be cancelled. To accomplish these tasks, we propose a real-time online deployment decision algorithm called QLFHM. After learning the entire topology and the use of virtual resources, the algorithm uses the RL module and the load balancing module to output an SFC immediately. We compared our proposed algorithm with other algorithms in a simulation experiment and evaluated it repeatedly. The simulation results show that the algorithm achieves good performance with regard to decision time, load balancing, deployment success rate, and deployment profit.
The rest of this paper is organized as follows. Section 2 provides an overview of current work related to the field. In Section 3, we describe the problem models we want to solve, including the network model, user requests, and dynamic deployment adjustments. To solve these problems, we propose our algorithm model in Section 4. We present a comparison with other algorithms in Section 5. Finally, Section 6 summarizes the paper.

Related Work
In NFV networks, network functions are implemented as VNFs in software form. The characteristics of VNFs allow them to be deployed flexibly and ensure the security of users. Therefore, key consideration needs to be given to the placement of VNFs to meet service requirements, quality of service, and the interests of service providers. This type of problem is called the VNF Placement (VNF-P) problem and has been proven to be a non-deterministic polynomial-time hard (NP-hard) problem [28]. Consequently, it is often difficult to find the optimal solution of a VNF-P problem.
The study of deployment problems is divided into static and dynamic deployment problems. The difference is that during static deployment, a deployed SFC remains in the network permanently; in contrast, during dynamic deployment, it is withdrawn after some period.
In a static problem, deployment is the equivalent of an offline decision: all the requirements are considered when choosing the deployment. Because a deployed SFC is not retracted after placement, the main consideration is how to arrange more SFCs, which is also the main evaluation criterion. For example, the BSVR algorithm proposed by Li et al. [29] mainly considers load balancing and the number of accepted SFCs. In addition, unlike us, they set up a consistent type of VNF that can be shared by multiple SFCs.
Here, we study the dynamic problem, which is closer to the real network situation [30]. In a dynamic situation, an SFC is withdrawn after some deployment period, making the network more fluid.
The methods for facing these two kinds of problems are similar. To obtain the optimal solution of the VNF-P problem, mathematical programming methods such as integer linear programming (ILP) and mixed ILP (MILP) are the most popular approaches [31]. The next most popular approaches involve heuristic algorithms [32] or a combination of heuristic algorithms and ILP. Although there are different optimization approaches to VNF-P problems, the constraints of these approaches are generally similar and include bandwidth resources, IT resources, link delay, VNF deployment, and cost and profit considerations [33][34][35].
For example, Bari et al. [28] approximately expressed the VNF-P problem as an ILP model and solved it with a heuristic algorithm that attempts to minimize operating expense (OPEX) and maximize network utilization. Gupta et al. [33] tried to minimize bandwidth consumption. Luizelli et al. [31] also developed an ILP model that seeks to minimize both the end-to-end delay and the resource overhang ratio. J. Liu et al. [14] proposed the column generation (CG) model algorithm based on ILP and attempted to maximize the service provider's profit and the request acceptance ratio. Some of the papers mentioned above have reported that the execution times for solving these mathematical models increase exponentially with the size of the network. After solving the model with optimization software or an exact algorithm, they immediately proposed a corresponding heuristic algorithm.
Although the execution times of heuristic algorithms are much lower than those of ILP, most existing heuristic algorithms provide only near-optimal solutions. However, considering the time savings, heuristic algorithms form the main approach to solving VNF-P problems.
Some recent solutions to the VNF-P problem have applied machine learning techniques. Kim et al. [36] constructed the entire problem as an RL model. Although the results of this approach may not differ much from the optimal solution, using it in complex network situations results in extremely long training times.
We also seek to avoid the shortcoming of excessively long training times while using reinforcement learning. Given the tradeoff between the accuracy of ILP algorithms and the time efficiency of heuristic algorithms, this paper proposes the QLFHM algorithm, which combines RL and heuristic algorithms. After comparing QLFHM with benchmark algorithms, we conclude that the QLFHM algorithm not only guarantees an approximately optimal solution but also guarantees time efficiency when dynamically deploying an SFC.

Problem Description
We studied the problem of deploying an SFC across multiple servers in a dynamic network. Our goal is to maximize the service provider's profit while providing security-guaranteed services.
We consider a scenario in which multiple input requests need to be deployed from a source server to a target server over an appropriate link. The link must support the VNFs included in the request. Because the capacities of all servers and links are limited, consideration should be given to how the SFCs to be deployed are distributed, so that more SFCs can be deployed.
We describe the problem model in the next sections, including the research motivation, network model, request model, and dynamic SFC deployment.

Research Motivation
Consider a network with multiple servers, each of which supports the deployment of VNFs but has limited IT resources; link bandwidth resources are also limited. Requests dynamically switch between active and offline states. The problem is thus to determine how to deploy the SFCs so as to maximize the request acceptance ratio and the service provider's profit and minimize the computation time, while satisfying all the constraints.

Network Model
The network can be seen as a graph G = (V, E), where V denotes the set of nodes and E is the set of links between nodes. Each e ∈ E represents a physical link between two network nodes; we use B_e ∈ N to denote a link's bandwidth capacity. Each v ∈ V is a network server, which functions as both a user access point and a switch; each server also has an IT resource capacity, and we use I_v ∈ N to denote the IT resources of v. T represents the set of all VNFs, and T_v ⊆ T denotes the set of VNFs that can be deployed at v. All servers can offer NFV services; however, some servers support only a subset of the services. We assume that bandwidth resources and node computing resources are limited. We use a Boolean variable p_{j,v} to represent whether the j-th VNF vnf_j is deployable on v ∈ V: a p_{j,v} value of 1 denotes deployable and a value of 0 denotes undeployable.
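As an illustration of this model, the graph, the capacities B_e and I_v, and the deployability variable p_{j,v} can be sketched in Python (all class and method names here are ours, not the paper's):

```python
# Minimal sketch of the network model G = (V, E); all names here
# (NetworkModel, add_node, ...) are illustrative, not from the paper.
class NetworkModel:
    def __init__(self):
        self.it_capacity = {}   # v -> I_v, the node's IT resources
        self.bandwidth = {}     # link -> B_e, the link's bandwidth capacity
        self.supported = {}     # v -> T_v, VNF types deployable at v

    def add_node(self, v, it, vnf_types):
        self.it_capacity[v] = it
        self.supported[v] = set(vnf_types)

    def add_link(self, u, v, bw):
        self.bandwidth[frozenset((u, v))] = bw  # undirected link

    def p(self, j, v):
        """Boolean p_{j,v}: 1 if VNF type j is deployable on node v, else 0."""
        return 1 if j in self.supported.get(v, set()) else 0

net = NetworkModel()
net.add_node("v1", it=4, vnf_types={"firewall", "nat"})
net.add_node("v2", it=2, vnf_types={"nat"})
net.add_link("v1", "v2", bw=10)
```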

Request Model
RE is used to represent all incoming requests. Each request i ∈ RE is represented by the variables s_i, d_i, P_i, r_i. Here, s_i ∈ V refers to the client's access node, d_i ∈ V is the data provider node required by the user, P_i is a vector of VNFs drawn from T that gives the required VNF sequence of the requested SFC, and r_i ∈ R+ refers to the unit compensation paid after the successful deployment of the request, which is related to the number of VNFs, num_vnfs_i. We use ω to represent the unit value and, for convenience, set ω = 1:

r_i = ω · num_vnfs_i.

The profit gained after successful deployment of the SFC of user i is represented by profit_i, and l_i represents the chain length of a successfully deployed SFC, i.e., the number of nodes in the SFC. A successfully deployed sfc_i should match the starting point s_i and destination d_i, select an appropriate chain, and arrange the VNF sequence in order on the chain nodes. The chain length l_i is limited by the compensation r_i, since service providers need to ensure their profitability. x_i is a Boolean variable that indicates whether the request of user i is successfully deployed: if its SFC is online, x_i is 1; otherwise, x_i is 0. Node_SFC_i represents all the nodes in sfc_i, Node_vNFs_i represents the nodes that can deploy VNFs in sfc_i, and Link_SFC_i represents all the links in sfc_i. z_{i,j,π} is a Boolean variable that equals 1 if user i uses the path π ∈ Link_SFC_i between the node that deploys its j-th VNF and the node that deploys its (j + 1)-th VNF, and 0 otherwise.
Equation (5) ensures that the nodes deploying VNFs do not include the user access node s_i or the service access node d_i. Equations (6) and (7) ensure that when sfc_i is online, the VNFs in P_i are deployed along sfc_i in sequence.
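The request tuple (s_i, d_i, P_i, r_i) and the compensation rule r_i = ω · num_vnfs_i with ω = 1 can be sketched as follows (identifiers are illustrative, not the paper's):

```python
# Sketch of the request model (s_i, d_i, P_i, r_i); identifiers are ours.
OMEGA = 1  # unit value ω, set to 1 as in the text

def make_request(s, d, vnf_sequence):
    """Build a request whose compensation r_i is tied to num_vnfs_i."""
    return {"s": s, "d": d, "P": list(vnf_sequence),
            "r": OMEGA * len(vnf_sequence)}

req = make_request("v1", "v5", ["firewall", "nat", "dpi"])
```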

Dynamic SFC Deployment
We assume a dynamic request scenario: the service provider provides service for each new request and cancels the service of a previously deployed SFC at the end of its service request time. Requests arrive at certain time intervals. Therefore, at every moment, the service provider addresses two types of user requests, which mainly affect its operating expenses: checking whether a new request has arrived and whether an online service chain exists that needs to be cancelled.
The goal of dynamic deployment is to maximize service provider profits. The IT resources and bandwidth capacities act as deployment constraint conditions; however, they affect only the ability to deploy, not the operating cost.
Node_vNFs_put_i represents the set of nodes on which VNFs are deployed for sfc_i. I_max_v represents the maximum IT capacity of node v, and I_use_j is the IT capacity needed by the j-th VNF. Link_vNFs_put_i is the set of links that belong to sfc_i. B_max_e denotes the maximum bandwidth capacity of link e, and B_use_i is the bandwidth capacity needed by sfc_i. For every v ∈ V:

∑_{i ∈ RE} x_i · ∑_{j: vnf_j of sfc_i is deployed on v} I_use_j ≤ I_max_v, (8)

and for every e ∈ E:

∑_{i ∈ RE: e ∈ Link_vNFs_put_i} x_i · B_use_i ≤ B_max_e. (9)

Equations (8) and (9) ensure that the users will not use more IT and bandwidth resources than the total capacity available during any service time.
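A minimal sketch of the capacity checks behind Equations (8) and (9), under our simplified bookkeeping (the function name and data layout are illustrative assumptions):

```python
def within_capacity(online_sfcs, I_max, B_max):
    """Check that online SFCs never exceed node IT or link bandwidth capacity.

    online_sfcs: list of dicts with keys
      'vnf_nodes': list of (node, it_use) pairs for deployed VNFs,
      'links':     list of (link_key, bw_use) pairs for the chain's links.
    """
    it_used = {v: 0 for v in I_max}
    bw_used = {e: 0 for e in B_max}
    for sfc in online_sfcs:
        for v, it in sfc["vnf_nodes"]:
            it_used[v] += it            # accumulate Equation (8)'s left side
        for e, bw in sfc["links"]:
            bw_used[e] += bw            # accumulate Equation (9)'s left side
    ok_nodes = all(it_used[v] <= I_max[v] for v in I_max)
    ok_links = all(bw_used[e] <= B_max[e] for e in B_max)
    return ok_nodes and ok_links

sfcs = [{"vnf_nodes": [("v1", 1)], "links": [("e1", 1)]},
        {"vnf_nodes": [("v1", 1)], "links": [("e1", 1)]}]
```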
It should be noted, however, that some requests may be blocked because their long deployment chains would result in no profit. The Boolean variable y_i indicates whether the request of user i has been successfully deployed. The goal of this problem is to maximize the service provider's profit K, described as follows:

K = ∑_{i ∈ RE} y_i · profit_i. (10)

We depict the dynamic SFC deployment and revocation process in Figure 1. At each moment, the situations represented by Figure 1a,b may occur in the network.
In Figure 1a, after an SFC is generated for a request, the SFC is deployed on the corresponding path when sufficient resources are available. The path's bandwidth resources are consumed (assume one unit is consumed), and the IT resources of each server deploying a VNF are also consumed (again, assume one unit), but the user access node and the data provider node do not consume IT resources.
In Figure 1b, after the service time of an SFC expires, the SFC is dropped from the network. The path's bandwidth resources and the IT resources of the servers that deployed the VNFs are recovered.
To maximize profits, service providers should first attempt to make each SFC as short as possible while meeting the demand. For the overall network, IT resources and bandwidth resources must be balanced properly, which means that SFCs should be distributed as widely as possible rather than crowded together. In this way, at any given time, there are more nodes and paths to choose from when forming new SFCs.


Q-Learning Framework Hybrid Module Algorithm
In this section, we use the Q-learning framework hybrid module algorithm (QLFHM) to address dynamic SFC deployment. First, to reduce the problem complexity, we divide the solution process into two parts: the RL module outputs several of the shortest paths that meet certain requirements, and the load balancing module takes these multiple routing outputs from the previous module and obtains the final solution of the problem. The goal of this hierarchical processing is to reduce the training and learning times and achieve an efficient output scheme. The architecture of QLFHM is shown in Figure 2.


Preliminaries
Problems that can be solved by Q-learning generally conform to a Markov decision process (MDP) and have no aftereffect; that is to say, the next state of the system is related only to the current state information and is unrelated to earlier states. Unlike the Markov chain and Markov models, the MDP considers actions: the next state of the system is related not only to the current state but also to the action currently taken. The dynamic process of an MDP is shown in Figure 3. The return value r is based on state s and action a; each combination of s and a has its own return value, so the MDP can also be represented as in Figure 4.



The problem in this paper conforms to a Markov decision process, with the difference that the number of decision-making steps is limited. Therefore, after our study, we combine Q-learning with this problem and optimize the Q-learning algorithm for it. In this way, we not only take advantage of reinforcement learning but also avoid its defects, which provides new ideas for dynamic SFC deployment.
A key point of the algorithm proposed in this paper is that it improves the Q matrix, changing the original two-dimensional matrix into a five-dimensional matrix with subscripts now_h, now_node, action_node, end_node, and h. The subscript now_h refers to the number of hops already visited in the current state; now_node is the node the agent occupies in the current state; action_node ranges over the next available nodes; end_node is the node that the SFC will eventually reach; and h is the minimum number of hops that can meet the deployment requirements.
As shown in Figure 5, after the agent responds to the environmental state, the environment changes state and returns a reward r, with which the Q matrix is updated. The agent repeats this behavior until the Q matrix converges. The Q-learning update is shown in Equation (11):

Q(s, a) = Q(s, a) + α [ r + γ · max_{a′} Q(s′, a′) − Q(s, a) ]. (11)
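Equation (11) is the standard tabular Q-learning update; a minimal dictionary-backed sketch (the function name and the ALPHA/GAMMA values are illustrative assumptions):

```python
# Tabular form of the update in Equation (11); Q is a dictionary keyed
# by (state, action). The ALPHA/GAMMA values here are illustrative.
ALPHA, GAMMA = 0.5, 0.9

def q_update(Q, s, a, r, s_next, actions_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_next), default=0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best_next - Q.get((s, a), 0.0))
    return Q[(s, a)]

Q = {}
q_update(Q, "s0", "a0", 1000.0, "s1", [])  # reward 1000 on reaching the goal
```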
The values stored in Q are the recommended values for the next action in each state: the higher a recommended value, the more the action is worth performing. These recommended values are formed during the training phase of the Q matrix.
In Equation (11), Q is a matrix that stores the recommended values of the executable actions in the current state; depending on these values, the agent decides which action to take next. In the Q matrix, the subscript s refers to the state, a to the action, s′ to the future state, and a′ to the future action; r is the reward value, which comes from the reward matrix R. Here, α and γ are the learning rate and the discount factor, both between 0 and 1.
In R, we set to 1000 the value of every element whose subscripts have the same now_node and end_node. This value represents the reward given for completing the pathfinding task.
For Q[now_h, now_node, action_node, end_node, h], specifically, when the four subscripts now_h, now_node, end_node, and h, which represent the state, are determined, the subscript action_node is iterated over; the maximum value is selected, and the action corresponding to its subscript action_node is executed. The advantages of this approach are that (1) it makes the state more observable, and (2) it divides the states into several independent parts, which is conducive to parallel programming techniques. The recommended values stored in Q are determined by Equation (11), which is a simplified version of Equation (12).
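With the four state subscripts fixed, action selection reduces to an argmax over action_node. A sketch using a dictionary keyed by the five subscripts (the function name and the values are illustrative):

```python
# Greedy decision over the five-dimensional Q: the four state subscripts
# (now_h, now_node, end_node, h) are fixed and action_node is iterated.
# Dictionary keys stand in for the 5-D matrix; values are illustrative.
def choose_action(Q, now_h, now_node, end_node, h, candidate_nodes):
    return max(candidate_nodes,
               key=lambda a: Q.get((now_h, now_node, a, end_node, h), 0.0))

Q = {(0, "v1", "v2", "v5", 3): 4.0,
     (0, "v1", "v3", "v5", 3): 7.5}
```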
Some of the parameters and variables used in the QLFHM algorithm are described in Table 1.
Table 1. Parameters and variables in the Q-learning Framework Hybrid Module (QLFHM) algorithm.

G: Information about the network topology
V: The list of nodes in the network topology
V_i: The nodes adjacent to node i
h_max: The maximum number of hops allowed by the model
h_min: The minimum number of hops allowed by the model

Reinforcement Learning Module
In this section, we propose the RL module, which is responsible for outputting alternative paths based on the network topology. The content is divided into two parts: a training stage and a decision stage. In the first part, we first provide the original Q-learning training algorithm and then the training algorithm optimized for this problem; both algorithms have advantages and disadvantages. In the second part, we propose the algorithm used in the decision stage.

Original Q-Learning Training Algorithm
In the training phase, training data are not required; this phase automatically generates the RL model according to the basic network topology information.
Algorithm 1 adopts the standard Q-learning training method, that is, iterative trial and error. The rewards attained during the repeated attempts finally cause the Q matrix to converge. The variable u is added for the greedy strategy, which enables the agent to improve the explored path in most cases while still making it possible to explore new paths.
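The role of u is the usual exploration/exploitation switch; a sketch of such a u-greedy choice (the function name and the default threshold are our assumptions, not the paper's exact values):

```python
import random

# u-greedy action choice used during training (sketch; names are ours).
def pick_next(Q, state, actions, u=0.9, rng=random):
    """With probability u, exploit the best-known action; otherwise explore randomly."""
    if rng.random() < u:
        return max(actions, key=lambda a: Q.get((state, a), 0.0))
    return rng.choice(actions)

Q = {("s", "a1"): 2.0, ("s", "a2"): 5.0}
```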
The advantages of using the Q-learning algorithm in this paper are as follows: (1) we can observe and understand the decision-making process, and the use of the Q matrix is more intuitive and comprehensible; (2) we can quickly convert this algorithm to a deep Q-learning algorithm, which uses a neural network (DQN) to replace the Q matrix for decision making; and (3) Q-learning is conducive to the improvement of the algorithm proposed in this paper and is suitable for solving the problems in this paper.
The algorithm based on Q-learning obtains satisfactory results, but it also faces some problems. For example, when facing complex situations or very large networks, the training period becomes excessive. Consequently, an improved version is presented in Algorithm 2, which is optimized for our problem.

Optimized Q-Learning Training Algorithm
Algorithm 2 is an improved version of the Q-learning algorithm tailored to our problem. It abandons the trial-and-error learning mode of the original algorithm and adopts a method similar to neural diffusion, which results in a hundredfold reduction in training time.
In the Q matrix, we did not encode the state with a single index as in the original Q-learning algorithm; instead, we divided the state into four indexes. An advantage of this approach is that, although Algorithm 2 normally executes on a single computer, when greater efficiency is required it can work in a distributed manner starting from line 5: because the four indexes make some states independent, distributed computing can be used to reduce the execution time. The main functions in Algorithm 2 are described in Algorithm 3.

Algorithm 2 (fragment):
7: Find_way(Q, R, G, h_min, h_max, h, chain);
8: End For

Algorithm 3. Find_way(Q, R, G, h_min, h_max, h, chain):
5: For each node v2 ∈ V_v0 do
6:   If v2 is not in chain_tmp then
7:     chain_tmp = v2 + chain_tmp;
8:     Find_way(Q, R, G, h_min, h_max, h, chain);
9:     If h ≥ h_min then
10:      For i in chain_tmp do
11:        Write the link to the Q matrix;
         End For
14:    End If
15:  End If
16: End For
17: End While
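The diffusion-style training of Algorithms 2 and 3 amounts to enumerating loop-free paths of bounded hop count and writing each qualifying chain into the Q matrix. A simplified Python reconstruction of the enumeration step (our sketch, not the paper's exact code; it returns the paths instead of writing to Q):

```python
# Simplified reconstruction of the path enumeration behind Find_way
# (Algorithm 3). Instead of writing links into the Q matrix, this sketch
# returns every loop-free path whose hop count lies in [h_min, h_max].
def find_ways(adj, start, end, h_min, h_max):
    paths = []

    def walk(node, chain):
        hops = len(chain) - 1
        if node == end and h_min <= hops <= h_max:
            paths.append(list(chain))
        if hops >= h_max:
            return
        for nxt in adj.get(node, ()):
            if nxt not in chain:        # keep the chain loop-free
                chain.append(nxt)
                walk(nxt, chain)
                chain.pop()

    walk(start, [start])
    return paths

adj = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"], "v4": []}
routes = find_ways(adj, "v1", "v4", 2, 3)
```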

Complexity Analysis of Original and Optimized Q-Learning Training Algorithm
In this section, we give the time complexities of the original and optimized Q-learning training algorithms.
We use ite to represent the number of iterations of the original Q-learning training algorithm in the trial-and-error training process; it is a very large number and the reason for the long training time.
The time complexity of the original Q-learning training algorithm is given in Equation (13), and that of the optimized Q-learning training algorithm in Equation (14), where h_max is less than 13 and n represents the number of nodes in the topology. However, ite is not a constant; it increases significantly as n and h_max increase. Equation (14) shows the worst case of a fully connected network. Therefore, for the problem solved in this paper, the optimized algorithm consumes much less time.

Q-Learning Decision Algorithm
After the Q matrix has largely converged, the training phase terminates. During the decision phase, we use the Q matrix to output multiple alternative paths CA that meet the input requirements. These paths are then sent to the load balancing module, which makes the final selection. The whole process is described in Algorithm 4.
It is worth mentioning that even in the decision stage, the Q matrix is not necessarily permanently static; it can be adjusted to the actual situation to support supplementary learning of new paths or nodes and the removal of expired paths and nodes.

Algorithm 4. Q-learning decision-making process
1: read the trained matrix Q
2: read the user request list RE
3: For every re in RE do
4:   Select some optional paths PA from Q;
5:   For every pa in PA do
6:     If the pa can deploy the required VNFs then
7:       add pa to the candidate list CA;
8:     End If
9:   End For
10:  If the candidate list CA is empty then
11:    deployment for this re failed;
12:    continue;
13:  End If
14:  Send the candidate list CA to the load balancing module;
15: End For
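The candidate-filtering loop of Algorithm 4 can be sketched as follows. The `vnf_support` map and the greedy in-order placement test are illustrative assumptions about what "the pa can deploy the required VNFs" entails:

```python
def select_candidates(q_paths, required_vnfs, vnf_support, max_candidates=100):
    """Filter the paths recommended by the Q matrix: keep only paths whose
    nodes can host the required VNFs, one node per VNF, in order.

    q_paths:       iterable of node tuples proposed by the RL module
    required_vnfs: list of VNF types the request needs
    vnf_support:   dict node -> set of VNF types that node's server supports
    """
    candidates = []
    for path in q_paths:
        it = iter(path)          # greedy in-order placement check
        placed = 0
        for vnf in required_vnfs:
            for node in it:
                if vnf in vnf_support.get(node, set()):
                    placed += 1
                    break
        if placed == len(required_vnfs):
            candidates.append(path)
        if len(candidates) >= max_candidates:
            break
    return candidates  # empty list => deployment of this request fails

paths = [(0, 1, 2), (0, 3, 2)]
support = {0: {"fw"}, 1: {"nat"}, 2: {"fw"}, 3: set()}
print(select_candidates(paths, ["fw", "nat"], support))  # → [(0, 1, 2)]
```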

Load Balancing Module
The load balancing module adopts a scoring system. It scores each SFC output by the previous module, and the optimal choice is deployed.
First, consider the link weight weight_link ∈ [0, 1] and the node weight weight_node ∈ [0, 1], which represent the proportions by which the score focuses on links or nodes. When no special requirement exists, we set both weights to 0.5. Next, we consider the weights of specific nodes, weight_node_v ∈ N, and of specific links, weight_link_e ∈ N. To urge the SFC to pass through a particular node or link, we increase its weight, which increases the probability that the SFC will traverse that node or link. When no special requirement exists, these weights remain unchanged.
Finally, the link bandwidth resources B_e and the node computing resources I_v are combined with the weights mentioned above to obtain the final score. The higher the score, the more the represented path is worth deploying. The score is calculated by Equation (16). Using Equation (16), we can construct Algorithm 5, which takes the output of the RL module as input and outputs the final decision result.

Algorithm 5. The load balancing scoring process
1: read the information from G
2: read the candidate list CA
3: For every pa in CA do
4:   calculate the score of pa using Equation (16);
5: End For
6: take the path with the highest score from the candidate list CA;
7: record the start time t_start-re and the end time t_end-re;
8: add re to the online SFC list ONL;
9: change the resource residuals in the topology;
10: If any re in ONL reaches t_end-re then
11:   return the related resources to the topology;
12: End If

Due to the flexibility of the independent scoring system, it can be customized for problems that involve nodes that must be traversed as well as nodes that must be bypassed. By further adjusting its parameters and structure, this algorithm can also be used to solve problems related to virtual machine consolidation and dynamic application sizing [37].
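Equation (16) itself did not survive extraction; the sketch below shows one plausible linear form of such a score, combining residual link bandwidth B_e and residual node resources I_v under the weights described above. The function name, the per-element averaging, and the example numbers are illustrative assumptions:

```python
def score_path(path, B, I, weight_link=0.5, weight_node=0.5,
               link_w=None, node_w=None):
    """Score a candidate path: higher residual bandwidth on its links and more
    residual IT resources on its nodes give a higher score. This linear form
    is an illustrative stand-in for the paper's Equation (16).

    B: dict (u, v) -> residual bandwidth of the link
    I: dict node   -> residual IT resources of the node
    link_w / node_w: optional per-link / per-node weight overrides
    """
    link_w = link_w or {}
    node_w = node_w or {}
    links = list(zip(path, path[1:]))
    link_score = sum(link_w.get(e, 1) * B[e] for e in links) / len(links)
    node_score = sum(node_w.get(v, 1) * I[v] for v in path) / len(path)
    return weight_link * link_score + weight_node * node_score

B = {(0, 1): 3, (1, 2): 2, (0, 3): 1, (3, 2): 1}
I = {0: 4, 1: 4, 2: 2, 3: 1}
# The path over better-provisioned links and nodes scores higher:
print(score_path((0, 1, 2), B, I) > score_path((0, 3, 2), B, I))  # → True
```

Raising `link_w[e]` or `node_w[v]` for a specific link or node steers the selection through it, which is how the required-traversal customization described above would work.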
Dividing the dynamic SFC deployment problem into two parts reduces the scale of the problem and improves execution efficiency. First, the improved Q-learning training algorithm reduces the training time roughly a hundredfold. In addition, the independent scoring system is highly flexible and can be customized for specific problems.

Performance Evaluation and Discussion
In this section, we compare the QLFHM algorithm with two other algorithms to evaluate the performance of the proposed dynamic SFC deployment method. We first describe the simulation environment, then present the performance metrics used for comparison in the simulation, and finally describe the main simulation results.

Simulation Environment
The simulation uses the US network topology, which has 24 nodes and 43 edges. Here, we assume that each server and switch are combined, which means that all nodes have local servers but not necessarily all VNFs. Each server's IT resource capacity is 4 units, and the bandwidth capacity of each physical link is 3 units. Note that each VNF occupies 1 unit of IT resources, and each traversed link occupies only 1 unit of bandwidth resources. The online time of each request follows a uniform distribution, and the arrival times follow a Poisson distribution. There are 5 VNF types; some servers can support all 5, but not every server supports every VNF type. Each VNF instance consumes one unit of IT resources, and each unit can serve one user. The number of VNFs per user request in an SFC is normally distributed over [2, 4]. To compare the proposed algorithm with existing algorithms, we implement the algorithm in [14], which has a high success rate due to its use of ILP. Although that algorithm is optimized for time, it still requires considerable time; thus, it is not shown on the time comparison graph. We also implement the algorithm in [28], which has good execution efficiency.
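As a rough illustration, the request workload described above might be generated as follows. The arrival rate, the holding-time bounds, and the uniform draw standing in for the VNF-count distribution over [2, 4] are illustrative assumptions, not the paper's exact values:

```python
import random

def generate_requests(n, arrival_rate=1.0, min_hold=5.0, max_hold=15.0, seed=42):
    """Generate n SFC requests in the spirit of the simulation setup:
    arrivals follow a Poisson process (exponential inter-arrival times),
    online time is uniform, and each request needs 2-4 VNFs
    (a uniform draw here stands in for the paper's distribution)."""
    rng = random.Random(seed)
    t, requests = 0.0, []
    for i in range(n):
        t += rng.expovariate(arrival_rate)      # Poisson arrivals
        hold = rng.uniform(min_hold, max_hold)  # uniform online time
        n_vnfs = rng.randint(2, 4)              # 2-4 VNFs per request
        requests.append({"id": i, "t_start": t, "t_end": t + hold,
                         "n_vnfs": n_vnfs})
    return requests

reqs = generate_requests(1000)
print(all(2 <= r["n_vnfs"] <= 4 for r in reqs))  # → True
```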

Performance Metrics
We used the following metrics in the simulation to evaluate the performance of our proposed algorithm. For the dynamic network with limited resources, we selected three sets of data for analysis: the request acceptance ratio, the average service provider profit, and the calculation time per request.
(1) Request acceptance ratio: the ratio of incoming service requests that have been successfully deployed on the network to all incoming requests, denoted A.
(2) Average service provider profit: the total profit earned by the service provider after processing the input service requests, averaged per request and denoted K.
(3) Calculation time per request: the decision time required before each SFC is deployed, denoted C.
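The metric equations were lost in extraction; the following sketch reconstructs them directly from the prose definitions (acceptance ratio as accepted over total; profit and computation time averaged per request). The field names, and the choice to average K over all requests rather than accepted ones, are assumptions:

```python
def evaluate(requests):
    """Compute the three metrics from a list of processed requests, where each
    request is a dict with 'accepted' (bool), 'profit' and 'calc_time'."""
    total = len(requests)
    accepted = sum(1 for r in requests if r["accepted"])
    A = accepted / total                                  # acceptance ratio
    K = sum(r["profit"] for r in requests) / total        # avg profit
    C = sum(r["calc_time"] for r in requests) / total     # avg decision time
    return A, K, C

log = [{"accepted": True, "profit": 3.0, "calc_time": 0.2},
       {"accepted": True, "profit": 5.0, "calc_time": 0.4},
       {"accepted": False, "profit": 0.0, "calc_time": 0.3}]
print(evaluate(log))  # A = 2/3, K = 8/3, C = 0.3
```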

Simulation Results and Analysis
We divide the experiment into two parts and present and analyze the results separately. The first part compares and analyzes the performance of the QLFHM algorithm and the two selected benchmark algorithms in the simulated network. The second part compares and discusses some parameters that affect the performance of the QLFHM algorithm and demonstrates its flexibility and modular capabilities.
We obtained each data point by averaging the results of multiple simulations. We executed the simulations on an Ubuntu virtual machine running on a computer with a 3.7 GHz Intel Core i3-4170 and 4 GB of RAM. The algorithm models were coded in Python.

Performance Comparison in a Dynamic Network
This section provides comparison results from simulating the algorithm proposed in this paper and the two other selected algorithms [14,28] on the dynamic SFC deployment problem. Three sets of data were selected for analysis: the request acceptance ratio, the average service provider profit, and the calculation time required for each request.
Figure 6 shows a comparison of the request acceptance ratios achieved by the three algorithms. As Figure 6 shows, when the number of requests is less than 400, the request acceptance ratio is unstable due to insufficient data. However, the QLFHM and CG algorithms always achieve better results than the Viterbi algorithm. The request acceptance ratio of the CG algorithm differs slightly because more than one optimal path exists in some cases, and the two algorithms use different path selection strategies. After the algorithms select different paths, the overall situation also differs, resulting in some overall differences. After the number of requests exceeds 400, the request acceptance ratios of the three algorithms stabilize; at that point, the request acceptance ratio of the QLFHM algorithm is roughly the same as that of the CG algorithm using ILP. This result demonstrates that the deployment success ratio of the QLFHM algorithm is higher than that of the Viterbi algorithm and is close to the optimal solution at any request scale.

Figure 7 shows a comparison of the service provider profits achieved by the three algorithms. There is almost no profit difference between the QLFHM and CG algorithms. However, as the number of requests increases, the profit difference between the QLFHM algorithm and the Viterbi algorithm gradually increases. Because the CG algorithm uses ILP, its deployment scheme is close to optimal; the QLFHM algorithm obtains results that differ little from CG, indicating that its deployment scheme is also close to optimal. We are confident that under larger numbers of requests, the profit obtained using the QLFHM algorithm will not be less than that obtained by the other two algorithms.

In Figure 8, we compare the average operation times of only two algorithms. We know from [14] that the operation time of the CG algorithm is much longer than that of the other two; therefore, the CG algorithm's performance is not included in the figure. Figure 8 shows that the average operation time of the QLFHM algorithm is approximately 6 times less than that of the Viterbi algorithm. This result indicates that among the three algorithms, the QLFHM algorithm yields the fastest result.

By comparison, we can draw the following conclusion. Under the condition that the output deployment scheme is close to the optimal solution, the QLFHM algorithm provides a result faster than the general heuristic algorithm. The request acceptance ratio and the average service provider profit indicate that the algorithm considers the global network insofar as possible; that is, it better guarantees the load balance.


Effects of the Use Ratio λ
We know that the Q matrix stores several link strategies between any two points in the topology.However, in the actual output process, scoring all the links may not be the best option.Here, we test the proportion of the use of the RL output, which we call the use ratio λ.
There are three parameters x1, x2, x3 associated with λ in step 4 of Algorithm 4. The parameter x1 represents the ratio of the recommended value of the next action in a state.For example, if x1 is set to 0.6, alternative paths can be added if their recommended value is greater than 0.6 multiplied by the maximum recommended value.The parameter x2 limits the number of paths to find.For example, when x2 is set to 100, the algorithm will stop looking after finding 100 candidate paths.The parameter x3 represents the longest length of a single path, depending on the number of VNFs required.
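A sketch of how the three use-ratio parameters might gate the RL output in step 4 of Algorithm 4. The function name and path representation are illustrative assumptions:

```python
def extract_paths(recommended, x1=0.6, x2=100, x3=6):
    """Apply the use-ratio parameters to the RL module's output (sketch):
    x1: keep a path only if its recommended value is >= x1 times the best
    x2: stop after x2 candidate paths have been collected
    x3: discard paths longer than x3 nodes (depends on the VNFs required)

    recommended: list of (path, value) pairs read from the Q matrix
    """
    if not recommended:
        return []
    best = max(value for _, value in recommended)
    out = []
    for path, value in sorted(recommended, key=lambda pv: -pv[1]):
        if value >= x1 * best and len(path) <= x3:
            out.append(path)
        if len(out) >= x2:
            break
    return out

cands = [((0, 1, 2), 1.0), ((0, 3, 2), 0.7), ((0, 1, 3, 2), 0.5),
         ((0, 3, 1, 2, 4, 5, 6), 0.9)]
# The long 0.9-valued path fails x3; the 0.5-valued path fails x1:
print(extract_paths(cands, x1=0.6, x2=10, x3=5))  # → [(0, 1, 2), (0, 3, 2)]
```

Loosening x1 or x3 admits more (and longer) candidates into the scoring stage, which is exactly the trade-off the λ_low / λ_balanced / λ_high scenarios below explore.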
To perform a comparison, we divide λ into three scenarios: low λ, balanced λ and high λ. The use ratio design is listed in Table 2. After completing this analysis, we selected the balanced parameter (which is the λ parameter used in the previous section). We used the same three metrics (request acceptance ratio, average service provider profit, and computation time per request) for analysis.
From Figure 9, we can see the request acceptance ratio in the three λ states. When the number of requests is less than 400, the acceptance ratios of the three λ states are not particularly stable; the fluctuations of λ high and λ low are high, while the fluctuation of λ balanced is much lower than those of the other two λ states. When the request number is greater than 400, the acceptance ratios of all three λ states stabilize, but the acceptance ratio of λ balanced is approximately 10% higher than those of the other two states. This is because when the λ state is λ high, some longer paths may obtain high scores; they will then be selected for deployment, occupy more bandwidth resources, and affect other SFC deployments. When the λ state is λ low, the number of candidate paths is insufficient, and the optimal choice cannot be found.

In Figure 10, we compare the service provider's profits in the three λ states. The profit difference among the three states is not obvious when the number of requests is low; however, as the number of requests increases, the profits in the λ high and λ low states remain close and their slopes are approximately the same. In contrast, the profit of λ balanced increases at a greater slope, widening the gap between its profits and those of the other λ states. This result is related to the deployment success rate and the length of the deployed SFCs. We are confident that when the λ state is λ balanced, the profit obtained by this algorithm will be larger than the profits obtainable in the other two λ states.

Comparison of Training Time
In this subsection, we compare the operation times of the original and optimized Q-learning training algorithms. We increase h_max so that the paths to be found become longer and the number of paths increases, which corresponds to the case in which the network topology keeps growing. Taking the convergence of the Q matrix as the termination time, the comparison results are shown in Figure 12.
From Figure 12, we can see that the operation times are not of the same order of magnitude, and the gap grows with h_max. This is because the number of trial-and-error training iterations of the RL algorithm is very large. Taking h_max as 7, for example, there are 10,000 iterations every 5 s, and 2,252,000 iterations are needed for the Q matrix to converge. When h_max is larger, achieving convergence of the Q matrix is even more difficult. From the comparison results, it can be seen that in any practical situation, the optimized algorithm consumes far less time.

Conclusions

The proposed approach first obtains a routing scheme through the RL module. Then, it uses the load balancing module to select the optimal solution from several candidate schemes output by the RL module. The improved learning algorithm improves the efficiency of addressing this specific type of problem; it not only capitalizes on the decision-making advantages of RL but also avoids a lengthy training process. Finally, we conducted extensive simulation experiments in a simulated network environment to evaluate the performance of our proposed algorithm. The experimental results show that the performance of the proposed QLFHM algorithm is superior to that of the benchmark algorithms CG and Viterbi when processing service requests. We are confident that while the QLFHM algorithm ensures the security of user data, its performance advantages are reflected in the decision time, load balancing, deployment success rate and deployment profit when deploying SFCs.
In future work, we will carry out further related research, such as the migration of virtual machines hosting the deployed VNFs [38], the energy-saving operation of servers in the network [39], and the decentralization of resource allocation controllers [40], to extend our current study.

Figure 3. Dynamic process of the Markov decision-making process (MDP).

Figure 4. Diagram of r with a and s.

Algorithm 4. Q-learning decision-making process
1: read the trained matrix Q
2: read the user request list RE
3: For every re in RE do
4: …

Algorithm 5. The load balancing scoring process
1: read the information from G
2: read the candidate list CA
3: For every pa in CA do
4: …

Figure 6. Comparison of the request acceptance ratio.

Figure 7. Comparison of service provider average profit.

Figure 8. Comparison of computation time per request.

Figure 9. Comparison of the request acceptance ratio among different λ states.


Figure 10. Comparison of service provider average profit among different λ states.

Figure 11 shows a comparison of operation times under the three λ states. The average operation time in the λ low and λ balanced states is largely stable, while some fluctuation occurs in the λ high state. The value of λ determines the number of candidate paths that participate in the scoring stage; consequently, the operation time is proportional to the value of λ. However, as shown, the average operation time of λ balanced is closer to that of λ low and better matches the desired time efficiency.

Figure 11. Comparison of computation time per request among different λ states.

Table 2. The design of the use ratio λ.