Multi-Path Routing Algorithm Based on Deep Reinforcement Learning for SDN

Software-Defined Networking (SDN) enhances network control but faces Distributed Denial of Service (DDoS) attacks due to centralized control and flow-table constraints in network devices. To overcome this limitation, we introduce a multi-path routing algorithm for SDN called Trust-Based Proximal Policy Optimization (TBPPO). TBPPO incorporates a Kullback–Leibler (KL) divergence trust value and a node diversity mechanism as its security assessment criteria, aiming to mitigate issues such as network fluctuations, low robustness, and congestion, with a particular emphasis on countering DDoS attacks. To avoid routing loops, unlike the conventional 'Next Hop' routing decision methodology, we implement an enhanced Depth-First Search (DFS) approach that pre-computes a set of candidate paths from which the best path is selected. To optimize routing efficiency, we introduce an improved Proximal Policy Optimization (PPO) algorithm based on deep reinforcement learning, which optimizes multi-path routing while considering security, network delay, and variations in multi-path delays. In the Germany-50 evaluation, TBPPO outperforms traditional methods, reducing average delay by 20%, cutting delay variation by 50%, and leading in trust value by 0.5. TBPPO thus provides a practical and effective solution for enhancing SDN security and routing efficiency.


Introduction
Communication networks have advanced swiftly across mobile communication, the Industrial Internet of Things (IIoT), and emerging internet technologies, producing a varied spectrum of network types that shape the current communication landscape. The interplay and coexistence of these networks have grown crucial: their deployment in unattended or hostile settings and their inherent openness render them vulnerable to intrusions by malicious actors. Moreover, the proliferation of mobile devices connecting to networks has challenged traditional network access methods and policies, necessitating the refinement of routing strategies to meet dynamic information transmission needs. One promising solution to these evolving challenges is Software-Defined Networking (SDN). SDN is a flexible and efficient networking approach that offers a global view of the network and enables the selection of optimal paths based on real-time network conditions [1]. Its fundamental characteristic is the separation of the control plane from the data plane, where a centralized control function maintains the network's state and issues instructions to data plane devices [2]. This architecture provides clear security advantages, allowing for real-time analysis and correlation of network feedback. Owing to SDN's programmability, SDN-based programmable routing has also become quite popular in recent years [3][4][5][6]. However, SDN also presents concerns related to open programmability and trust between network elements, particularly when using technologies like OpenFlow. OpenFlow, the most commonly used SDN technology, is a focal point of security analysis: applying the STRIDE threat analysis method to OpenFlow reveals vulnerabilities, including susceptibility to Denial of Service (DoS) attacks [7]. DoS attacks on SDN can exhaust controller resources by flooding the controller with fake flow-table requests, and can disrupt communication by targeting the southbound interface between the controller and network devices. To address these security threats, various technologies such as intrusion detection, authentication, and access control have been used [8]. However, these are typically passive measures against specific vulnerabilities; it is crucial to include proactive security measures in network system design to enhance immunity against threats.
Recently, trust management-based security mechanisms have been proposed [9][10][11][12]. Trust levels are assigned to network nodes based on direct and prior observations, guiding decisions in future interactions [13]. This approach enhances network security by avoiding nodes with security vulnerabilities during route discovery [14][15][16][17]. Although trust-based methods show potential, their complexity, when coupled with deep learning, hinders real-time applications. Meanwhile, the traditional Dijkstra shortest-path algorithm [17] suffers from slow convergence and poor responsiveness, causing network congestion, especially in dynamic environments with increasing traffic [18].
Machine learning, particularly deep reinforcement learning (DRL), has gained prominence due to its exceptional performance in data processing, classification, and intelligent decision making. This has led to the development of several popular algorithms, including the Deep Q Network (DQN) [19], Actor-Critic (AC) [20], Deep Deterministic Policy Gradient (DDPG) [21], and Trust Region Policy Optimization (TRPO) [22]. More recently, researchers have explored integrating deep reinforcement learning into SDN routing to achieve intelligent routing and fine-grained management [23][24][25][26]. In their pioneering work, Casas et al. [25] proposed a DQN-based DRL scheme to handle dynamic traffic changes in SDN network routing. Likewise, Alkhalaf et al. [26] introduced a Proximal Policy Optimization (PPO)-based deep reinforcement learning technique to improve SDN network routing, allowing for real-time intelligent control and administration. However, these efforts often neglect security considerations in real-time communication. Therefore, there is a pressing need for an efficient deep learning algorithm that offers strong privacy protection, real-time communication capabilities, and high efficiency.
In this study, a novel multi-path routing approach called Trust-Based Proximal Policy Optimization (TBPPO) is introduced for SDN. TBPPO leverages a trust value mechanism based on Kullback-Leibler (KL) divergence [27] and a node diversity assessment mechanism to enhance SDN robustness, address congestion problems, and notably fortify defenses against Distributed Denial of Service (DDoS) attacks. Furthermore, an improved PPO algorithm is tailored for optimizing multi-path routing in SDN while also considering security, network delay, and variations in multi-path delays. This advancement is crucial for improving SDN routing optimization. To enhance computational efficiency, an enhanced Depth-First Search (DFS) algorithm is incorporated, simplifying the action space. The experimental results validate TBPPO's superior performance in terms of convergence and overall effectiveness when compared to traditional methods. The contributions of this work can be summarized as follows:
1. We present TBPPO, a multi-path routing algorithm designed specifically for SDN. This innovative approach integrates a trust value mechanism based on KL divergence and node diversity as key security assessment criteria. TBPPO addresses network fluctuations, enhances robustness, and mitigates congestion issues, with a primary focus on countering DDoS attacks.
2. We present an enhanced PPO algorithm to optimize security and efficiency in SDN routing. This novel algorithm optimizes multi-path routing, considering factors such as security, network delay, and variations in multi-path delays. To address the recurrent gradient explosion issues observed during the experimental process, we introduced learning rate decay and layer normalization. Furthermore, we incorporated a trust value-based routing selection approach, resulting in enhanced security stability and reduced delay.
3. To avoid routing loops, we abandoned the traditional Next Hop routing mechanism and adopted a path selection approach. The improved DFS chooses a set of promising routes, called routing groups, and we then search for the best multi-path route within these routing groups. This enhancement significantly reduces packet loss and improves the overall efficiency and practicality of the algorithm.
4. Finally, we conducted experiments using the NSFNet-14 and Germany-50 network topologies. The results demonstrate that our TBPPO technique outperforms traditional approaches in terms of both convergence and overall performance. Additionally, the findings emphasize TBPPO's potential as an effective solution for improving SDN security and routing efficiency.
The subsequent sections of this paper are organized as follows: Section 2 provides an overview of related works. Section 3 offers a detailed description of the system model, encompassing crucial elements such as the SDN network environment. Section 4 delves into the design and specific details of the TBPPO algorithm, including its operational principles and key algorithms. Section 5 presents simulation results and corresponding analyses to validate and evaluate the performance of the TBPPO algorithm in various scenarios. Section 6 summarizes the experimental conclusions.

Trust-Based Security Mechanisms
Trust-based security mechanisms have gained popularity in Wireless Sensor Networks (WSN) and the Internet of Things (IoT) [28][29][30][31][32][33]. These mechanisms focus on ensuring the privacy and security of network nodes, and several trust-based algorithms have been introduced in recent studies. Ashwin et al. [34] introduced a weighted clustering trust model algorithm that demonstrated significant improvements in identifying malicious nodes. Rajeswari et al. [35] proposed a trust-based next-hop node selection algorithm, which improved network performance in terms of data packet transmission, delay, and error rates. Zhang et al. [36] presented a cloud-based trust evaluation approach that is sensitive to various attacks and capable of improving malicious node detection accuracy. Mingwu et al. [37] introduced trust entropy and a standard structural entropy mechanism for detecting malicious behavior in sensor systems. Subhash et al. [38] utilized machine learning to include parameters such as friendship and community interest, shedding light on the evolution of trust in an entity over time. In another study, Subhash et al. [39] employed a heuristic approach based on machine learning to amalgamate trust-related attributes, successfully distinguishing between trustworthy and untrustworthy nodes within the network. Claudio et al. [40] introduced an incremental Support Vector Machine (iSVM) method for simulating various attack patterns, outperforming other methods. Besat et al. [41] used the K-Nearest Neighbors (KNN) algorithm to detect selfish behavior in entities effectively. Wafa et al. [42] employed machine learning methods for trust parameter aggregation and a mixed propagation approach to classify users and detect attack types. However, many of these methods have high computational complexity and are unsuitable for real-time communication scenarios when combined with deep learning, and they did not further investigate the application of security mechanisms in route planning. Currently, no existing solution simultaneously considers low delay, high security, and computational efficiency for multi-route planning. Our work intends to fill this gap.

Reinforcement Learning in SDN
SDN has shifted traditional network operations from hardware to software, both simplifying and integrating the functionality of the network control plane while also enhancing the reliability of network hardware devices. With the continuous growth in network complexity and traffic demands, traditional shortest-path algorithms suffer from drawbacks such as slow convergence and network congestion. Researchers have therefore turned to machine learning, specifically deep reinforcement learning, to optimize SDN route selection. Compared to traditional methods, DRL may introduce some additional overhead. Deep learning models have a large number of parameters, requiring more storage space to save and load the model. DRL models also need to be trained, which might require a significant amount of time and data, whereas traditional methods are usually based on fixed algorithms and do not require a training process. DRL might experience unstable phases when learning and adapting to new environments, which could lead to short-term instability in routing policies, and some DRL approaches use an experience replay mechanism, necessitating the storage and processing of vast amounts of historical data. Despite these overheads, DRL remains attractive for SDN routing decisions because it can learn and adapt to complex network environments and often outperforms traditional methods in many scenarios. Various studies have explored the application of deep reinforcement learning in SDN route optimization, focusing on intelligent routing, customization, and fine-grained management. Table 1 presents a succinct overview of recent research employing DRL in SDN.
The table summarizes the learning techniques applied, the formulation of actions, and the criteria used for assessment. In [43,44], the authors employed the RL method Q-Learning, which relies on Q-tables and demands significant memory, data, and time resources. Conversely, refs. [25,45,49] used the DRL method DQN, and [46] employed the Dueling Double DQN method, which optimizes DQN. These methods utilize deep learning networks to approximate values instead of Q-tables, making them more practical and scalable than Q-Learning. Additionally, refs. [26,47] used the Advantage Actor-Critic (A2C) and PPO methods, respectively, which employ policy gradients instead of value approximation and can improve convergence compared to DQN. In [26,43–46,48], the choice of the next hop is used as the action. These studies reveal that this approach fulfills end-to-end performance requirements only when the appropriate next hop is selected; an incorrect choice can lead to significant performance degradation and even routing loops. In contrast, ref. [47] employs the adjustment of graph neural network parameters as the action, which requires reinforcement learning assisted by a specific graph neural network, making it less generalizable. Furthermore, the takeover decision of the switches was employed as the action in [48] to adaptively mitigate the propagation of attacks, thereby enhancing the resilience of Software Defined Industrial Networks (SDIN). The authors in [25] select routes from a preselected route group, which eliminates the possibility of routing loops and provides stable connections. However, as the number of nodes increases, the count of preselected paths may grow rapidly, introducing security vulnerabilities. Therefore, there is a compelling need to devise high-efficiency DRL algorithms capable of addressing these concerns. Moreover, unlike the work described in Section 2.1, these studies only considered single-route optimization without taking multi-route performance into account. More critically, they did not consider security performance in a DDoS attack environment, thereby overlooking the potential adverse impacts, such as network congestion, that security, as a crucial factor, might bring to the network. In contrast to prior research, our emphasis is on redressing the imbalance in recent studies, which prioritized network performance but overlooked vital aspects such as privacy, security, and real-time communication delay. We introduce an enhanced PPO algorithm tailored for optimizing multi-path routing in SDN, with a significant focus on security, network delay, and variations in multi-path delays.

Network Module
The overall objective of the system is to find a given number of paths within a given network topology while achieving optimal performance in terms of delay, delay variation, security, and other comprehensive factors. Specifically, for a given network, its topology is represented as G = <V, E>, where V = {v_1, v_2, ..., v_N} is the set of all vertices and E = {e_1, e_2, ..., e_M} is the set of all edges. The network topology G is represented using two adjacency matrices, M_topo and M(τ). M_topo is the initial adjacency matrix that includes delay information and is used for preliminary exploration in the depth-first algorithm. M(τ) is an adjacency matrix that represents the connectivity of links and their corresponding delays; it serves as the input state for the deep reinforcement learning component. For nodes v_i and v_j, if there is a connecting edge e_x with a current delay of d, then e_x = e_{i,j} = d. Conversely, if there is no connecting edge, or if i = j, then e_{i,j} = −1. In this paper, the policy-based network topology optimization problem is transformed into a maximization problem of the objective function:

max_θ J(θ) = E_{τ∼π_θ}[R(τ)],

where θ represents the parameters of the deep reinforcement neural network, J(θ) represents the objective function of the neural network, and E_{τ∼π_θ}[R(τ)] is the expected reward R at time τ, following the distribution of the neural network policy π_θ.
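To illustrate the topology encoding described above, the following sketch (with hypothetical helper names, not the paper's implementation) builds a delay adjacency matrix in which −1 marks absent edges and the diagonal:

```python
def build_delay_matrix(n_nodes, edges):
    """Build an adjacency matrix M where M[i][j] holds the current delay of
    edge (i, j) and -1 marks 'no edge' (including the diagonal i == j)."""
    m = [[-1.0] * n_nodes for _ in range(n_nodes)]
    for i, j, delay in edges:
        m[i][j] = delay
        m[j][i] = delay  # links are treated as bidirectional in this sketch
    return m

# Toy 4-node topology: (node_i, node_j, delay)
edges = [(0, 1, 2.0), (1, 2, 3.5), (2, 3, 1.0), (0, 3, 6.0)]
M_topo = build_delay_matrix(4, edges)
```

A refreshed matrix M(τ) would be produced the same way from the delays measured in window τ.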

Attack Module
The system aims to assess the vulnerability of a network over a series of time windows T in the presence of external attacks characterized by the strategy π_att. Each node in the network, denoted as v_i, is assigned a probability of being attacked, represented as p(v_i) = π_att(v_i). Within each time window τ, node v_i generates a set of send-receive records that, to some extent, indicate the node's susceptibility to attacks:

Group(τ, v_i) = Generator_{π_att}(τ, v_i),

where Generator_{π_att} represents the mapping of the external environment generating attack records using the strategy π_att, Group(τ, v_i) represents the records of node v_i at time τ, and Group(τ) represents the record group of all nodes at time τ. Given the time window τ and using M(τ) to represent the delay matrix, the changes in the delay matrix caused by Group(τ, v_i) are described by the mapping relationship Trans as follows:

M(τ) = Trans(M(τ − 1), Group(τ)).
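The record-generation step can be mimicked with a toy generator. The record structure used here (an `attacked` flag plus a record count) is purely illustrative; the paper's actual record format is defined in the Security Module:

```python
import random

def generate_attack_records(attack_probs, tau, seed=0):
    """Toy stand-in for Generator_pi_att: in time window tau, each node v is
    attacked with probability attack_probs[v]; attacked nodes emit a burst of
    records, benign nodes only a small baseline."""
    rng = random.Random(seed + tau)  # deterministic per window for repeatability
    group = {}
    for v, p in attack_probs.items():
        attacked = rng.random() < p
        n_records = rng.randint(200, 500) if attacked else rng.randint(5, 20)
        group[v] = {"attacked": attacked, "n_records": n_records}
    return group

# Node 0 is very likely under attack, node 1 mostly benign
group = generate_attack_records({0: 0.9, 1: 0.05}, tau=1)
```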

Deep Reinforcement Learning Module
Deep reinforcement learning involves two main components: the environment and the agent. The agent selects an action according to the environment state and receives the reward and the next state from the environment. The main goal of a DRL algorithm is to train the agent to select the actions that maximize its rewards. The environment for DRL is modeled as a Markov decision process (MDP), so DRL can only solve MDP problems. DRL methods are divided into off-policy and on-policy methods. Off-policy methods collect experiences with one strategy and use those experiences to improve a different target strategy, whereas on-policy methods collect experiences with the current policy and use them to improve that same policy. Because on-policy methods cannot reuse the experiences of old policies, they must interact with the environment more frequently to collect experiences, which lowers learning efficiency; however, because updates follow a consistent policy, they are more effective and the decision-making stability of the new policy is higher. A typical off-policy method for SDN is DQN [19]; typical policy gradient (PG) methods include AC [20], DDPG [21], and PPO [26]. Compared to DQN, PPO has greater stability, which makes it perform better on complex problems. The transition from traditional policy gradient algorithms like DDPG and AC to PPO is driven by the quest for stability and efficiency: while DDPG excels in continuous action spaces, it is sensitive to hyperparameters, and AC variants that leverage asynchronous updates for diverse data face challenges in distributed training. PPO, on the other hand, employs a clipping mechanism to prevent large policy updates, offering a more stable and consistent learning experience. This often allows PPO to outperform DDPG and AC without intricate tuning or asynchronous complexities. Additionally, PPO usually requires less hyperparameter tuning, making it more convenient than DQN. In this work, we chose the on-policy PPO algorithm as our decision-making module and propose TBPPO. We provide specific descriptions of the algorithm in Section 4.5.

Trust-Based Proximal Policy Optimization (TBPPO)
This section introduces the TBPPO algorithm, a multi-objective, multi-path routing planning algorithm that provides a secure multi-route scheme designed to deliver low delay for real-time communication. The algorithm includes a preprocessing module based on DFS, a trust value calculation module using KL divergence, a Markov process transition mechanism, and a deep reinforcement learning decision module based on PPO. Figure 1 illustrates the overall workflow of the algorithm.

In this algorithm, the data is divided and transmitted over multiple routes, dispersing traffic across several paths to increase the effective bandwidth of the network and allowing multiple connections to work in parallel. If one path fails, the traffic can automatically switch to another, thereby enhancing network reliability. This method also takes security into account: by dispersing traffic across multiple routes, it requires an attacker to compromise several paths at once, raising the complexity and difficulty of network attacks. Our strategy accounts for real-time communication and security along each route, crafting a balanced approach that weighs various metrics including average delay, delay variation, KL trust value, and node diversity.
Figure 1 depicts the structure of our algorithm. For each time interval, the control plane gathers the current network topology, DDoS attack logs, and node types from the data plane. After processing via the trust value module and the DFS module, these features are concatenated to form a state that enters the DRL module. Within this module, the critic network calculates Q-values while the actor network determines the action. This action is then executed on the data plane, and the experience is stored in the replay buffer. Upon completion of an episode, experiences are extracted from the replay buffer in sample batches to update the critic network before proceeding to the next episode.

The Improved DFS Module
Most deep reinforcement learning algorithms are designed around Next Hop routing policies, as indicated in [44][45][46][47]. However, such policies are prone to routing loops, causing longer delays and some degree of packet loss [50], and these issues have not been adequately addressed in the aforementioned studies. To tackle these challenges, we introduce a path selection approach. Because the number of available paths in a network increases exponentially with the network's scale, which can be computationally prohibitive, we limit the number of paths: the path selection approach is used in conjunction with the DFS algorithm to pre-select the K best paths, effectively sorting and selecting the optimal K paths from the available options. The initial topology structure M_topo, containing delay information, is used as the input. The K best paths are computed to form the preselected routing group, and the delay values associated with these preselected routes replace the delay matrix M(τ) as part of the state representation. This preprocessing step considerably reduces the computational complexity of our algorithm and can be expressed as:

L = DFS_K(M_topo),

where M_topo is the matrix topological structure, L represents the set of paths generated after the application of the DFS algorithm, and its cardinality is denoted as L_0. DFS_K signifies the utilization of an enhanced DFS algorithm to select the shortest K routes.
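The pre-selection step can be sketched as follows; `k_best_paths` is a hypothetical name, and this plain DFS enumeration stands in for the paper's enhanced DFS:

```python
def k_best_paths(m_topo, src, dst, k):
    """Enumerate loop-free src->dst paths by DFS over the delay matrix
    (entries of -1 mean 'no edge'), then keep the K paths with the smallest
    total delay -- i.e., the preselected routing group L = DFS_K(M_topo)."""
    n = len(m_topo)
    paths = []

    def dfs(node, visited, path, delay):
        if node == dst:
            paths.append((delay, list(path)))
            return
        for nxt in range(n):
            if m_topo[node][nxt] >= 0 and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(nxt, visited, path, delay + m_topo[node][nxt])
                path.pop()
                visited.remove(nxt)

    dfs(src, {src}, [src], 0.0)
    paths.sort(key=lambda p: p[0])  # cheapest first
    return paths[:k]

# 4-node topology; -1 marks missing edges
M = [[-1, 2, -1, 7],
     [2, -1, 3, -1],
     [-1, 3, -1, 1],
     [7, -1, 1, -1]]
best = k_best_paths(M, 0, 3, 2)  # two cheapest loop-free 0->3 paths
```

Because the visited set forbids revisiting a node, every returned path is loop-free by construction, which is the property motivating the move away from Next Hop decisions.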

Security Module
DDoS attacks represent a prevalent and disruptive network threat. These attacks involve malicious nodes intentionally discarding messages from legitimate nodes, causing severe disruptions to data transmission. What makes these attacks particularly insidious is that malicious nodes can mimic legitimate behavior when not receiving data packets, making them hard to detect and highly destructive [51][52][53]. To address the challenge of DDoS attacks and establish trust levels for network nodes, we introduce KL divergence as a trust mechanism to counter network attacks. Initially, specific features are carefully selected as the KL benchmark, drawing from normal network operation records; the choice of these features can be tailored to suit specific circumstances. The records are configured to combat DDoS attacks and take the following form:

Group(τ, v) = {SIP, DIP, SYN, DP},

where SIP, DIP, SYN, and DP represent the frequencies of the source IP, the destination IP, the SYN packets, and the different destination ports in the records of node v. At the end of each predefined time interval τ, we compute the KL divergence between the records of the node within that time frame and the baseline. A higher KL divergence value indicates a higher probability that the node is currently under attack, and consequently a reduced security level. The trust value Trust(τ, v_i) of node v_i at time τ is derived from the divergence between Group(τ, v_i) and the KL baseline Group(0, v_i), with Trust(τ) denoting the trust values of all nodes at time τ. The term KL refers to the enhanced iterative KL divergence calculation formula, which is enhanced from the equation in [23] and defined as:

KL(P, Q) = Σ_{i=1}^{I} β_i P(i) log(P(i)/Q(i)),

where I represents the total data volume in the records. Note that the size of I may differ among nodes due to the varying number of destination ports. Moreover, β_i signifies the weights assigned to the different parameters, allocated according to mechanisms tailored to different attack models. Next, we introduce a novel node-type mechanism to enhance the management of the various nodes within the system. Currently, different Internet Service Providers (ISPs) are responsible for handling distinct nodes.
Appl. Sci. 2023, 13, 12520

However, depending heavily on a single ISP can lead to critical security issues, including server outages, attacks on ISPs, and potential data breaches by ISPs [54]. In response to these challenges, we present a solution that diversifies the selection of multiple paths to reduce dependence on any single ISP, aiming to minimize security vulnerabilities associated with ISP-related issues. We design a metric, Node Diversity, defined as the ISP variance of an individual route, to describe the node diversity of that route. In this context, the variable t represents different ISPs, and the Node Diversity of a single path is formulated as:

Diversity(a_l) = Var_{t∈T}(Count_{a_l}(t)),

where t represents the node type currently under consideration. For each t belonging to the set T, Count_{a_l}(t) signifies the number of nodes of ISP t along route a_l. Here, a_l represents the l-th route within the currently selected route group A(τ), and T denotes the set encompassing all node types. In the following, we use the notation δ_{v_i} to represent the node type of any given node v_i, expressed as δ_{v_i} = t, and we refer to the overall node-type configuration of all nodes as δ.
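A minimal sketch of the two security metrics, assuming a weighted KL of the form Σ β_i p_i log(p_i/q_i) and an illustrative trust mapping exp(−KL); the paper's exact mapping from divergence to trust value is not reproduced here:

```python
import math

def kl_trust(baseline, current, betas):
    """Weighted KL divergence between a node's baseline feature frequencies
    (SIP, DIP, SYN, DP) and its current-window frequencies; a larger
    divergence yields a lower trust. Trust = exp(-KL) is an illustrative
    mapping, not the paper's exact formula."""
    kl = 0.0
    for key, beta in betas.items():
        p, q = baseline[key], current[key]
        kl += beta * p * math.log(p / q)
    return math.exp(-kl)

def node_diversity(route, isp_of, isp_set):
    """Variance of per-ISP node counts along one route: a lower variance
    means the route spreads its nodes across providers more evenly."""
    counts = [sum(1 for v in route if isp_of[v] == t) for t in isp_set]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

# Illustrative feature frequencies; the SYN-heavy profile mimics a flood
baseline = {"SIP": 0.25, "DIP": 0.25, "SYN": 0.25, "DP": 0.25}
normal   = {"SIP": 0.25, "DIP": 0.25, "SYN": 0.25, "DP": 0.25}
attacked = {"SIP": 0.05, "DIP": 0.05, "SYN": 0.80, "DP": 0.10}
betas = {"SIP": 1.0, "DIP": 1.0, "SYN": 1.0, "DP": 1.0}
```

Under this mapping, an unchanged profile keeps trust at its maximum, while the SYN-heavy profile drives trust down, matching the intent described above.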

Markov Process Transition Module
After employing the trust value mechanism relying on the KL divergence for transformation, the SDN control plane becomes capable of presenting the network's real-time status, which encompasses information such as the delay matrix, node categories, and node trust values. The problem is subsequently structured as a Markov Decision Process (MDP) characterized by a four-tuple (S, A, P, R): S represents the state space, A the action space, P the probability distribution function, and R the reward function, which is composed of five components: the delay reward, delay variation reward, KL trust value reward, node diversity reward, and node redundancy reward. The state is composed of three components at time τ: the delay values of the selected route group L, represented as D_L(τ); the trust values of each node at time τ, denoted as Trust(τ); and the types of each node, referred to as δ. This can be formulated as:

S(τ) = {D_L(τ), Trust(τ), δ}.

The action space A has a size of L_0, which corresponds to the size of the preselected route group L. A(τ) is a subset of L consisting of L routes, where the l-th route is denoted a_l:

A(τ) = {a_1, a_2, ..., a_L} ⊆ L.

The probability distribution P is based on the current policy π_{θ_{τ−1}}, representing the probabilities of different actions for a given state S(τ). For any time window τ, the current policy π_{θ_{τ−1}} with parameters θ generates an action A(τ) according to the probabilities in P:

A(τ) ∼ π_{θ_{τ−1}}(·|S(τ)).
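The state assembly described above amounts to concatenating the three components into one feature vector for the policy network; the helper name and toy values below are hypothetical:

```python
def build_state(route_delays, trust, node_types):
    """Flatten the three MDP state components -- preselected-route delays
    D_L(tau), per-node trust values Trust(tau), and node types delta --
    into a single feature vector S(tau) for the actor/critic networks."""
    return list(route_delays) + list(trust) + list(node_types)

# Two preselected routes, four nodes: delays, trust values, ISP type ids
state = build_state([6.0, 7.0], [0.9, 1.0, 0.4, 0.8], [0, 1, 0, 1])
```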

Reward Module
The reward mechanism's role is to assign rewards to the network with respect to multiple objectives. By adjusting the weighting factors α_o and the rewards r_o associated with the different objectives, the system can guide the network's learning process. The reward R(τ) obtained by the network in the τ-th time window is calculated as:

R(τ) = Σ_{o∈O} α_o r_o(τ),

where α_o represents the weight of reward type o, and r_o(τ) represents the reward value of type o obtained during time window τ. O represents the set of reward types:

O = {delay, variation, safety, diversity, repeat},

where delay represents the average delay, variation the delay variation, safety the route trust value, diversity the node diversity, and repeat the node redundancy. The delay reward is defined via the average delay of the current path group A(τ):

r_delay(τ) = (1/L) Σ_{l=1}^{L} D_{M(τ)}(a_l), where D_{M(τ)}(a_l) = Σ_{e∈a_l} d_{M(τ)}(e),

with r_delay(τ) the delay reward at time τ, D_{M(τ)}(a_l) the delay of route a_l under M(τ), and d_{M(τ)}(e) the delay of edge e under M(τ). The delay variation reward r_variation(τ) is the average delay variation over the paths in the current path group A(τ), where mean indicates the calculation of the mean value. The KL trust reward is defined as the mean, over the paths in the current path set A(τ), of the minimum node trust value along each path:

r_safety(τ) = mean_l ( min_{v∈a_l} Trust(τ, v) ).

The node diversity reward is the average, over the paths in A(τ), of the variance of the number of nodes of each type in each path:

r_diversity(τ) = mean_l ( Diversity(a_l) ),

and the node redundancy reward is determined by the number of nodes that appear repeatedly in the current routing group A(τ).
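A toy version of the weighted reward aggregation, with illustrative (untuned) weights and an assumed sign convention that penalizes delay; none of these values come from the paper:

```python
def total_reward(rewards, weights):
    """Weighted sum R(tau) = sum over objectives o of alpha_o * r_o(tau)."""
    return sum(weights[o] * rewards[o] for o in rewards)

def delay_reward(route_delays):
    """Negative mean delay of the selected route group, so lower delay
    yields a higher reward (the sign convention here is an assumption)."""
    return -sum(route_delays) / len(route_delays)

# Illustrative per-objective rewards and weights alpha_o
rewards = {"delay": delay_reward([6.0, 7.0]), "variation": -0.5,
           "safety": 0.9, "diversity": 0.2, "repeat": -1.0}
weights = {"delay": 1.0, "variation": 0.5, "safety": 2.0,
           "diversity": 0.3, "repeat": 0.2}
R = total_reward(rewards, weights)
```

Tuning the α_o values shifts the policy's emphasis between delay, stability, and security, which is exactly the trade-off TBPPO is designed to balance.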

Enhanced PPO Module
In traditional policy gradient algorithms, policy weights are typically updated by computing the gradient of the objective function and applying it with a step size. However, this update process may overshoot or undershoot. To address these issues, we adopt the PPO algorithm. PPO is a policy gradient method in reinforcement learning that enhances the policy by optimizing a surrogate objective function using stochastic gradients obtained by sampling data from interactions with the environment. It allows multiple small-batch updates, as opposed to a single gradient update for each data sample. The specific PPO variant employed in this research is the PPO-clip algorithm, which relies on a clipping mechanism. To prevent the importance-sampling ratio from exceeding predefined upper or lower bounds, a truncated objective J^{θ_k}_PPO(θ) is applied, which automatically limits the importance-sampling values when they go beyond the specified limits:

J^{θ_k}_PPO(θ) ≈ Σ_{(s_τ, a_τ)} min( (p_θ(a_τ|s_τ) / p_{θ_k}(a_τ|s_τ)) · A^{θ_k}(s_τ, a_τ), clip( p_θ(a_τ|s_τ) / p_{θ_k}(a_τ|s_τ), 1−ε, 1+ε ) · A^{θ_k}(s_τ, a_τ) ).

Here, J^{θ_k}_PPO(θ) assesses the expected cumulative reward (performance) of the policy, θ is the current policy's parameters, and θ_k is the policy parameters at a previous iteration step k; s_τ represents S(τ), and a_τ represents the action taken at time step τ, with a_τ ∈ A(τ). clip( p_θ(a_τ|s_τ) / p_{θ_k}(a_τ|s_τ), 1−ε, 1+ε ) is a clipping function used to limit the ratio to between 1−ε and 1+ε, where ε is a small positive value employed to ensure that policy updates are bounded. A^{θ_k} is the advantage function, which estimates the advantage of taking action a_τ under the parameter set θ_k; using generalized advantage estimation (GAE),

A^{θ_k}_t = Σ_{t'≥t} (γλ)^{t'−t} · δ^V_{t'},

where γ is the discount factor, λ is the GAE parameter, and δ^V_t is the temporal-difference residual

δ^V_t = r_t + γ·V(s_{t+1}) − V(s_t).

We utilize its gradient as the loss function for the parameters θ. In our experiments, we observed that fully connected layers exhibit relatively weak fitting capabilities for the relationship between states and actions and can often lead to gradient explosion. We therefore made some improvements to the PPO algorithm. First, we introduced a learning rate decay technique, which enhances the stability of training in the later stages and improves the training effectiveness. We employed linear learning rate decay, where the Actor network's learning rate decreases linearly from an initial value of 1 × 10⁻⁴ to 0 as the training steps progress:

α(t) = α_0 · (1 − t/T),

where α is the learning rate applied to the parameters, α_0 = 1 × 10⁻⁴ is its initial value, and T is the total number of training steps. By default, PPO uses the ReLU activation function, but our experimental findings suggest that PPO performs more effectively with the Tanh activation function; we therefore replaced ReLU with Tanh.
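The clipped surrogate and GAE computation above can be sketched in a few lines. This is a list-based illustration of the standard PPO-clip recipe, not the paper's batched implementation; a real training loop would vectorize both functions.

```python
import math

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation from TD residuals delta_t^V.
    `values` holds V(s_t) per step; the terminal next-state value is 0."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # delta_t^V
        running = delta + gamma * lam * running           # discounted sum
        adv[t] = running
    return adv

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Average of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    total = 0.0
    for lp, olp, a in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp - olp)                        # importance ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * a, clipped * a)
    return total / len(advantages)
```

The clip keeps each update within a trust region of the behavior policy, which is what allows PPO to reuse a sampled batch for several small gradient steps.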
In the feedforward neural network (FNN), we added a layer normalization (LN) layer, following the formula of [55]:

LN(x) = g ⊙ (x − μ) / √(σ² + ε) + b,

where μ and σ² are the mean and variance of the layer's inputs, ε is a small constant for numerical stability, and g and b are learned gain and bias parameters. After these adjustments, the issue of gradient explosion was significantly alleviated. The pseudocode of the algorithm is given in Algorithm 1 and a flow chart is given in Figure 2.
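The two training tweaks above can be written out directly. This is a minimal sketch: `linear_lr` implements the linear decay from 1 × 10⁻⁴ to 0, and `layer_norm` implements the LN formula elementwise; the function names and the default gain/bias of one/zero are illustrative choices.

```python
import math

def linear_lr(step, total_steps, lr0=1e-4):
    """Actor learning rate decayed linearly from lr0 to 0 over training."""
    return lr0 * (1.0 - step / total_steps)

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """LN(x) = g * (x - mu) / sqrt(sigma^2 + eps) + b, per feature vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    g = gain or [1.0] * len(x)
    b = bias or [0.0] * len(x)
    return [(v - mu) / math.sqrt(var + eps) * gi + bi
            for v, gi, bi in zip(x, g, b)]
```

Normalizing each hidden layer's activations keeps their scale bounded regardless of depth, which is why it helps against the gradient explosion noted above.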

Results and Validation
In this section, we thoroughly examine the details of the experiment and the results to assess the reliability of our proposed system. We compare the performance of the TBPPO algorithm with three other algorithms: DRSIR, PPO, and Dijkstra. Additionally, we explore the performance of TBDRSIR, which integrates the DRSIR algorithm with our trust value mechanism within the network topology.

Experiment Setup
Our experiments were executed on hardware consisting of an Intel(R) Core(TM) i5-9400 CPU running at 2.90 GHz and an NVIDIA GeForce GTX 1660 Ti for training the TBPPO agent. We implemented the AC model using PyTorch and trained it with the Adam optimizer. In our experimental model, the state representation includes several components: the current KL trust values for each node, node types, node utilization, the current average delay, and the number of remaining steps. The action space is constructed from a set of feasible routes calculated using an improved DFS algorithm. The Actor network is a six-layer fully connected neural network, with each layer consisting of 256 neurons and layer normalization applied between layers; the output size of the network matches the action space size, which is set to 5 in our experimental model. Similarly, the Critic network is a six-layer fully connected network with 256 neurons in each layer and layer normalization between layers; its output corresponds to the Q-value, resulting in an output size of 1. For a comprehensive overview of the remaining parameters of our experimental model, please refer to Table 2.
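The Actor head described above can be sketched numerically as follows. This is an illustration of the stated architecture (six fully connected layers of 256 units, layer normalization and Tanh between them, a 5-way output); the Gaussian weight initialization and the softmax output head are assumptions not spelled out in the text.

```python
import numpy as np

def init_mlp(sizes, rng):
    """One (weight, bias) pair per fully connected layer."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes, sizes[1:])]

def actor_forward(params, x):
    """Forward pass: LN + Tanh on hidden layers, softmax on the output."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # layer norm
            x = np.tanh(x)                                # Tanh activation
    e = np.exp(x - x.max())
    return e / e.sum()                                    # action probabilities
```

A state vector of any dimension can be fed in; the final softmax yields a distribution over the 5 candidate route groups in the action space.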

Algorithm 1 (fragment):
for τ = 1, …, T do
    Obtain Group(τ, v_i), M(τ), S(τ) from the environment under the policy π_att.
    Calculate trust values Trust(τ, v_i) for all nodes, using Group(τ, v_i) and Group(0, v_i) as input.


Validation of the KL Trust Mechanism
To assess the effectiveness of the KL trust value mechanism, we conducted experiments using the CICIDS2017 dataset [56], which offers insights into packet exchange patterns within a network over a specific time frame. The validation process involved simulating a botnet network with three key stages: probing, propagation, and launching DDoS attacks. These stages encompassed diverse behaviors, such as creating packets with fixed IP addresses but varying TCP destination port numbers; generating packets with a multitude of uniform source addresses but distinct destination addresses and port numbers; and introducing a significant number of unique source addresses accessing a specific destination IP address. Furthermore, our SDN was inundated with SYN packets whose number was not matched by corresponding ACK packets.
The decision to employ KL divergence as the trust value metric was driven by our objective to evaluate the impact of attack interference on a node's communication capability, as opposed to classifying the type of attack. KL divergence is a suitable measure of the degree to which a node's communication capabilities are affected, making it a pragmatic choice for our purposes. Our methodology involves identifying records in which a node functions as both the source and destination IP without being subjected to any attacks, deeming these records valid for that particular node; we randomly sampled 3000 such records for each test node. Out of the 84 distinctive features within the CICIDS2017 dataset, we classify a node as malicious if it falls into any of the following categories: an attacker node itself, a node with identified port vulnerabilities, a node affected by worm infestations, or a node subject to DDoS attacks. As a result, we established four key features as our classification criteria: source IP frequency, destination IP frequency, SYN packet frequency, and the frequency of distinct destination ports. These criteria were utilized as benchmarks for calculating the KL divergence. It is worth noting that the data volume for the number of distinct ports recorded (DP) may vary across nodes. Therefore, we combined attack records with a portion of normal communication records in varying proportions and subsequently calculated their trust values. The weightings assigned to the parameters, namely SIP (Source IP), DIP (Destination IP), SYN, and DP (Distinct Ports), are detailed in Table 3. In the experiment, several nodes were tested, including one with the IP address 172.16.0.1. The test dataset included a mix of DDoS attack records and regular communication, with proportions of 20%, 40%, and 60%, respectively. The results are presented in Table 4. As can be observed, there is a positive correlation between the KL trust value and the proportion of DDoS attacks. This result demonstrates that the KL trust value can reflect the security performance of a given node.

The TBPPO algorithm was evaluated in comparison to the DRSIR [21], PPO [22], and Dijkstra [13] algorithms. Furthermore, we integrated the trust mechanism introduced in this paper with DRSIR to form TBDRSIR and conducted comparative evaluations.
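The trust computation over the four features can be sketched as below. This is a hedged illustration: the discrete-histogram representation, the additive smoothing constant, and the example weights are assumptions; the paper's actual weightings are those of Table 3.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as aligned lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def trust_value(baseline, observed, weights):
    """Weighted KL divergence of a node's current feature histograms
    (SIP, DIP, SYN, DP) against its attack-free baseline.
    Larger divergence means the node looks less trustworthy."""
    return sum(weights[f] * kl(observed[f], baseline[f]) for f in weights)
```

Because KL divergence grows as the observed traffic mix drifts from the attack-free baseline, mixing in a higher proportion of DDoS records raises the trust value, which is the positive correlation reported in Table 4.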
(1) Dijkstra Algorithm [13]: The Dijkstra algorithm selects paths based on the weights of a graph. It is simple and feasible and is an important component of the OSPF protocol. We use it to show how far our algorithm's delay deviates from the theoretical optimum and to evaluate its security performance.
(2) DRSIR Algorithm [21]: DRSIR is a deep reinforcement learning algorithm based on DQN, and its effectiveness has been thoroughly demonstrated in [21]. However, it does not explore multiple objectives. We use it to showcase our algorithm's ability to optimize multiple objectives.
(3) TBDRSIR (Trust-Based DRSIR): TBDRSIR is an ablation variant introduced to explore the performance of TBPPO. It draws inspiration from DRSIR and combines it with the dynamic security assessment capability we designed. We use it to test the performance of our security mechanism on other algorithms.
(4) PPO Algorithm [22]: The PPO algorithm has been used to solve routing optimization problems in SDNs [22]. Its advantages lie in its adaptability to diverse environments and good delay performance; however, the authors did not consider security capabilities. Additionally, we have made improvements to PPO, enabling it to achieve a superior performance. We use it as a baseline to demonstrate our algorithm's superior security and better delay variation performance.

Network Topologies
These experiments were performed on two distinct network topologies: NSFNet-14 [57] and Germany-50 [58]. NSFNet-14 consists of 14 nodes and 19 undirected edges, with an initial route count of 61 determined using the DFS algorithm. In contrast, the Germany-50 network comprises 50 nodes and 176 undirected edges, with an initial route count (K) capped at 10,000 and calculated using an enhanced DFS algorithm. Figure 3 shows the design and network topology of NSFNet-14 and Germany-50.
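The capped path pre-computation can be sketched as a depth-first enumeration of simple paths that stops once K routes are collected. This is an illustrative minimal version, not the paper's enhanced DFS; `adj` is an adjacency-list dict assumed for the example.

```python
def dfs_paths(adj, src, dst, k):
    """Enumerate up to k simple paths from src to dst by iterative DFS.
    Restricting to simple paths rules out routing loops by construction."""
    paths, stack = [], [(src, [src])]
    while stack and len(paths) < k:
        node, path = stack.pop()
        if node == dst:
            paths.append(path)
            continue
        for nxt in adj[node]:
            if nxt not in path:          # never revisit a node on the path
                stack.append((nxt, path + [nxt]))
    return paths
```

On Germany-50 the cap K = 10,000 keeps the preselected route group tractable; the resulting set is then sorted by delay before forming the action space.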


Convergence Comparison
Figure 4 shows the convergence performance comparison in the NSFNet-14 and Germany-50 topologies after 3000 episodes of simulation, as represented by the deep reinforcement learning metric of reward convergence. It is worth noting that because PPO and DRSIR do not incorporate our security mechanism, their rewards lack the corresponding negative values of r_safety(τ) and r_t(τ), which leads to a certain numerical difference between TBPPO and PPO and between TBDRSIR and DRSIR. Figure 4a shows the rewards collected per cycle in the NSFNet-14 topology. After 3000 episodes, TBPPO converged to −20 after 800 episodes, while PPO initially reached −20 after 500 episodes and later re-converged to −12 after 2700 episodes, both exhibiting good convergence trends. DRSIR showed poorer convergence, quickly rising to −60 in the first 100 episodes and fluctuating widely until the end of the 3000 episodes. In contrast, TBDRSIR displayed greater volatility and ultimately failed to converge. These findings indicate that PPO is better suited than DQN to mechanisms that combine the trust value and node diversity.
In Figure 4b, the rewards collected per cycle in the Germany-50 network topology are presented; this topology offers more potential routes than NSFNet-14. From the reward comparison in the figure, the convergence patterns of the various algorithms can be observed. TBPPO starts to approach −85 after roughly 1400 episodes, while PPO achieves convergence at −75 after just 1200 episodes. DRSIR rapidly climbs to around −150 within the first 300 episodes, but exhibits higher volatility than the PPO-based algorithms. In contrast, TBDRSIR does not reach convergence. Clearly, TBPPO consistently performs well on larger networks, while TBDRSIR shows an inferior performance in our analysis, possibly because the inclusion of the trust mechanism negatively impacts DQN's performance. Compared to NSFNet-14, both PPO and TBPPO continue to show superior convergence and optimization capabilities.

Delay Comparison
Figure 5 illustrates the delay performance comparison in the NSFNet-14 and Germany-50 topologies following the simulation of 3000 episodes. Figure 5a shows the delay performance in the NSFNet-14 topology. After 3000 episodes, TBPPO, TBDRSIR, PPO, and DRSIR divided packets across multiple routes, while Dijkstra used only the shortest path, with a fixed delay of 160 ms. In terms of average delay, TBPPO converged to 223.3 ms after 600 episodes, and PPO reached 216.67 ms after 500 episodes. DRSIR displayed gradual convergence, approaching around 260 ms after 1200 episodes, albeit with some noticeable fluctuations. In contrast, TBDRSIR initially converged to 300 ms, but experienced a sudden divergence around episode 2700, ultimately failing to converge. These findings indicate that in the context of low-delay network exploration, PPO outperforms DQN. Additionally, TBPPO, which must balance exploration and security considerations while not falling significantly behind PPO, remains competitive. Furthermore, when compared to the theoretically optimal Dijkstra, both TBPPO and PPO exhibit a relatively similar performance.
In Figure 5b, the delay performance in the Germany-50 network topology is presented. This network comprises 50 nodes and 176 edges, resulting in a larger number of possible routes compared to NSFNet-14. To address resource constraints, the paths were sorted by delay, with the top 10,000 routes retained; the action space for testing the algorithm consisted of these 10,000 routes. After a 3000-cycle simulation, TBPPO, TBDRSIR, PPO, and DRSIR employed a data packet division strategy using three distinct paths for data transmission. In contrast, the Dijkstra algorithm, in its quest for the shortest route within the network topology, funneled all traffic along that single path. Within the simulated network configuration, only one shortest path with a 160 ms delay was available, which ultimately led to Dijkstra maintaining a consistent average delay of 250 ms.

According to the delay comparison in the figure, the convergence patterns of the various algorithms can be observed. TBPPO starts to approach 423.3 ms after roughly 1400 episodes, while PPO achieves convergence at 503.3 ms after 1200 episodes. DRSIR begins to approach the 657 ms mark after 300 episodes, but its performance continues to fluctuate. In contrast, TBDRSIR does not reach convergence. It is evident that TBPPO consistently performs well on large networks, while TBDRSIR, in our analysis, exhibits an inferior performance, possibly because the inclusion of the trust mechanism negatively impacts DQN's performance. When compared to NSFNet-14, PPO and TBPPO still showcase superior convergence and optimization capabilities. Moreover, as the network scales up, TBPPO's advantage in securing routing becomes more apparent, resulting in lower delay due to fewer instances of attacks.

Trust Value Comparison
Figure 6 illustrates a comparative analysis of the KL trust values across the five algorithms. In the context of NSFNet-14, as shown in Figure 6a, it is evident that the Dijkstra algorithm fails to take trust values into account. Consequently, it directs all network traffic along a single path, leading to a consistently static KL trust value of approximately 1; this static value clearly indicates the Dijkstra algorithm's inability to achieve routing convergence. The TBPPO algorithm, on the other hand, displays a noticeable convergence trend and eventually stabilizes at around 0.5, with intermittent fluctuations of around 0.2. In contrast, the PPO algorithm, which does not explicitly address security concerns, exhibits a performance similar to Dijkstra. Both the DRSIR and TBDRSIR algorithms adapt inadequately to dynamic DDoS attack defense scenarios, failing to achieve trust value convergence. As a result, the TBPPO algorithm consistently maintains lower KL trust values, suggesting a reduced vulnerability to attacks and an enhanced security performance.
Figure 6b compares the performance of the KL trust values on Germany-50. Much like the scenario in NSFNet-14, the Dijkstra algorithm does not consider trust values; it channels all traffic through a single path, resulting in a consistently stable KL trust value of around 1.5 without achieving convergence.


Delay Variation Comparison
In Figure 7, the performance of the different routing algorithms in terms of delay variation is compared; Dijkstra, which does not take delay variation attributes into account, is excluded. The comparison covers TBPPO, TBDRSIR, PPO, and DRSIR. On the NSFNet-14 data, as depicted in Figure 7a, the TBPPO algorithm starts to converge to around 60 ms after roughly 1300 episodes, showing a clear convergence trend. PPO, on the other hand, converges to around 60 ms after 600 episodes. TBDRSIR reaches approximately 125 ms after 1700 episodes, with significant fluctuations. DRSIR, however, fails to converge, with an average delay variation of around 70 ms; as the shaded area in the graph shows, it continues to fluctuate between 15 ms and 175 ms even after about 3000 episodes. These results suggest that TBPPO, while emphasizing security, maintains a delay performance competitive with PPO.
In our performance analysis of the Germany-50 data, when comparing delay variations (as shown in Figure 7b), we evaluated TBPPO, TBDRSIR, PPO, and DRSIR; Dijkstra was omitted from this comparison due to its lack of multiple delay-aware paths. As illustrated in Figure 7b, the TBPPO algorithm exhibits a clear convergence trend, starting to converge around the 1500th episode to a delay variation of approximately 50 ms. In contrast, PPO achieves convergence earlier, at around 1000 episodes, with an average delay variation of approximately 140 ms. TBDRSIR fails to reach convergence, while DRSIR exhibits more noticeable convergence at around 210 ms. Interestingly, even though TBPPO prioritizes security, its delay variation performance is not significantly inferior to PPO's. In the context of Germany-50, TBPPO demonstrates the benefits of secure route discovery: by reducing the number of attacks, it achieves lower delay than PPO, showcasing improved real-time communication capabilities.


Node Diversity Analysis
In Figure 8, we performed an extensive examination of node diversity across the different routing algorithms. When observing NSFNet-14 (see Figure 8a), the Dijkstra algorithm consistently maintains a fixed path for the traffic, resulting in a constant node diversity value of 11. In contrast, the TBPPO algorithm gradually converges to a node diversity of 8 over approximately 600 episodes, demonstrating a clear convergence trend. The PPO algorithm also exhibits a commendable convergence performance but stabilizes at a node diversity value of 15. Conversely, the TBDRSIR and DRSIR algorithms both reach a node diversity of 13, albeit with relatively significant fluctuations. These findings indicate that the TBPPO algorithm places greater emphasis on optimizing the composition of routing nodes than the other algorithms. Moreover, the TBPPO algorithm distinguishes itself by offering notable security advantages over the traditional Dijkstra and PPO approaches, and it outperforms the DRSIR and TBDRSIR algorithms in terms of stability and various performance metrics. The algorithm excels at providing multiple dependable routing options in network topologies with consistent structures but dynamically changing link conditions, striking a balance between link delay and security.

In Figure 8b, for the Germany-50 data, the Dijkstra algorithm consistently follows a fixed path, maintaining a constant node diversity value of 38. On the other hand, the TBPPO algorithm begins to converge towards a node diversity of 23 after about 1500 episodes, displaying a clear convergence pattern. PPO, which does not take node types into account, shows a performance similar to Dijkstra. Both the TBDRSIR and DRSIR algorithms still exhibit relatively large fluctuations in node diversity. These findings imply that the TBPPO algorithm places greater emphasis on the composition of routing nodes, thereby improving its ability to prevent simultaneous failures of similar nodes when sudden events occur.

In Figure 9, we executed a comprehensive evaluation of the multi-path routing performance of the TBPPO algorithm. This assessment investigated the effectiveness of TBPPO under various output settings, specifically producing 3, 4, and 5 route paths, using a range of essential performance metrics including, but not limited to, average delay, average trust value, delay variation, and node diversity.
In Figure 9a, in terms of average delay, three routes achieved the best result at 500 milliseconds, 100 milliseconds faster than the 600 milliseconds of four routes. Five routes, at about 800 milliseconds, were roughly 200 milliseconds slower than four routes. Both three and four routes converged, with similar convergence speeds; under five routes, the algorithm exhibited larger fluctuations but still showed some degree of convergence.
In Figure 9b, regarding the trust value, three routes again achieved the best value at around 0.7, with four routes following at 1.2. Five routes performed worst, with a trust value of around 1.5. The convergence behaviour of the three scenarios differed only slightly.
In Figure 9c, in terms of delay variation, four routes performed similarly to three routes, with both achieving an average delay variation of around 60 milliseconds and clear convergence trends. Five routes did not converge to the same extent, resulting in an average delay variation of around 100 milliseconds.
In Figure 9d, with respect to node diversity, four routes followed a trend comparable to three routes, converging at approximately 50 versus 40 for three routes, while five routes ultimately converged at around 80. In terms of convergence characteristics, three and four routes demonstrated robust performance and strong convergence, whereas five routes showed some degree of convergence but with a wider range of fluctuations. These findings suggest that TBPPO performed well with both three and four routes, but its performance declined with five routes. We therefore recommend not using TBPPO for multi-path routing tasks with more than four routes.
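The multi-path candidates evaluated above come from TBPPO's DFS pre-computation of loop-free path sets rather than hop-by-hop decisions. A minimal sketch of that idea (our own simplified version, not the authors' implementation): enumerate simple paths with DFS, then keep the k lowest-delay candidates.

```python
def simple_paths(adj, src, dst, max_hops=8):
    """Enumerate loop-free (simple) paths from src to dst via DFS."""
    paths, stack = [], [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue
        for nxt in adj.get(node, []):
            if nxt not in path:  # forbid revisiting a node -> no routing loops
                stack.append((nxt, path + [nxt]))
    return paths

def best_k_paths(adj, delay, src, dst, k=3):
    """Keep the k candidate paths with the lowest total link delay."""
    cost = lambda p: sum(delay[(a, b)] for a, b in zip(p, p[1:]))
    return sorted(simple_paths(adj, src, dst), key=cost)[:k]

# Toy 4-node directed topology with per-link delays (milliseconds).
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
delay = {(0, 1): 10, (0, 2): 5, (1, 3): 10, (2, 3): 30}
print(best_k_paths(adj, delay, 0, 3, k=2))  # -> [[0, 1, 3], [0, 2, 3]]
```

Restricting the agent to a precomputed path set shrinks the reinforcement learning action space from "choose a next hop at every node" to "choose one of k whole paths", which is the design choice the paper credits for avoiding loops.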

Conclusions
This paper presents the TBPPO algorithm, a deep reinforcement learning-based approach that addresses the limitations of existing intelligent routing algorithms in SDN environments. TBPPO leverages a unique trust value mechanism based on KL divergence and optimizes the DFS and PPO algorithms to establish secure, low-delay routing solutions. First, it refines the DFS algorithm for route selection, effectively reducing the complexity of the reinforcement learning action space and avoiding routing loops. Second, it employs an improved KL divergence algorithm to estimate trust values related to potential node attacks, together with a node diversity assessment method to estimate ISP balance, providing an easy-to-compute security assessment for the algorithm. Finally, it enhances PPO to explore multiple performance-balanced paths, enabling diverse network configurations and communication pairs to select routes that align with the network characteristics. We demonstrated the effectiveness of TBPPO compared to cutting-edge methods in medium to large network topologies. The results show that, in large networks, TBPPO with its security mechanisms experienced fewer attacks during route selection than PPO; both its delay and its delay variation were reduced by about 100 ms, its trust value led by 0.5, and its node diversity was ahead by 20. Moreover, when comparing TBPPO and PPO with TBDRSIR, the PPO-based algorithms still converge stably under complex multi-objective dynamic conditions, a result attributable to the excellent stability of PPO. In multi-route performance tests, we found that TBPPO's performance deteriorates significantly with more than four routes; we therefore recommend applying TBPPO to three- or four-path tasks to showcase its best capabilities. We hope that TBPPO can offer a better multi-path secure routing optimization method for real-time communication scenarios such as online classes, streaming videos, live web broadcasts, and real-time meetings. In future research, we will focus on applying the algorithm in heterogeneous network scenarios, making it adaptable to a wider range of practical applications such as smart grids, and we plan to introduce graph convolutional networks (GCNs) into the DRL process for better performance.
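For reference, the standard clipped surrogate objective on which the PPO family builds (the paper's improved variant modifies the reward design around it, not this core form) is:

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
```

where \(\hat{A}_t\) is the advantage estimate and \(\epsilon\) the clipping range; the clipping keeps each policy update close to the previous policy, which is the source of the stability credited above.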

Figure 2. Flowchart of parameter tuning and model establishing.

Figure 5. Performance comparison of the average time delay of five algorithms: (a) NSFNet-14, and (b) Germany-50.

Figure 6b compares the performance of the KL trust values on Germany-50. As in NSFNet-14, the Dijkstra algorithm does not consider trust values: it channels all traffic through a single path, resulting in a consistently stable KL trust value of around 1.5 without achieving convergence. The graph shows that the TBPPO algorithm exhibits a noticeable trend towards convergence, although with more fluctuation than on NSFNet-14; it eventually stabilizes at approximately 0.8, with fluctuations of around 0.4. In contrast, PPO converges at 1.3. Both the DRSIR and TBDRSIR algorithms fail to achieve trust value convergence. This result demonstrates the capability of the TBPPO model to identify relatively secure routes in a large network environment.
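The KL trust value compared above can be illustrated with a small sketch. The exact distributions the authors compare are not restated in this excerpt, so the version below assumes a node is scored by the KL divergence between its observed traffic distribution and a benign baseline; higher divergence suggests more anomalous, DDoS-like behaviour. All names and the example traffic mixes are illustrative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical benign traffic mix vs. two observed nodes.
baseline = [0.70, 0.20, 0.10]        # e.g. shares of TCP / UDP / ICMP
normal_node = [0.68, 0.22, 0.10]     # close to baseline -> low KL score
attacked_node = [0.10, 0.10, 0.80]   # ICMP flood -> high KL score

print(kl_divergence(normal_node, baseline) < kl_divergence(attacked_node, baseline))  # -> True
```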

Figure 9. Performance of TBPPO with respect to the quantity of available paths: (a) average delay, (b) average trust value, (c) delay variation, and (d) node diversity.

Table 1. Reinforcement learning in SDN related research.

Start DRL algorithm:
While next state is not final state do
    Actor network selects action a_l
    Calculate the reward R_{a_l}(τ) for this action
    Store the current experience in the experience replay buffer
    Sample a batch from the experience replay buffer
    Calculate the TD error
End
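The store-sample-TD-error plumbing of the training loop above can be sketched in runnable form. This is our own minimal stand-in (toy transitions, a constant value function), not the authors' network; it only illustrates the bookkeeping the pseudocode describes.

```python
import random

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of experience transitions."""
    def __init__(self, capacity=1000):
        self.buf, self.capacity = [], capacity

    def store(self, transition):  # (state, action, reward, next_state, done)
        if len(self.buf) >= self.capacity:
            self.buf.pop(0)       # drop the oldest experience
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def td_error(value, s, r, s_next, done, gamma=0.99):
    """One-step temporal-difference error: r + gamma * V(s') - V(s)."""
    target = r + (0.0 if done else gamma * value(s_next))
    return target - value(s)

# Toy run: fill the buffer with random-reward transitions, then sample a batch.
random.seed(0)
value = lambda s: 0.5             # placeholder critic
buf = ReplayBuffer()
for step in range(20):
    buf.store((step, 0, random.random(), step + 1, step == 19))
batch = buf.sample(4)
errors = [td_error(value, s, r, s2, d) for (s, a, r, s2, d) in batch]
print(len(errors))  # -> 4
```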

Table 2. Parameters of neural network and reward function.

Table 4. The relationship between KL trust value and DDoS attack ratio of the tested node.