Deep Reinforcement Learning-Based Network Routing Technology for Data Recovery in Exa-Scale Cloud Distributed Clustering Systems

Abstract: Research has been conducted to efficiently transfer blocks and reduce network costs when decoding and recovering data in an erasure-coding-based distributed file system. Technologies using software-defined network (SDN) controllers can collect network data and manage the network more efficiently. However, the available bandwidth changes dynamically with the amount of data transmitted on the network, and the data transfer time becomes inefficient owing to the longer latency of existing routing paths when nodes and switches fail. We propose deep Q-network erasure coding (DQN-EC), which converges erasure coding with a DQN to learn dynamically changing network elements and thereby solve these routing problems. Using the SDN controller, DQN-EC collects the status, number, and block sizes of the nodes storing blocks during erasure coding. From the fat-tree network topology used for the experimental evaluation, it collects the elements of typical network packets, the bandwidth of the nodes and switches, and other information. The collected data undergo deep reinforcement learning to avoid failed nodes and switches and to provide optimized routing paths by selecting the switches that conduct block transfers most efficiently. DQN-EC achieves a 2.5-times-faster block transmission time and 0.4-times-higher network throughput than the open shortest path first (OSPF) routing algorithm. The bottleneck bandwidth and transmission link cost are reduced, improving the recovery time approximately twofold.


Introduction
Owing to the recent development of technologies such as smartphones, the IoT, artificial intelligence, and big data, large volumes of data are being generated and utilized. Such data were previously stored in replication-based distributed file systems; however, with the spread of cloud computing, big data are now stored more efficiently using erasure-coding-based distributed file systems [1][2][3]. Replication techniques divide the original data into multiple data blocks and store n copies of each block on separate distributed servers. Because such techniques keep n replicas, disk storage efficiency is extremely low: achieving n-fold availability costs n times the space. To improve on this, distributed file systems based on erasure coding are used. Erasure coding divides the original data into data blocks and generates parity blocks through encoding operations; decoding recovers the original data through an operation that combines the distributed data blocks with the parity blocks [4]. Although such erasure-coding-based distributed file systems overcome the space-efficiency problem, a large disk overhead occurs during encoding and decoding. Furthermore, block transfers are not performed efficiently owing to the countless variables arising along the network routing path. When decoding, it is important to efficiently send data and parity blocks to the destination node without

Principle of Erasure Coding
The typical method of data storage in a distributed file system is to divide the original data into blocks of a certain size using replication techniques, replicate each divided block n times, and store the copies separately on multiple nodes. However, because traditional replication-based distributed file systems store replicated blocks, their space consumption increases n-fold. Therefore, erasure-coding-based distributed file systems, which extend existing replication-based distributed file systems, have recently been utilized. Unlike replication, which divides data blocks into constant sizes, replicates them, and stores the copies, erasure coding (Figure 1) stores data through encoding and recovers them through decoding [10][11][12]. Figure 1 shows a structure in which a Reed-Solomon (RS) code (6, 3) is used to construct six data blocks and three parity blocks generated through encoding. That is, when the original data are entered, they are divided into six data blocks by the RS (6, 3) erasure code. Each data block then contributes to generating the parity blocks through calculations with an encoding matrix, and all blocks (data blocks and parity blocks) are distributed and stored across the disks or network environment in use, as shown in Equation (1).
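As a simplified illustration of the divide/encode/recover cycle above, the following sketch uses a single XOR parity block rather than the RS(6, 3) code of Figure 1; the function names and data are hypothetical, and real systems perform the same recovery with Reed-Solomon arithmetic over GF(2^8).

```python
# Simplified erasure-coding illustration: one XOR parity block protects
# the data blocks against the loss of any single block.

def encode(data_blocks: list[bytes]) -> bytes:
    """Generate one parity block as the byte-wise XOR of all data blocks."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving_blocks: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data block from the survivors + parity."""
    return encode(surviving_blocks + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]   # original data divided into blocks
parity = encode(data)                # encoding step
# Simulate losing data block 1 (index 1) and decoding it back:
recovered = recover([data[0], data[2]], parity)
assert recovered == data[1]
```

An RS(6, 3) deployment follows the same pattern but generates three parity blocks, so up to three of the nine stored blocks can be lost and recovered.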
The parity block P_i is generated by multiplying the data blocks D_j by the encoding matrix coefficients α_{i,j} (i = row, j = column) in Figure 1; although these operations are expressed here through multiplication, they are actually computed through various algorithms, such as RS erasure coding, Liberation, AVX:CRS, and XOR algorithms. Figure 2 shows the behavioral process of the decoding operations when the nodes possessing data block 2 and parity block 1 fail, owing to factors such as a network connectivity delay or a natural disaster, and do not respond. Such a failure is recoverable by decoding with the surviving data blocks and the parity blocks generated through encoding. The number of blocks that can be decoded and recovered reaches only up to three, the number of parity blocks, when configured with an RS (6, 3) code. In other words, three out of nine blocks can be recovered during a failure, as shown in Equation (2). Blocks d* that require recovery are regenerated onto new nodes by decoding with the decoding matrix coefficients (β_1, β_2, β_3, …, β_i) and the generated parity blocks (P_1, P_2, P_3, …, P_i).

Principle of Deep Q-Network
Q-learning, the reinforcement learning underlying a DQN, trains agents that interact with the environment. In the early stages of learning, the agent learns slowly over time, taking random actions in various situations called 'states'. Policies are configured by updating all values in a table-type Q-table according to the state and action, then referring to the table when selecting an action. This is expressed in Figure 3 and Equation (3) as a way of updating the Q-table to maximize the final cumulative reward while progressing sequentially through episodes [13]. The sum Q(s, a) of all rewards that can be received when taking action a in the current state s is calculated as the sum of the immediate reward and the rewards that can be received in the future. On the right side of the equation, r(s, a) represents the immediate reward received when action a is taken in the current state s. In addition, s′ is the very next state reached by taking action a in the current state s, and max_a′ Q(s′, a′) is the maximum reward that can be received in the next state s′. The goal of the agent is to select an action that maximizes this value. Finally, γ is a value called the discount factor, which controls the importance of future value: the larger γ is, the greater the weight of future rewards, and the smaller it is, the greater the importance of the immediate reward.
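The Q-table update corresponding to Equation (3) can be sketched as a minimal incremental Q-learning step; the learning rate α and the toy transition are assumptions for illustration (Equation (3) itself describes the converged fixed point).

```python
from collections import defaultdict

gamma, alpha = 0.95, 0.1          # discount factor and assumed learning rate
Q = defaultdict(float)            # Q-table keyed by (state, action)

def update(s, a, r, s_next, actions):
    # Move Q(s, a) toward r(s, a) + gamma * max_a' Q(s', a'), as in Eq. (3).
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One illustrative transition: action 'right' in state 0 yields reward 1
# and moves the agent to state 1 (all Q-values start at zero).
update(0, "right", 1.0, 1, actions=["left", "right"])
assert Q[(0, "right")] > 0.0
```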
Because Q-learning must hold a value for every state-action pair in the Q-table, there is a limit to how many states can be stored. To improve on this, Q-learning was combined with neural network models; however, when applying gradient descent, the high correlation between the training data becomes a problem, and the target value keeps moving. Therefore, the DQN combines Q-learning with deep learning by using neural network models instead of a Q-table to approximate the Q-values, and constructs a target neural network separate from the experience replay. This is expressed in Figure 4 and Equation (4) [14]. In the existing Q-learning of Equation (3), the final goal is to converge and reach the true value when the left and right sides are the same. The DQN of Equation (4) has the same goal, i.e., to train the neural network models in the direction that minimizes the difference between the two values through the loss function. In Equation (4), Q(s′, a′; θ′) is the Q-target and refers to the state, action, and parameters of the neural network applied in deep-learning-based reinforcement learning. The target network optimizes the θ of the prediction network while holding the weights θ′ of the Q-target fixed when minimizing the loss function, similar to how the target is held fixed when conducting supervised learning. In other words, when training a prediction network representing Q(s, a; θ), as shown in Figure 4, a target network representing Q(s′, a′; θ′) is created separately with fixed weights. The weights of the prediction network are then passed to the target network at regular intervals. Learning is thus not affected by the weights currently being learned when narrowing the difference between the values of the target network and the prediction network. Therefore, more stable learning is possible than with Q-learning.
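A minimal numerical sketch of the target computation in Equation (4), with linear models standing in for the prediction and target networks; all shapes, weights, and values are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95
n_state, n_action = 18, 16            # sizes matching the paper's Table 2

theta = rng.normal(size=(n_state, n_action))   # prediction network weights
theta_target = theta.copy()                    # target network: frozen copy

def q_values(state, weights):
    # Linear stand-in for a neural network's Q-value output.
    return state @ weights

s, s_next = rng.normal(size=n_state), rng.normal(size=n_state)
r, a = 1.0, 3                          # one observed reward and action

# Target uses the frozen weights θ′; prediction uses θ (Equation (4)).
y = r + gamma * q_values(s_next, theta_target).max()
loss = (y - q_values(s, theta)[a]) ** 2   # minimized w.r.t. θ only

# At regular intervals the frozen weights are overwritten: θ′ ← θ.
theta_target = theta.copy()
assert loss >= 0.0
```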
As shown in Figure 4, the DQN takes the action considered optimal at every moment through the learned neural network model and stores data on the states, actions, and rewards in the replay memory until the end of the episode. In other words, if the batch size required for learning is n, the agent takes n actions and stores n data sequentially in the replay memory. Unlike Q-learning, it does not learn from the most recently stored data in the replay memory; instead, it randomly extracts mini-batch-sized samples from the n data stored thus far to train the neural network. This is because, if the recently stored data in the replay memory were used as is, the correlation between the data would be too large, and learning would therefore not proceed properly.
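The replay-memory behavior described above can be sketched as follows; the capacity, batch size, and toy transitions are illustrative assumptions.

```python
import random
from collections import deque

# Experiences are stored sequentially, but training draws uniformly random
# mini-batches to break the temporal correlation between consecutive samples.
class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries drop off

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sample, NOT the most recent entries.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for t in range(100):                  # the agent takes n = 100 actions
    memory.store(t, t % 4, 0.0, t + 1)
batch = memory.sample(32)             # random mini-batch for one update
assert len(batch) == 32
```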
Therefore, the DQN makes two improvements. First, because the network is separated into two deep-learning neural network models, the learning instability caused by a moving target can be improved. Second, the correlation among the data can be resolved because the model is trained on mini-batch-sized samples randomly extracted from the replay memory.

Related Studies and Motivation
This section first introduces algorithms and methodologies related to erasure coding, which constitute the network environment for the experimental evaluation. Second, we introduce network routing algorithms and methodologies that utilize supervised learning and reinforcement learning. Finally, the need for this study is explained.

Studies Related to Erasure Coding
Erasure coding significantly increases space efficiency, taking up less disk space than traditional replication-based distributed file systems. However, because various cost problems, such as disk overhead and network bottlenecks, occur during data encoding and decoding, various algorithms and methodologies have been proposed to resolve these issues. An erasure-coding-based distributed file system divides the data encoding/decoding process into five stages: client overhead, master process, parity calculation, data distribution, and slave process. A cost analysis in which the network bandwidth is sequentially increased to 1 G, 10 G, 40 G, and 100 G suggests that client overhead and data distribution are the costliest stages [15,16]. Disk contention availability is applied to efficiently distribute multiple recovery requests using parallel recovery to reduce client overhead costs, random chunk allocation is used to increase load-balancing efficiency, and asynchronous recovery is utilized to increase concurrency [17]. To reduce the costs incurred during data distribution, one study significantly reduced the recovery bandwidth associated with the decoding performance by appropriately adjusting the recovery speed [18]. Another study on a top-down data transfer technique used distributed data calculations for updating the parity nodes [19]. In a further previous study, failed data were recovered at a small scale across the storage nodes, reducing the recovery time to approximately the normal read time for the same amount of data [20]. Other studies [3,4] identified and avoided the traffic transmission paths of bottlenecked nodes and their associated nodes, thereby reconfiguring the routing paths and eliminating bottlenecks across the entire system to reduce the traffic transmission costs.
In these erasure-coding-related studies, network routing methodologies and algorithms improve the routing paths to the destination nodes where the data and parity blocks are transmitted during decoding, reducing the costs of bottlenecks and transmission links. However, because aggregated hardware-based switches are used to reduce bottlenecks and traffic transmission costs, additional hardware costs are required; in addition, the bandwidth is always evaluated at a fixed value, so the block transmission rates become inefficient when the bandwidth suddenly drops. In contrast, because DQN-EC collects dynamically changing data from the erasure coding network environment and identifies failed switches, its DQN neural network model learning can efficiently yield optimal routing paths and transfer blocks.

Supervised Learning-Based Network Routing
Supervised learning is a method of learning from labeled data to predict values and categories for new data [21]. In [22], the authors proposed graph-aware deep-learning-based intelligent routing to solve problems such as slow convergence and degradation that occur under complex, dynamic networking conditions. In addition, in [23], the authors proposed an ML-assisted least-loaded algorithm, a combination of a Bayes classifier and the existing least-loaded algorithm, configured to solve connection failure problems in circuit-switched networks. Moreover, the authors of [24,25] proposed the RouteNet algorithm to help optimize routing schemes for mean delays by leveraging the ability of graph neural networks (GNNs) to learn and model graph-based information. In [26], the authors proposed a deep-learning routing algorithm whose neural network takes as input a set of numbers indicating the number of packets forwarded through each node of the network, and whose output uses network traffic information in the form of an interface for forwarding packets. In addition, in [27], the authors studied a routing strategy based on a deep learning architecture: the deep belief architecture (DBA) was defined between the number of inbound packets observed on each router and the time interval, and an algorithm was proposed to construct a restricted Boltzmann machine on each hidden layer of the DBA. In [28], the authors also proposed an automated protocol design method for controlling and managing a network by applying two important aspects of the GNN model, i.e., distributing the topology information between different nodes and calculating the topology and link-weighted paths.
In addition, in [29], the authors proposed MLProp, leveraging a variety of input parameters, such as the buffer capacity and node access rates, to indicate whether packets can be delivered successfully along given links, improving the link performance of probabilistic routing approaches for intermittent opportunistic networks. Finally, in [30], the authors developed a flexible supervised learning framework through a training model that includes learned paths and all path pairs of a mixed integer linear program, and proposed a routing methodology that utilizes deep neural network methods to minimize the identity of network systems instead of always selecting the shortest paths.
Related studies on supervised-learning-based network routing methodologies predict routing paths by training machine learning models such as naïve Bayes classifiers, decision trees, and deep-learning-based neural network models. However, to conduct supervised learning, a dataset must be prepared in advance, and the performance varies greatly depending on the type of model used. In addition, correct labels for the datasets must exist, yet it is difficult to produce an efficient predictive path because the correct labels can vary depending on the bandwidth and traffic intensity of the actual network. Therefore, DQN-EC can yield more efficient routing paths than the supervised-learning-based approaches above because it collects both static and dynamic data through SDN controllers and utilizes deep reinforcement learning, which does not depend on a specific supervised model.

Reinforcement-Learning-Based Network Routing
Reinforcement learning is a method of learning how to behave in each environment, and proceeds by receiving feedback as positive or negative rewards for the actions conducted by the agent under various situations [31]. In [32], the authors proposed a QAR algorithm for achieving time-efficient and adaptive QoS-provisioning packet delivery through reinforcement learning and QoS-aware compensation capabilities. The simulation results demonstrate that QAR outperforms existing learning solutions and provides fast convergence with QoS provisioning, facilitating practical implementation in large software-defined networks. In [33], the authors proposed RL4Net, a deep deterministic policy gradient-based form of Q-learning. The state is encapsulated as the amount of traffic flowing between each pair of routers (total size), the action corresponds to a weight update determining how the router selects the interface to forward on, and the reward is calculated based on the delay. In addition, in [34], the authors proposed applying Q-learning directly to packet routing, yielding an efficient routing policy in dynamically changing networks without needing to know the network topology or traffic patterns in advance. In [35], the authors proposed CQ-routing, through which the routing policy learned under a high load shows more than twice the performance of Q-routing in terms of the average packet delivery time. It has also been shown that CQ-routing can sustain higher load levels than both Q-routing and shortest path routing. Moreover, in [36], the authors proposed NNQ-routing, using a neural network approximation to improve the scalability of Q-routing. The Q function was replaced by a three-layer perceptron model, and Q-learning adopted successful packet delivery as positive compensation, and delay and lost packets as negative compensation.
In [37], the authors proposed two adaptive reinforcement-learning-based spectrum-aware routing protocols, based on Q-learning and dual-reinforcement learning, respectively. Spectrum-aware DRQ routing learns the optimal routing policies 1.5-times faster than spectrum-aware Q-routing at low and middle network loads. Under high network loads, the routing policies learned are seven times better than those of spectrum-aware Q-routing. Moreover, in [38], the authors proposed critical flow rerouting-reinforcement learning (CFR-RL), a reinforcement-learning-based scheme that learns policies to automatically select critical flows for traffic matrices. The state uses a traffic matrix for time t, and the action generates multiple actions for time step t at each state. A reward is used to reroute the critical flows to balance the link utilization and set up the flows to reflect the network performance. In [39], the authors also proposed positive and negative compensation values using network throughput and delay to optimize the reward for network throughput. After proper learning, the agent of their dueling double-deep Q-learning-based RL-routing predicts the future behavior of the underlying network and suggests better routing paths between switches. Moreover, in [40], the researchers addressed the poor responsiveness of neural networks that arises when multicast routing trees are generated by always training the DQN on the same environment; they proposed a multicast routing tree methodology that learns and responds to a new environment at each training.
Related studies on reinforcement-learning-based network routing methodologies yield routing paths by treating the routing nodes as agents and learning, based on Q-learning without specific models, in the direction that reduces latency. In particular, deep reinforcement learning is efficient because agents yield optimal routing paths by training neural network models on data collected from the network. However, related studies have not considered switches that fail along the routing paths, and studies applying a DQN to the transfer of the data and parity blocks requested during erasure coding decoding remain insufficient. Therefore, DQN-EC leverages the DQN, a deep-reinforcement-learning extension of Q-learning, to collect network data and yield efficient routing transmission paths.

Figure 5 shows a case in which a node and switch fail in an existing network routing path and the network routing transmission path is disconnected. In the event of such a failure, it is difficult to determine which links should be used to transmit data to the destination node efficiently. In other words, an optimized routing path is needed that measures the network bandwidth changing in real time, selects the link with the best bandwidth, and avoids nodes that fail owing to natural disasters and other factors. In addition, when selecting links, the link transfer cost should be reduced as much as possible, as mentioned in the erasure coding studies above. Therefore, in this study, we apply DQN-based deep reinforcement learning to a network environment configured with an erasure-coding-based distributed file system. In addition, we propose a methodology that feeds the neural network model not only the dataset related to network routing as input values but also the input parameter values required for decoding in erasure coding, and modifies the layers used in the neural network model.

System Design
This section discusses the methodology used for applying DQN-based network routing algorithms to the erasure coding network topology. Figure 6 shows the overall structure of applying the DQN-based network routing algorithm to the erasure coding network topology environment proposed in this paper. The algorithm utilizes SDN controllers, a network architecture approach that can intelligently and centrally control or program networks using software applications. Therefore, it can collect relevant data from the erasure coding network topology environment. Data required when configuring network routing paths using dynamically changed network bandwidth and failed node and switch information are collected and stored in the database at regular time intervals. The stored data correspond to the input layer parameters of the neural network model in a DQN-based network routing algorithm.

Database Storage
The parameters entered into the DQN neural network model are learned from data collected over a period of time in the database. We collected general data from the network and data from the erasure coding network topology. The main data collected from the network are the link bandwidth, link-available bandwidth, and hop count, and the main data collected from erasure coding are the failed nodes and switches, the block sizes to be transmitted, the source nodes, and the destination nodes. The list of data stored in the database and the parameters entered into the DQN neural network model are shown in Table 1. The parameters entered into the DQN neural network model correspond to symbols x_1 through x_18, where x_1 corresponds to the time period over which the data are collected. When transferring data and parity blocks, the elements that can be collected and stored in a typical network topology correspond to x_2 through x_9, and the elements that can be collected and stored in an erasure coding network topology correspond to x_10 through x_18. In particular, among the data that can be collected through the SDN controller, dynamic data are stored after a calculation performed once before data collection. The RTT is measured by sending a packet that the connected switch returns immediately and dividing the measured time by two, considering the bidirectional path of the response. Here, AB_sw is measured by calculating the difference between the maximum bandwidth of the link and the data rate of the next connected switch. The data rate is the difference between the number of bytes transmitted at time t and the number of bytes transmitted at time t + 1.
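The dynamic measurements described above (data rate, AB_sw, and one-way RTT) can be sketched as follows; the function names, polling interval, and numeric values are hypothetical illustrations of the calculations, not the paper's implementation.

```python
# Data rate is the byte-counter difference between two polls of a switch,
# and available bandwidth AB_sw is the link maximum minus that rate.

def data_rate(bytes_t: int, bytes_t1: int, interval_s: float = 1.0) -> float:
    """Bytes/s forwarded by a switch between polls at t and t + 1."""
    return (bytes_t1 - bytes_t) / interval_s

def available_bandwidth(link_max_bps: float, rate_bps: float) -> float:
    """AB_sw = maximum link bandwidth minus the next switch's data rate."""
    return link_max_bps - rate_bps

def one_way_rtt(round_trip_s: float) -> float:
    """Probe packet is returned immediately; halve for one direction."""
    return round_trip_s / 2

rate = data_rate(1_000_000, 1_600_000)          # 600,000 B/s over 1 s
ab = available_bandwidth(8_000_000, rate * 8)   # 8 Mbps link, rate in bits
rtt = one_way_rtt(0.02)                         # 20 ms round trip -> 10 ms
```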
The x_2 element, G(V, E), represents the locations and orientation, in a two-dimensional planar form, at which nodes and switches can be placed in the network topology. The x_3 and x_4 elements, RTT and MSS, are the network packet elements needed to transmit parity blocks; x_5 represents the currently available bandwidth of the link; and x_6, x_7, and x_8 are the bandwidths of the nodes and switches present before and after the current time. In addition, x_9 and x_10 represent the hop counts of the nodes and switches before and after the current location, and x_11, x_12, and x_13 represent the total number of nodes with data blocks and parity blocks in the erasure coding network topology and the number of data nodes and parity nodes for each block. Moreover, x_14 and x_15 are the currently active and inactive nodes, used to determine the number of failed nodes, and x_16, x_17, and x_18 are the source nodes from which the data and parity blocks are to be transmitted, the destination nodes, and the block sizes.

Performance of Deep Q-Network
In Section 4.2, the state is described as the input parameters required by the agent to derive actions from the DQN neural network model. All elements were used as inputs to the neural network model, along with the amount of change in each element. For example, the bandwidth of the next nodes and switches is expressed in Equation (5) through the current time t and the previous time t − 1, from which the amount of change is computed.
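The pairing of each collected element with its amount of change since the previous period, as in Equation (5), can be sketched as follows; the element values are hypothetical placeholders, not the paper's dataset.

```python
import numpy as np

# Each collected element x_i at time t is paired with its delta from t - 1,
# e.g. the bandwidth change of the next switch described by Equation (5).
x_t = np.array([420.0, 3.0, 12.0])     # e.g. bandwidth (Mbps), hops, RTT (ms)
x_prev = np.array([515.0, 3.0, 10.5])  # same elements at time t - 1

delta = x_t - x_prev                    # amount of change per element
state = np.concatenate([x_t, delta])    # current values + their changes
assert state.shape == (6,)
```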
The action is passed on to the erasure coding network topology according to the policy configured by the agent based on any state value. In addition, when a proper action is applied according to the state value, the agent can obtain a positive reward value for nodes that need to transfer blocks in the erasure coding network topology, and a negative reward value for moving to the wrong node and switch. Thus, the entire Action_set group performs k different actions for each time period t, as expressed through Equation (6).
A reward is utilized to update the weights in the DQN neural network model; unlike supervised learning, which uses labels for correct answers, each weight is updated through backpropagation using the action-driven rewards. Therefore, the learning outcomes of the neural network models may vary depending on how the reward is defined. In this study, the Reward_total value was configured in three parts: Reward_1 has a positive value when data are sent using the correct link, Reward_2 has a negative value when data are sent using a wrong link, and Reward_3 has an extremely small negative value applied when the agent enters a loop by failing to choose a link toward another node or switch. The overall reward value is shown in Equation (7).
Reward_total = Reward_1 + Reward_2 + Reward_3 (7)

Reward_1 is shown in Equation (8) and is set to a positive integer value when the movement across the previous link's node and switch location (Prehop_sw), the current node and switch location (Now_sw), and the next node and switch location (Nexthop_sw) follows the intended path.
Reward_2 is shown in Equation (9) and is set to a negative integer value when the link moves from the current node and switch position (Now_sw) to a failed node or switch (Node_dead).
Reward_3 is a negative value for exiting a loop state in which a constant cycle is repeated. If a large negative value were set, the agent could exit the loop and select another link; however, because this can erroneously drive the agent to select the wrong link, an extremely small negative value is set instead, as shown in Equation (10).
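The three-part reward of Equations (7)-(10) can be sketched as follows; the paper specifies only the signs and relative magnitudes, so the exact constants here are assumptions for illustration.

```python
# Assumed constants: the paper only fixes the signs (positive for the
# correct link, negative for a failed node, tiny negative for loops).
REWARD_CORRECT_LINK = 10    # Reward_1: moved along the intended path
REWARD_FAILED_NODE = -10    # Reward_2: moved to a failed node/switch
REWARD_LOOP = -0.1          # Reward_3: small penalty to escape loops

def total_reward(on_path: bool, hit_failed: bool, looped: bool) -> float:
    """Reward_total = Reward_1 + Reward_2 + Reward_3, per Equation (7)."""
    reward = 0.0
    if on_path:
        reward += REWARD_CORRECT_LINK
    if hit_failed:
        reward += REWARD_FAILED_NODE
    if looped:
        reward += REWARD_LOOP
    return reward

assert total_reward(True, False, False) > 0     # correct link rewarded
assert total_reward(False, True, False) < 0     # failed node penalized
assert -1 < total_reward(False, False, True) < 0  # loop penalty stays small
```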
Replay memory is the memory space that stores the entire history that the agent applies during repetitive learning. The DQN neural network model proceeds with learning over time from the data contained in the constant cycle time collected. At this time, the learning is unstable owing to the high correlation because the data were collected sequentially. Therefore, when learning from replay memory is carried out, the correlation is reduced as much as possible through mini-batch learning using randomly extracted data, as previously mentioned in the description of the DQN principle.
The input layer, hidden layers, and output layer of the DQN neural network model are listed in Table 2. The input layer was designed with 18 elements corresponding to the state, and three hidden layers were used: the first hidden layer was set to 32 units, the second to 64, and the last to 128. The activation function used by the hidden layers is the leaky ReLU [41]. The output layer is set to 16 units to select the link with the highest Q-value among the links to the next nodes and switches from the source node holding the data and parity blocks. Finally, the learning rate was tested by varying the value from 0.001 to 0.1, with 0.01 giving the best results, and the discount rate was set to 0.95.
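A forward pass matching the layer sizes in Table 2 (18 → 32 → 64 → 128 → 16) can be sketched with NumPy; the weights are random stand-ins for a trained model, and the input state is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(42)

def leaky_relu(x, slope=0.01):
    # Leaky ReLU activation used by the hidden layers.
    return np.where(x > 0, x, slope * x)

# Layer sizes from Table 2: 18 inputs, hidden 32/64/128, 16 outputs.
sizes = [18, 32, 64, 128, 16]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes, sizes[1:])]

def q_values(state):
    h = state
    for w in weights[:-1]:
        h = leaky_relu(h @ w)       # hidden layers with leaky ReLU
    return h @ weights[-1]          # 16 Q-values, one per candidate link

q = q_values(rng.normal(size=18))   # placeholder state vector
best_link = int(np.argmax(q))       # the link with the highest Q-value
assert q.shape == (16,)
```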

Optimized Routing Paths
When the optimal routing path has been calculated through the DQN, it is transmitted through OpenFlow to the erasure coding network topology. OpenFlow has a flow table that contains information regarding the path and method of packet forwarding used to transmit the blocks. When a packet arrives, the switch first checks whether the flow table contains information about the packet. If such information exists, the packet is processed accordingly; if it does not, the switch requests control information about that packet from the OpenFlow controller. Upon receiving the request from the switch, the OpenFlow controller checks the packet control information it holds and forwards the result to the OpenFlow switch. Packet control information within the OpenFlow controller can be entered through an API by an external program. The OpenFlow switch stores the control information received from the controller in the flow table and then uses the information in the flow table to forward the packet when the same kind of packet occurs again.
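The table-hit/table-miss behavior described above can be sketched as follows; the controller decision function, match keys, and port names are hypothetical stand-ins, not the Ryu/OpenFlow API.

```python
# Simplified model of flow-table forwarding: a switch forwards matching
# packets from its table and asks the controller on a table miss.
flow_table: dict[str, str] = {}            # match field -> output action

def controller_decide(packet_dst: str) -> str:
    """Stand-in for the controller's (e.g. DQN-derived) path decision."""
    return f"port-to-{packet_dst}"

def handle_packet(packet_dst: str) -> str:
    if packet_dst in flow_table:            # table hit: forward directly
        return flow_table[packet_dst]
    action = controller_decide(packet_dst)  # table miss: ask the controller
    flow_table[packet_dst] = action         # cache the rule in the flow table
    return action

first = handle_packet("node-7")    # miss: controller is consulted
second = handle_packet("node-7")   # hit: served from the flow table
assert first == second
```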
In the erasure coding network topology, the source nodes transmit the data blocks and parity blocks required for decoding to the destination node using the optimized network routing paths. The process of calculating the optimized routing paths is shown in Algorithm 1.
First, the input parameters stored in the database, the hyperparameters used to design the DQN neural network model, and the discount factors (step 1) are specified, and the elements that are entered into the DQN neural network model are initialized. An element is a replay memory used to solve the correlation problem, a Q-value of the action from the state, a Q-value of the target network to calculate the next action, a P list that stores routes from each source node to the destination node, and a 2d LP list that stores routes from all source nodes to destination nodes (step 2).
The detailed algorithm works through a double loop (step 3) because the block transfers required for the decoding operation must consider every routing path from each source node to the destination node (step 4). Using the same e-greedy method as Q-learning, the agent selects an action, i.e., which link to take from the current node to the next node, and acquires a reward value for moving to the next state (step 5). To reduce the correlation problem described above, previous transitions are stored in the replay memory and mini-batch learning is applied (step 6). The action is determined, the target Q-value is estimated from the target network, and the weights are updated through gradient descent to optimize the learning model (step 7).

Algorithm 1 DQN-based network routing in the erasure coding network topology
1  Input: hyperparameters of the neural network, discount factor γ
2  Initialize experience replay memory RD
   Initialize action-value function Q with random weights θ
   Initialize target action-value function Q' with weights θ' = θ
   Initialize network route list P
   Initialize complete network route 2D list LP
3  For i = 1 : (number of source nodes) do
4    For episode = 1 : 100 do
5      With probability ε select a random action a_t,
         otherwise select a_t = argmax_a Q(s_t, a; θ)
       Execute action a_t, obtain reward r_t and next state s_{t+1}, then update the network state
6      Store the experience (s_t, a_t, r_t, s_{t+1}) in RD
       Sample a random mini-batch (s_j, a_j, r_j, s_{j+1}) from RD
7      If the episode terminates at step j + 1 then
         Set target y_j = r_j
       Else
         Set target y_j = r_j + γ max_{a'} Q'(s_{j+1}, a'; θ')
       End If
       Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to θ
       Every C steps reset Q' = Q
     End For
8    Append the last route path to list P
     Append list P to list LP
9    Initialize all variables except LP (RD, Q, Q', P, a_t, r_t, y_t, ...)
   End For
10 Output: complete network route path list LP from all source nodes to the destination node

When a routing path from one source node to the destination node is completed, it is stored in the list P as well as in the 2D list LP (step 8). In addition, because a route from the next source node to the destination node must then be produced, all remaining variables are initialized except the LP list, which stores the final optimized routes (step 9). After the final iteration, the LP list stores the complete routing path from every source node to the destination node; thus, in the erasure coding network topology, each node transfers the blocks required for decoding to the destination node along its path (step 10).
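As a rough, self-contained sketch (not the authors' implementation), the inner training loop of Algorithm 1 might look as follows in Python; the environment interface (`reset`/`step`), the tabular Q stand-in for the neural network, and all sizes are hypothetical:

```python
import random
from collections import deque
import numpy as np

class TabularQ:
    """Tiny tabular Q-function standing in for the DQN (hypothetical sizes)."""
    def __init__(self, n_states, n_actions, lr=0.05):
        self.w = np.zeros((n_states, n_actions))
        self.lr = lr
    def q(self, s):
        return self.w[s]
    def update(self, s, a, target):
        # one gradient step on (target - Q(s, a))^2
        self.w[s, a] += self.lr * (target - self.w[s, a])

def train_route(env, n_states, n_actions, episodes=100, gamma=0.9,
                eps=0.1, batch=32, sync_every=10, max_steps=50):
    """Inner loop of Algorithm 1: learn a route from one source node."""
    Q = TabularQ(n_states, n_actions)
    Q_target = TabularQ(n_states, n_actions)
    replay = deque(maxlen=1000)              # experience replay memory RD
    path = []
    for ep in range(episodes):
        s = env.reset()
        path = [s]
        for _ in range(max_steps):
            # e-greedy action selection (step 5)
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q.q(s)))
            s2, r, done = env.step(a)
            replay.append((s, a, r, s2, done))       # store experience (step 6)
            if len(replay) >= batch:
                # mini-batch learning with the target network (step 7)
                for sj, aj, rj, sj2, dj in random.sample(replay, batch):
                    y = rj if dj else rj + gamma * float(np.max(Q_target.q(sj2)))
                    Q.update(sj, aj, y)
            s = s2
            path.append(s)
            if done:
                break
        if ep % sync_every == 0:
            Q_target.w = Q.w.copy()                  # every C steps reset Q' = Q
    return path, Q   # final path goes into P, then into the 2D list LP
```

In the full algorithm this loop runs once per source node, and each resulting path is appended to the 2D list LP.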

Evaluation
This section describes the results measured in the DQN simulation environment and the erasure coding network topology used to identify and analyze the experimental evaluation results of the proposed method.

Simulation Environment
This section describes the specifications and simulation environments of deep-learning workstations to evaluate the performance of DQN-based erasure coding network topologies. The deep learning simulation was conducted on a workstation running the Ubuntu 20.04.4 operating system, with a XEON 4110 (8 core × 2) CPU, 128 GB of DDR4 memory, and four RTX 2080 graphics cards.
OpenvSwitch was used to set up an environment containing virtual nodes and the set of links connected to each switch, and the open-source Ryu framework was used as the SDN (OpenFlow) network controller. To model network bandwidth fluctuations, the link bandwidth of each node and switch had a maximum of 1 Gbps and was configured to change randomly within the range of 300 to 600 Mbps. Switch failures were injected with a probability of up to 15%. The traffic intensity was set to 30% to 50% to reflect the assumption that some data were already being transmitted. The key variables and values used in the experiments are listed in Table 3.
Table 3. Parameters of network topology simulation.
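The fluctuation and failure settings above can be mimicked with a small per-link sampler (a sketch; the dictionary layout and function name are assumptions, not part of the paper's simulator):

```python
import random

LINK_MAX_BW_MBPS = 1000        # 1 Gbps physical link capacity
BW_RANGE_MBPS = (300, 600)     # real-time bandwidth fluctuation range
FAIL_PROB = 0.15               # maximum probability of a switch/link failure
TRAFFIC_INTENSITY = (0.3, 0.5) # fraction of bandwidth assumed already in use

def sample_link_state(rng=random):
    """Sample one link's state for a simulation step (Table 3 parameters)."""
    if rng.random() < FAIL_PROB:
        return {"up": False, "bw_mbps": 0.0}
    bw = rng.uniform(*BW_RANGE_MBPS)
    load = rng.uniform(*TRAFFIC_INTENSITY)
    return {"up": True, "bw_mbps": bw, "free_bw_mbps": bw * (1 - load)}
```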

Configuration                  Value
Training steps                 100
DQN episodes                   100
Link maximum bandwidth         1 Gbps
Link real-time bandwidth       300-600 Mbps
Traffic intensity              30-50%
Probability of link failure    0-15%

The network topology used to evaluate the DQN-EC performance was a fat tree [42]. A fat-tree topology is configured by a parameter α and consists of α³/4 servers and 5α²/4 switches. In the experimental evaluation, α was set to 8, resulting in a topology with 128 servers and 80 switches. Across the 128 servers, RS(6, 3) erasure coding is applied to generate six data blocks and three parity blocks, which are distributed and stored on randomly selected nodes; each stored block is 128 MB in size. For a broader evaluation, tests were conducted while increasing the probability of a switch failure from 0% to 15% at the time of a decoding request. To aid understanding of the fat-tree network topology with erasure coding, Figure 7 shows an example topology consisting of 16 servers and 20 switches (α = 4). In this configuration, green-shaded nodes are source nodes that transfer blocks during a decoding operation, and blue-shaded nodes are destination nodes that receive blocks and perform the decoding. In the above simulation environment and EC network topology, the DQN-EC-based network routing method (labeled DQN-EC) was compared with the underlying OSPF network routing algorithm.
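The fat-tree sizing rule (α³/4 servers, 5α²/4 switches) and the per-request transfer volume under RS(6, 3) can be sanity-checked with a small helper; `fat_tree_size` is a hypothetical name:

```python
def fat_tree_size(alpha: int) -> tuple[int, int]:
    """Servers and switches in a fat tree with parameter alpha."""
    servers = alpha ** 3 // 4        # alpha^3 / 4 servers
    switches = 5 * alpha ** 2 // 4   # 5 * alpha^2 / 4 switches
    return servers, switches

# RS(6, 3) decoding moves 6 data + 3 parity blocks of 128 MB each
BLOCKS_PER_DECODE = 6 + 3
DECODE_TRANSFER_MB = BLOCKS_PER_DECODE * 128
```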

Evaluation Result of DQN-EC
This section presents the change in the cumulative reward per episode for the designed DQN neural network model. It also presents the time and network throughput for transferring blocks in the erasure coding network topology to the destination nodes, depending on the probability of a switch failure. Finally, we present the recovery time as a function of the number of recovery requests. Figure 8 shows the cumulative change in the reward value according to the learning process of the agent when using DQN-EC and OSPF, with the x-axis representing the number of episodes and the y-axis representing the sum of the rewards per episode. On average, 95 different routing paths were generated, some quite similar and some completely different. However, as the episodes progressed, similar routes were created over episodes 80-100.
With the e-greedy method, the agent initially lacks sufficient knowledge of the current network topology. Thus, the agent mostly explored the environment and obtained low reward values; however, as the number of episodes increased, the reward value grew and converged to its maximum, indicating that optimized network routing paths were created in the final episodes. Figure 9 shows the block transfer time for switch failure probabilities of 0% to 15%, corresponding to the simulation parameters, when using DQN-EC and OSPF. The block transfer time from the source nodes to the destination node was measured with a single decoding request; when a decoding request is made, nine blocks of 128 MB each are sent to the destination node. Compared with the existing OSPF routing algorithm, the block transfer time is reduced. Up to the maximum switch failure probability of 15%, DQN-EC shows only a slight increase in block transfer time, from 5.2 to 11.8 s, whereas OSPF increases significantly, from 12.48 to 30.68 s. DQN-EC achieved an average block transfer time of 8.25 s and OSPF an average of 20.91 s, a difference of approximately 2.5-fold. The DQN-EC neural network model provides a faster block transfer time than the Dijkstra-based shortest-path OSPF because the agent avoids failed switches and selects the links of bandwidth-aware optimized routing paths. Figure 10 shows the network throughput for switch failure probabilities of 0% to 15% when using DQN-EC and OSPF. The measured network throughput improves on that of the traditional OSPF routing algorithm, although the performance of OSPF is similar to that of DQN-EC at failure probabilities of 0-4%.
However, from a switch failure probability of 5% onward, the network throughput of DQN-EC and OSPF diverges increasingly owing to increased network traffic and congestion probability. With OSPF, the network throughput of 455 MB/s measured at a 12% switch failure probability decreases slightly as the probability increases to 15%. Conversely, DQN-EC keeps the congestion probability as low as possible through learning; thus, its network throughput is not reduced even at the maximum switch failure probability of 15% and continues to improve, reaching up to 515 MB/s. We also assessed the recovery time incurred during decoding over the optimized routing paths. The recovery time is the sum of the network bottleneck cost and the transmission link cost and is compared against the OSPF routing algorithm, as expressed in Equation (11). Because the measured recovery time varies with the probability of a switch failure, this probability was fixed at 5% to maintain a constant block transfer time.
RT_r = BB_r + TL_r,  where TL_r = (Σ_k NB_k) / (Σ_k L_k)   (11)

Here, RT_r is the recovery time, which is the sum of the BB_r and TL_r costs; BB_r denotes the bottleneck bandwidth in the routing paths, TL_r the transmission link cost in the routes, NB_k the measured network bandwidth of link k, and L_k the link count. The bottleneck bandwidth indicates the bandwidth at the point of the routing path where network traffic congests, so a lower measured value is better. The transmission link cost is obtained by summing the measured network bandwidth along the optimized routing path and dividing by the total link count. The results for the network bottleneck and transmission link costs are shown in Figure 11. Figure 11a shows that the network bottleneck bandwidth of DQN-EC averaged 493 Mbps, somewhat lower than the 633 Mbps of OSPF. The network bottleneck is reduced compared with OSPF because, under intense traffic, decoding operations use optimized routing paths that avoid heavily loaded transport links as much as possible. Figure 11b shows that the measured transmission link costs are high for OSPF, which can maintain short transmission links only up to two recovery requests; as the number of recovery requests, network routes, and overall complexity grow, the costs become lower for DQN-EC, which considers both bandwidth and transport links. One possible concern is that the measured bottleneck bandwidth could reflect the processing limits of the internal bus bandwidth of the deep learning workstation; however, the GPU, memory, and CPU entries in the monitoring logs recorded during this experiment showed no full-load condition, so the internal bus bandwidth was not a limiting factor. Figure 12 shows the combined recovery time for the network bottleneck and link count costs.
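Assuming Equation (11) combines the route's bottleneck bandwidth with the ratio of summed bandwidth to summed link count, as the surrounding text describes, the metric can be sketched as follows (names hypothetical):

```python
def recovery_time_cost(path_bw_mbps, path_links):
    """Cost in the spirit of Equation (11): BB_r + TL_r for one route.

    path_bw_mbps -- measured network bandwidth NB_k of each hop on the route
    path_links   -- link count L_k of each hop (typically 1 per hop)
    """
    bb = min(path_bw_mbps)                    # bottleneck bandwidth BB_r
    tl = sum(path_bw_mbps) / sum(path_links)  # transmission link cost TL_r
    return bb + tl
```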
The measured recovery times of DQN-EC and OSPF gradually diverge as the number of recovery requests increases. DQN-EC takes 77 s to recover at the maximum of 10 recovery requests, whereas OSPF requires 165 s for the same number of requests. The optimized routing path of DQN-EC avoids bottlenecks as much as possible and chooses links with low transmission link costs. Therefore, even when the bandwidth of a transfer link is low, the recovery time is shorter than that of OSPF, which only reduces the simple transfer link cost.

Comparison and Summary of DQN-EC and Experiment Results
Similar to DQN-EC, RouteNet [24,25] improves the delay and jitter between the source and destination by up to 1.4-fold compared with OSPF. In [26], the authors compared a deep learning system with OSPF; the overall network throughput was improved by approximately 2.3-fold, and the average per-hop delay was approximately 12-fold lower in the proposed deep learning system. The authors of [27] reported approximately a 3-fold improvement in network throughput and average per-hop delay compared with OSPF. In [34], when the network load level is between 2 and 2.5, the average delivery time of OSPF converges to its maximum, whereas Q-routing records a similar average delivery time only at a network load level of 4, an improvement of approximately 1.8-fold. In [35], CQ-routing recorded the same average packet delivery time as OSPF up to 1500 training iterations; however, by 3000 iterations, the delivery time of OSPF had continued to increase to 70 s, whereas that of CQ-routing converged to near zero. In [37], DRQ-routing learns optimal routing policies 1.5-times faster than spectrum-aware Q-routing at low and intermediate network loads, and its average packet delivery time differs by up to 30 s from that of Q-routing. In addition, in [39], RL-routing improved the average file transfer rate by approximately 3-fold and the average utilization rate by approximately 2-fold compared with OSPF.
The DQN-EC proposed in this paper constructs its experimental environment by applying deep reinforcement learning with a DQN, adding the elements necessary for decoding in erasure coding, and selecting OSPF as the baseline for comparison. The block transmission time, which is analogous to the average packet delivery time, improved 2.5-fold at a link failure probability of 15%. The network throughput improved by 0.4-fold, and the bottleneck bandwidth was measured up to 100 Mbps lower than that of OSPF. With 10 decoding requests, the data were recovered approximately twice as fast as with OSPF.
The related studies above share the final goal of DQN-EC: computing optimal routing paths. However, DQN-EC incorporates a large-capacity distributed clustering system using erasure coding into the network topology. In addition, it can recover data quickly because failed switches in the network are avoided and, rather than simply minimizing the number of transmission links, the bandwidth of each transmission link is detected so that blocks are transmitted over an efficient routing path during decoding.

Conclusions
In this study, various hyperparameters and erasure coding elements were applied to a fat-tree network topology for a deep reinforcement learning-based DQN algorithm applied to neural network models. We proposed DQN-EC, in which blocks are transmitted efficiently when a decoding operation is requested, by designing a suitable reward function and applying it to an erasure coding network topology. DQN-EC identifies failed switches in a simulated environment in which the bandwidth of each node and switch is not fixed but changes dynamically, and avoids them by providing optimized routing paths. Compared with the OSPF algorithm, the block transfer time was shorter and the network throughput higher as the probability of a switch failure increased. Even as the number of decoding requests grew, the recovery time was reduced by 15-20% overall compared with that of the OSPF algorithms. In a simulation environment using the commonly applied RS(6, 3) erasure codes, the proposed DQN-EC performed more efficiently than the underlying OSPF algorithm; however, the results may vary depending on the parameters that constitute the erasure coding. Therefore, we analyzed other network topologies (B-Cube, NSFNet, and ARPANet) and classified and extracted the elements to which DQN-EC can be applied. Finally, we will further study an optimized neural network model design methodology suitable for each network topology and demonstrate the efficiency of DQN-EC through its application and experimental evaluation.