Achieving Balanced Load Distribution with Reinforcement Learning-Based Switch Migration in Distributed SDN Controllers

Distributed controllers in software-defined networking (SDN) have become a promising approach because of their scalable and reliable deployments in current SDN environments. Since network traffic varies in time and space, a static mapping between switches and controllers causes uneven load distribution among controllers. Dynamic switch migration methods can provide a balanced load distribution among SDN controllers. Existing reinforcement learning (RL) methods for dynamic switch migration, such as MARVEL, model the load balancing of each controller as a linear optimization problem. Although linear optimization is widely used for network flow modeling, it is not well fitted to the real-world workload of SDN controllers because correlations between resource types change unexpectedly and continuously. Consequently, using a linear model for resource utilization makes it difficult to distinguish which resource types are currently overloaded. In addition, it yields a high time cost. In this paper, we propose a reinforcement learning-based switch and controller selection scheme for switch migration, switch-aware reinforcement learning load balancing (SAR-LB). SAR-LB uses the utilization ratios of various resource types in both controllers and switches as the inputs of the neural network. It also treats switches as RL agents to reduce the action space of learning while still considering all possible migration cases. Our experimental results show that SAR-LB achieves better (closer to even) load distribution among SDN controllers because of its accurate switch migration decisions. The proposed scheme improves the normalized standard deviation among distributed SDN controllers by up to 34% over existing schemes.


Introduction
SDN provides powerful programmable network architecture and can be used to design a logical topology that defines the placement of a network entity (e.g., hardware equipment such as a router, switch, load balancer, firewall, and VPN) [1]. SDN separates network space into a control plane and a data plane. In the data plane, the switches transfer actual data between hosts, while the control plane defines the flow rule, which is the rule for routing data among switches [2]. Decoupling of the data plane and the control plane makes the efficient management of the network logic possible.
However, when incoming packets are massive, a single SDN controller [3] can become a performance bottleneck because all SDN switches are connected to it. In addition, the controller can be a single point of failure [4], since all packet processing stops if it fails. To address these two major problems, distributed SDN controllers have been proposed [5]. In distributed SDN controllers, each SDN controller manages its own domain network while SDN switches are statically mapped to a specific controller. Therefore, load imbalances between controllers can occur, and, owing to the static mapping between switches and controllers, uneven load distribution between controllers becomes significant.

The main contributions of this paper are as follows:
•	We enhance the switch migration decision-making scheme by considering various types of resource utilization of the controllers and of the switch to migrate as the input state of the RL model, to cope with the changing correlation among resource types;
•	We efficiently consider all possible migration cases with a relatively small action size (i.e., the number of controllers) by modeling the switches as RL agents;
•	We evaluate our proposed scheme against competing schemes in stochastic and varied environment settings. We evaluate SAR-LB under two network conditions (i.e., with and without bursty network traffic) and five cases with different numbers of switches and controllers. The results show that our scheme performs better when the number of switches is much larger than the number of controllers, which confirms that SAR-LB is well suited to common data centers where a controller manages more than dozens of switches [19][20][21][22].
The rest of the paper is organized as follows. We describe the background of our study in Section 2. In Section 3, we present the related work. We describe the details of our proposed scheme and present the experimental setup and evaluation results in Sections 4 and 5, respectively. The discussion is presented in Section 6, and we conclude our paper in Section 7.

Background
In this section, we describe the background of switch migration for load balancing in distributed SDN controllers. Then, we introduce how RL is used in the switch migration domain for load balancing in distributed SDN controllers.

Switch Migration in Load Balancing of Distributed SDN Controllers
In distributed SDN controllers, a switch can be connected to multiple SDN controllers. However, only one controller at a time manages the switch, processing its PACKET_IN requests and installing flow rules. The relationship between a switch and a controller is specified according to the southbound interface protocol.
For example, OpenFlow [2] provides multiple connections from SDN controllers to switches with different types of roles: (i) "master", (ii) "equal", and (iii) "slave". An SDN switch can be controlled by multiple controllers with "equal" and "slave" roles. The difference is that SDN controllers with the "slave" role have read-only access to the SDN switches, whereas controllers with the "equal" role receive all asynchronous messages and have full access to manage the messages sent from the SDN switches. The "master" role is similar to the "equal" role; however, there is only one "master" per SDN switch to manage asynchronous messages, and the other controllers become "slaves" for that switch. Controllers with different types of roles can provide fault tolerance to the SDN network. Additionally, even load distribution among the distributed SDN controllers can be achieved by migrating switches from an overutilized controller to an underutilized one through role changes between controllers. As a result, PACKET_IN requests from the switches can be handled by multiple controllers.
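As an illustration of these role semantics, the hedged sketch below (plain Python with hypothetical `Switch` and `Role` names, not an OpenFlow library API) enforces the one-master-per-switch rule and models a migration as a master role handover:

```python
from enum import Enum

class Role(Enum):
    MASTER = "master"
    EQUAL = "equal"
    SLAVE = "slave"

class Switch:
    """Tracks controller roles for one switch, mirroring OpenFlow's rule
    that at most one controller may hold the MASTER role at a time."""
    def __init__(self):
        self.roles = {}  # controller_id -> Role

    def set_role(self, controller_id, role):
        # When a controller claims MASTER, the previous master is demoted
        # to SLAVE, so the one-master invariant always holds.
        if role is Role.MASTER:
            for cid, r in self.roles.items():
                if r is Role.MASTER:
                    self.roles[cid] = Role.SLAVE
        self.roles[controller_id] = role

def migrate(switch, src_controller, dst_controller):
    """Migrate a switch by handing the MASTER role from src to dst;
    the former master is demoted automatically by set_role."""
    switch.set_role(dst_controller, Role.MASTER)
```

In a real deployment the role change would be carried by OpenFlow role-request messages; the sketch only captures the invariant that makes migration-by-role-change safe.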

Reinforcement Learning
RL is a type of learning concerned with what to do in dynamic situations of the environment [23]. Unlike other learning categories (e.g., supervised and unsupervised learning), RL learns by interacting with the environment. This interaction is made through the basic elements of RL framing: the agent (which receives a reward from the environment and sends back an action), the state (which represents the current situation of the environment), and the reward (the feedback from the environment on the change of state). RL can deal with complex problems; for example, it is used as a tool for training AI models in complex systems such as healthcare, robotics, and network automation [24]. Figure 1 presents an overview of deep Q networks (DQN). DQN [25] is the most well-known deep reinforcement learning method, and it has two characteristics. The first is the replay buffer: by using the replay buffer, DQN can store previous experiences and reuse them in the current training steps. The second is the target Q network, a supportive network that is slowly updated from the Q network, while the Q network itself is trained directly on samples from the replay buffer. With the target Q network, the Q network can be trained more stably owing to the conservative change of the target Q network. Algorithm 1 describes the whole process of DQN. In line 4, the DQN agent selects an action by the ε-decay policy, in which the action is selected either randomly or by the Q network, with the probability of each choice determined by the value of ε.
After selecting the action, the experience, which consists of state, action, next state, reward, and done (i.e., a Boolean value marking the end of the episode), is saved to the replay buffer. The training process of DQN is defined in lines 11 to 19. In particular, the loss function of DQN is an important part of the training. To define the loss function, y_x is computed as the combination of the reward and the maximum target Q value of the next state; y_x is then utilized as the label of supervised learning. Thus, the mean square error between the value of the Q network and y_x becomes the loss function of DQN.
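The ε-decay action selection described above can be sketched as follows (illustrative Python; the decay rate and floor are assumptions, not the paper's hyperparameters):

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-decay (epsilon-greedy) selection: with probability epsilon
    pick a random action, otherwise the greedy (maximum-Q) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay(epsilon, rate=0.995, floor=0.05):
    """Anneal epsilon toward a small floor so exploration fades over
    training; rate and floor are illustrative values."""
    return max(floor, epsilon * rate)
```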

Algorithm 1 DQN
Input: ReplayBuffer: data store for saving experience (i.e., state, action, reward, next state, done); Q: the neural network directly trained to approximate the value of actions in each state; target Q: the neural network trained from the Q network for stable training; EPISODE_NUM: total number of episodes; STEP_NUM: total number of steps in each episode
1. time_step = 0
2. for i in range (EPISODE_NUM):
3. ...
4. Select action by ε-decay policy
5. Send action to the environment and get state time_step+1 and reward by action
6. Save experience (state, action, reward, next state, done) to ReplayBuffer
...
11.-19. Sample experiences from ReplayBuffer and optimize the Q network by the loss (y_x − Q(state x , action x ))²
20. Apply weight parameters of Q to target Q
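A minimal sketch of the DQN target and loss computation from Algorithm 1 (NumPy only; `target_q` stands in for the target Q network, and all names are illustrative):

```python
import numpy as np

def dqn_targets(batch, target_q, gamma=0.99):
    """Compute the DQN regression targets
    y_x = reward + gamma * max_a' targetQ(next_state, a'),
    with no bootstrap term when the episode is done."""
    ys = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * np.max(target_q(next_state))
        ys.append(reward + bootstrap)
    return np.array(ys)

def mse_loss(q_values, actions, ys):
    """Mean squared error between Q(state_x, action_x) and y_x,
    i.e., the loss optimized in lines 11-19 of Algorithm 1."""
    picked = q_values[np.arange(len(actions)), actions]
    return float(np.mean((ys - picked) ** 2))
```

In a full implementation the gradient of this loss would be backpropagated through the Q network, while the target network is only periodically (or slowly) synchronized, as the text describes.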

Related Works
Many researchers have studied various solutions for load balancing of distributed SDN controllers. This section discusses existing studies based on conventional optimization and on RL-based decision-making for switch migration.

Switch Migration Schemes for Load Balancing in SDN Environment
Switch migration schemes in previous studies focused on PACKET_IN processing, resource utilization, and switch group migration for migration efficiency. In the studies of [6][7][8], switch migration decisions are made according to the number of PACKET_IN requests to the distributed SDN controllers. In [6], the authors designed a migration algorithm based on the distributed hopping algorithm (DHA). Their algorithm randomly selects a switch from the overutilized controller and broadcasts the upcoming migration activity to the neighboring controllers in each iteration. Therefore, communication overhead increases along with the synchronization of states for a global network view after migration. Al-Tam et al. [7] proposed a heuristic algorithm with "shift" and "swap" moves for switch migration. However, their approach increases the decision-making time for migration, because the "shift" and "swap" moves occur iteratively until the optimal switch-to-controller pair is decided for even load distribution among SDN controllers. Jie et al. [8] also proposed a switch migration scheme based on the response time of the PACKET_IN request. The response time depends on the network condition of the southbound communication; for example, link failures between the switch and controller may delay the response.
In [9,10], the authors also proposed switch migration schemes focusing on maximizing the resource usage of distributed SDN controllers. In their approaches, the overutilized controller selects a switch with less load and at the farthest distance to eliminate its overutilization status. In Cheng et al. [9], each underutilized controller plays the game independently until only one controller (i.e., the targeted controller) remains as the migration target. The decision-making for switch migration occurs repeatedly until the optimal targeted controller is selected for the migration. Sahoo et al. [10] also proposed a framework called ESMLB. Their framework performs a preliminary iterative calculation for every switch-to-controller pair prior to decision-making. Therefore, the computation overhead of the preliminary calculation may affect the response time of load balancing, as well as scalability in the distributed SDN controllers' environment.
Switch migration decisions involving the migration of switch groups have been studied in [11,12]. Therein, switch group migration from an overutilized to an underutilized controller was performed iteratively, which increased the migration cost. In Cello et al. [11], the authors proposed a decision algorithm based on the graph partitioning technique. In their approach, a switch migration decision was conducted among partitions with a short distance between the switches and the controller; therefore, their approach was suitable for small networks. Wu et al. [12] designed a switch migration-based load balancing scheme utilizing a non-cooperative game with iterative migration of switches. In their approach, switch migration activities occur greedily in each round of the game. Moreover, switch group migration adds extra overhead to the migration of switches.

Reinforcement Learning-Based Load Balancing in SDN Environment
The load balancing issue in distributed SDN controllers has been addressed by the integration of artificial intelligence (AI) in recent studies. RL is a widely used AI method in the SDN environment; particularly, it is used for load balancing through an optimal policy trained by the RL agent. Recent RL-based research in the SDN environment can be divided into two categories: (i) the use of RL in load balancing of the data plane, and (ii) the use of RL in switch migration for load balancing of the control plane.
In the case of RL used in load balancing of the data plane, Mu et al. [26] detect elephant flows in advance to reduce traffic between the controllers and switches. They employed two types of RL algorithms (i.e., traditional and deep reinforcement learning), achieving up to a 60% reduction in control overhead and a 14% improvement in table-hit ratio. In [27], Wu et al. proposed a deep Q-network (DQN)-empowered dynamic flow data-driven approach for the controller placement problem (D4CPP) that fully considers flow fluctuation, latency reduction, and load balancing. Their simulation results show that D4CPP outperforms traditional schemes by 13% in latency and 50% in load balance in SDN environments with dynamic flow fluctuation. Tosounidis et al. [28] concentrate on dynamic load balancing with predictive methods using CNNs and Q-learning. These methods try to predict the traffic and the state of controllers and networks in order to maximize the availability of the network. However, they compare the performance of their approach only with traditional methods, such as round-robin and weighted round-robin, without comparison against other RL-based solutions.
In the case of RL used in load balancing of the control plane, the switch migration-based load balancing of distributed SDN controllers has been addressed with a multi-agent RL scheme called MARVEL [13]. MARVEL trains its agents with controller load generated by a Poisson process, reflecting the overutilization that distributed SDN controllers experience under real-world controller load. MARVEL simplifies the switch migration action by defining the action space as three cases (i.e., STAY, IMPORT, and EXPORT). However, this simplified action space does not consider the many other possible switch migration cases at each step. Li et al. [14] also proposed RL-based switch migration decision-making for SDN controller load balancing. Their approach focuses on selecting an underutilized controller with minimal hop counts between the switch and the controller. This limits the selection space of underutilized controllers and cannot provide scalability for the distributed SDN controller architecture. Min et al. [15] addressed the switch migration problem of distributed SDN controllers with a Q-learning algorithm. In their paper, the authors analyzed the proposed algorithm; however, they did not compare it with other switch migration algorithms in their evaluation.

Design of SAR-LB
Our proposed switch-aware reinforcement learning load balancing (SAR-LB) is a reinforcement learning-based switch and controller selection scheme for switch migration. It takes the utilization ratios of various resource types as the input state of the RL agent and defines a switch as an RL agent to limit the action size while considering all possible cases of switch migration. We present the design of SAR-LB in three parts: (i) the definition of the RL communication channel, (ii) the training and working phases, and (iii) the DNN model for the DQN agent.
To describe SAR-LB in detail, it is effective to compare our proposed scheme, SAR-LB, with existing RL approaches to switch migration. Table 1 shows the comparison between conventional RL-based switch migration methods and our SAR-LB. Both MARVEL and SAR-LB use a DNN model to approximate the value of actions in the current SDN environment status, while the methods in [14,15] use tabular Q-learning to compute the Q value. With tabular Q-learning, only a limited number of states can be defined because of the limited space of the Q table. In the Q table, a row index represents the controller selected as the initial controller, and a column index represents the selected switch migration case. When the action is selected by the Q table, the table is updated passively by the reward. Thus, it is difficult to adapt to dynamic variation of loads.
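For contrast with the DNN-based methods, a tabular Q-learning update of the kind used in [14,15] can be sketched as follows (hypothetical state/action keys; the learning rate α and discount γ are illustrative values):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9  # illustrative learning rate and discount factor

Q = defaultdict(float)  # (state, action) -> estimated value

def q_update(state, action, reward, next_state, actions):
    """One tabular Q-learning update: Q[s, a] moves toward
    r + gamma * max_a' Q[s', a']. Only the (s, a) entry actually visited
    is touched, which is why a tabular scheme adapts slowly to a
    dynamically varying load."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```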
To adapt promptly to dynamic variation of the controllers' loads, the load of each controller should be used as an input state of a neural network, or a machine-learning method should be used to approximate the value of actions for each load state of the controllers. MARVEL and SAR-LB both take these considerations into account. However, there are differences between MARVEL and SAR-LB. Unlike SAR-LB, MARVEL is based on multi-agent reinforcement learning (MARL), a distributed version of RL. Moreover, MARVEL does not consider the utilization of each resource type; it models the resource utilization of each controller as a linear optimization instead. This yields less-optimized switch migration, since the utilization of different resource types is not consistently correlated. For example, the memory and CPU utilization of the controller increase exponentially as the number of PACKET_IN messages increases [16].
Thus, it is almost impossible for a linear model with constant per-resource-type weights to capture the dynamically changing correlation among the various resource types. Moreover, when the network bandwidth of a controller is heavily used (i.e., the utilization of network bandwidth is high), the CPU utilization of the controller can jitter unexpectedly while the utilization of network bandwidth remains constant [17]. Thus, the cause of overutilization cannot be clarified if utilization is not tracked per resource type. Because of this, it is better to model the load of the controller as utilization ratios of each resource type. Furthermore, all possible switch migration cases should be considered for accurate switch migration decision-making.

The Definition of RL Communication Channel for Switch Migration
SAR-LB uses the resource utilization of various resource types as inputs to the training phase and can provide more precise actions by considering more switch migration cases. We constructed the RL framing for our study as shown in Figure 2. To compose the RL framing, it is necessary to define two entities and three communication channels. The two entities are the agent and the environment: the agent is a candidate Open vSwitch [29] for migration, and the environment is an SDN environment with target controllers to which the candidate switch can migrate. The three communication channels (i.e., state, reward, and action) are important factors in the learning process of the agent, because the state and action spaces define the input and output spaces of the RL agent. Moreover, the reward is important because it is used to calculate the objective function of each agent.

We defined switches as RL agents, and the action is defined as selecting the controller for migration (i.e., selecting the target controller). Furthermore, the action size is defined by the number of controllers that a switch can select. Because we use a neural network to evaluate the value of actions in each input state, the output of the neural network is defined as the value of each controller in the current SDN environment status. Thus, the target controller for the switch is selected as the index of the output element with the maximum value.
Figure 3 defines the state for the RL model. The state is a linear feature vector of the resource utilization of the controllers and the resource utilization ratio of the switch selected for migration. In Figure 3, we represent the linear feature vector in a table format for a better understanding of the state definition. Each row defines one controller's resource utilization per resource type. This resource utilization of the controller is defined as five tuples: C i .ID is an identifier number to classify each controller, and zero (i.e., 0) is a Boolean flag indicating that the current information is about a controller. Moreover, the following three C i .Util fields define the controller's resource utilization rate for each resource type (e.g., CPU, RAM, and network bandwidth). This information for the controller is merged with the information of the remaining controllers, which is defined in the same format.

After the controller information is described, additional information on the switch selected for migration is appended to the feature vector, as shown in the last row of the table in Figure 3. The information on the switch is similar to that of the controllers. However, the controller ID in the switch information represents the ID of the initial controller where the switch is located. In this context, the Boolean flag is one (i.e., 1), which shows that the information is about a switch, and the following three S selected .Util fields represent the resource utilization ratio consumed in the initial controller by the switch selected for migration.

Switch migration can be defined as the process of selecting a target controller to which the load of the selected switch will be added. In our definition of switch migration, the initial controller also needs to be considered as a candidate controller along with the other controllers in the learning environment. In this context, as shown by the blue arrow in Figure 3, the load of the selected switch should be removed from the load of its initial controller, C i , for a fair comparison with the other controllers, because the target controller selection is based on comparing the loads of the controllers in the environment.

To reflect the difference between the resource utilization of the initial controller and the target controller in the reward, the reward is calculated as the variation of the squared difference of resource utilization per resource type between the previous time-step and the current time-step. It is calculated as described in Equation (1).
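To make the state and reward definitions concrete, the sketch below builds a Figure 3-style feature vector and computes a reward in the spirit of Equation (1). Since the equation itself is not reproduced here, the reward form (the decrease in the squared per-resource-type utilization difference between the initial and target controllers across time-steps) is a hedged reading of the text, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def build_state(controllers, switch):
    """Figure 3 state sketch: one 5-tuple (ID, flag=0, CPU, RAM, NB
    utilization) per controller, followed by a 5-tuple for the switch to
    migrate (ID of its initial controller, flag=1, utilization it adds)."""
    rows = [[cid, 0.0, *util] for cid, util in controllers]
    init_cid, sw_util = switch
    rows.append([init_cid, 1.0, *sw_util])
    return np.asarray(rows, dtype=np.float32).flatten()

def reward(prev_init, prev_tgt, curr_init, curr_tgt):
    """Hedged reading of Equation (1): the reward is the decrease, from
    the previous time-step to the current one, of the summed squared
    per-resource-type utilization difference between the initial and
    target controllers; an evening-out migration yields a positive reward."""
    prev = sum((a - b) ** 2 for a, b in zip(prev_init, prev_tgt))
    curr = sum((a - b) ** 2 for a, b in zip(curr_init, curr_tgt))
    return prev - curr
```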

Training and Working Phases of SAR-LB
For load balancing of distributed SDN controllers, we propose switch migration decision algorithms with training and working phases in Algorithms 2 and 3, respectively. Algorithm 2 describes our proposed switch migration scheme using the DQN agent [25]. The number of repetitions is set in line 1 of Algorithm 2. After initializing the SDN environment, the initial state is created from the information of the controllers and the target switch to migrate (lines 2 to 4). The communication and action execution process between the DQN agent and the SDN environment is performed in lines 6 to 12. Lines 13 to 14 define the learning process of the DQN agent.

Algorithm 2 Training phase of SAR-LB
1. for i in range (EPISODE_NUM):
2. Initialize SDN environment and get initial state state 0 with CONTROLLER_NUM and SWITCH_NUM
3. Select migration target switch, S 0 , randomly in the overutilized controller
4. Append information of migration target switch on initial state state 0 as shown in Equation (1)
5. for j in range (STEP_NUM):
6. Select destination controller (i.e., action), C j , by DQN agent with epsilon-greedy policy
7. Migrate target switch, S j , based on selected controller, C j
8. Get next state, state j+1 , and reward, reward j , from SDN environment
9. Save experience (state j , controller j , reward j ) on ReplayBuffer
10. Select next target switch for migration, switch j+1 , randomly in the overutilized controller
11. Assign next state, state j+1 , to current state, state j
12. end for
13.-14. Sample experiences from ReplayBuffer and train DQN agent by sampled experience // equal to lines 11-19 in Algorithm 1
15. end for
In Algorithm 2, we propose the training phase for switch migration through reinforcement learning. However, the training phase is hard to utilize directly because of its random switch selection. To utilize the trained model effectively, the working phase for switch migration is described in Algorithm 3.
In Algorithm 3, the controller with the highest resource utilization rate is selected in lines 2 to 5. In lines 7 to 10, the controller information, together with the resource utilization rate of each switch, C max .S j , in the controller C max , is taken as the input of the neural network. After this, actions and rewards for each switch in C max are collected. In lines 11 to 12, switch migration is performed through the action with the highest reward.

DNN Model for DQN Agent
The DQN agent was configured with two types of layers, the LSTM [30] layer and the fully connected (FC) layer, to recognize changes in the incoming load. Long short-term memory (LSTM) is a type of recurrent neural network (RNN) [30]; an RNN is a neural network designed to operate on sequential information between layers. The FC layer is a simple feed-forward layer. FC layers commonly form the last few layers of a network; thus, the input to the FC layer is the output of the final pooling, convolutional, or LSTM layer. In the combination of LSTM and FC, the number of input features of the FC layer equals the number of hidden units of the LSTM.
As shown in Figure 4, the history length, which defines the number of consecutive experiences drawn when sampling from the DQN replay buffer, is used as the LSTM sequence length. The output of the LSTM layer is then connected to the fully connected layer so that an action can be determined from the consecutive experiences.
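Sampling a run of consecutive experiences of the history length, as described above, can be sketched as follows. This is an illustrative sketch under our own naming; the buffer is treated as an ordered list of experiences.

```python
import random

def sample_sequence(buffer, history_length):
    """Sample `history_length` consecutive experiences from the replay
    buffer, so the LSTM sees an unbroken run of states (the history
    length doubles as the LSTM sequence length)."""
    if len(buffer) < history_length:
        raise ValueError("buffer shorter than history length")
    # choose a start index such that the full window fits
    start = random.randrange(len(buffer) - history_length + 1)
    return buffer[start:start + history_length]
```

The returned window preserves temporal order, which is what allows the LSTM to model how the load evolves over consecutive time-steps.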

Algorithm 3 Working phase of SAR-LB
Input: C_i: controller i; C_i.CPU, C_i.RAM, C_i.NB: CPU utilization, RAM utilization, and network bandwidth utilization of C_i; C_i.utilization: total resource utilization rate of C_i; C_i.S_j: switch j of C_i; CONTROLLER_NUM: the number of controllers in the SDN environment
Output: Load-balanced controllers
1. Set list List_Utilization_controller for C_i.utilization of all controllers
2. for i in range(CONTROLLER_NUM):
3.     Append C_i.utilization to List_Utilization_controller
4. end for
5. Find C_max in List_Utilization_controller with the maximum resource utilization
6. Set list List_Reward_switch for the rewards of the switches in C_max
7. for C_max.S_j in C_max:
8.     Get reward C_max.S_j.reward from the action C_max.S_j.action selected by the agent for switch C_max.S_j based on the current input state
9.     Append C_max.S_j.reward to List_Reward_switch
10. end for
11. Find the action (i.e., destination controller) action_max with the maximum reward
12. Migrate the target switch with the maximum reward to the target controller action_max
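The working phase above can be sketched in plain Python. This is an illustrative sketch, not the authors' code: `agent` stands in for the trained DQN (returning an action and its reward for a given switch), and the controller/switch data structures are our own simplification.

```python
def working_phase(controllers, agent):
    """One working-phase step of Algorithm 3: pick the most utilized
    controller, query the agent once per switch, and return the
    (switch, destination) pair whose action yields the highest reward."""
    # Lines 2-5: controller with the maximum total utilization
    c_max = max(controllers, key=lambda c: c["utilization"])
    # Lines 7-10: collect (reward, switch, action) for every switch in c_max
    candidates = []
    for switch in c_max["switches"]:
        action, reward = agent(c_max, switch)
        candidates.append((reward, switch, action))
    # Lines 11-12: migrate the switch whose action has the maximum reward
    best_reward, best_switch, destination = max(candidates, key=lambda t: t[0])
    return best_switch, destination
```

Only one migration is executed per step, so the scheme moves load gradually rather than re-mapping many switches at once.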

Experimental Environment
To evaluate our proposed scheme, SAR-LB, we built a Python-based SDN simulation environment on a host machine with an Intel Xeon Skylake (Gold 6230) 2.10 GHz 20-core processor, 384 GB of RAM, and a V100 GPU, as shown in Table 2. In each episode, every switch in the simulated SDN environment generates a workload drawn from a Poisson distribution [31]. The loads for each resource type (i.e., CPU, RAM, and network bandwidth) are also generated differently each time the environment is created, which lets our proposed scheme learn in a dynamic load environment. The DQN agent, implemented in PyTorch [32], was trained with the hyperparameters shown in Table 3.
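Per-episode workload generation of this kind can be sketched with a stdlib-only Poisson sampler (Knuth's algorithm), since Python's `random` module has no Poisson draw. The rates and the per-resource structure below are illustrative assumptions, not the paper's actual parameters.

```python
import math
import random

def poisson_sample(lam):
    """Draw one Poisson-distributed sample via Knuth's algorithm."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def episode_workload(num_switches, lam_cpu, lam_ram, lam_bw):
    """Per-switch load for each resource type; the rates can differ per
    environment creation so the agent trains under dynamic loads."""
    return [{"cpu": poisson_sample(lam_cpu),
             "ram": poisson_sample(lam_ram),
             "bw":  poisson_sample(lam_bw)} for _ in range(num_switches)]
```

Varying `lam_cpu`, `lam_ram`, and `lam_bw` between environment creations changes the correlations between resource types, which is exactly the non-stationarity the scheme is meant to handle.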

Table 2. Hardware configuration.

Competing Schemes to Evaluate with SAR-LB
To evaluate our proposed scheme, we tested the following schemes for comparison. For statistical soundness, we trained SAR-LB and MARVEL in a stochastic SDN network environment, i.e., one that generates different random network overheads in each episode. Furthermore, we averaged our evaluation results over four network simulations generated with different random seeds. SAR-LB is compared with the following approaches. Round robin (RR): the switch with the highest load is selected from the overutilized controller, and the target controller for migration is selected in a round-robin fashion.
ElastiCon [34]: ElastiCon is a well-known study of dynamic switch migration. To balance load between controllers, it compares every possible switch migration and selects the best case based on the standard deviation of utilization among controllers. For a fair comparison with MARVEL and SAR-LB, we modified its algorithm to find the best migration case using the utilization difference between the initial and target controllers, as shown in Equation (1), which is the reward function used in both MARVEL and SAR-LB.
MARVEL [13]: In MARVEL, controllers run multiple agents, and the RL model takes controller resource utilization as input and decides the optimal switch-to-controller mapping for migration. We did not compare our scheme with [14,15] because we consider MARVEL the state-of-the-art RL-based switch migration scheme, and the approaches of [14,15] have difficulty handling dynamic load changes in switches.
SAR-LB: In our proposed scheme, we collect the actions (i.e., target controllers selected by the DQN) of the switches in the controller with the highest load and then migrate the switch with the maximum reward to its target controller. Compared to MARVEL, our scheme uses a novel input state definition that represents the utilization ratio per resource type. In addition, its action space covers every possible target controller to which a switch can be migrated.
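The modified ElastiCon baseline described above, which exhaustively scores every migration case, can be sketched as follows. Equation (1) is not reproduced in this excerpt, so the scoring below, the post-migration utilization gap between the overloaded source and the candidate target, is only our hedged stand-in for it; data structures and names are illustrative.

```python
def best_migration(controllers, switch_load, utilization):
    """Exhaustive ElastiCon-style search: try every (switch, target) case
    from the overutilized controller and keep the case that leaves the
    smallest utilization gap between source and target after the move."""
    # overutilized controller = the one with maximum utilization
    src = max(utilization, key=utilization.get)
    best, best_gap = None, float("inf")
    for switch in controllers[src]:
        for dst in controllers:
            if dst == src:
                continue
            # hypothetical post-migration gap if `switch` moves src -> dst
            gap = abs((utilization[src] - switch_load[switch])
                      - (utilization[dst] + switch_load[switch]))
            if gap < best_gap:
                best, best_gap = (switch, dst), gap
    return best
```

The cost of this search grows with the product of switch and controller counts, which is the overhead the RL-based schemes avoid at decision time.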

Evaluation Results and Analysis
In our simulation, we ran experiments of 300 time-steps in five scenarios: three that increase only the number of switches and two that increase both the number of switches and controllers. We evaluated these five scenarios under two network conditions: (i) without bursty traffic and (ii) with bursty traffic. The average workload under both conditions is shown in Figure 5. The workload of each controller changes dynamically over time, so Figure 5 shows an uneven load distribution. Figure 5b shows the average workload distribution of the bursty traffic used in the evaluations, in which the network overhead of each controller increases sharply and unexpectedly.

Comparison of Load Balancing Rate
We applied our proposed switch migration decision-making scheme, SAR-LB, to all controllers in the five scenarios under the two network conditions explained above. The load balancing rate before and after applying our scheme is shown in Figures 6 and 7, where the x-axis shows the time-steps and the y-axis shows the standard deviation of the workload distribution. As shown in Figures 6 and 7, after applying SAR-LB (the "red" line in Figures 6 and 7), the load is distributed more evenly among controllers than without load balancing. We also applied the alternative schemes (i.e., RR, ElastiCon, and MARVEL) in the testbed to evaluate the performance of our proposed scheme; their workload distributions are shown in Figures 6 and 7 in different colors. Overall, the standard deviation of the load distribution with our approach is better than that of the competing schemes in the case with four controllers and 80 switches.
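The evaluation metric above, the standard deviation of controller loads, and its normalized form can be computed as below. The exact normalization used in the paper is not given in this excerpt, so the ratio-to-baseline reading here is a hedged assumption.

```python
import math

def load_stddev(utilizations):
    """Standard deviation of controller loads; lower means a more even
    load distribution across controllers."""
    mean = sum(utilizations) / len(utilizations)
    return math.sqrt(sum((u - mean) ** 2 for u in utilizations)
                     / len(utilizations))

def normalized_stddev(scheme_std, baseline_std):
    """One plausible normalization: a scheme's standard deviation as a
    fraction of a baseline's, so 0.66 would mean 34% better."""
    return scheme_std / baseline_std
```

A perfectly balanced set of controllers gives a standard deviation of zero, which is the ideal the schemes are compared against.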

Tables 4 and 5 show the average standard deviation of the workload distribution for SAR-LB and the competing schemes across the five scenarios under the two network conditions, respectively. The standard deviation of RR was even higher than the average standard deviation without load balancing in both scenarios. In the environment with four controllers and 20 switches without bursty network traffic, MARVEL, ElastiCon, and SAR-LB achieved 112%, 121%, and 126% better load balancing than the setup without load balancing, respectively.
This trend is clearer in the cases where only the number of switches is increased. However, the load balancing performance of our proposed scheme, SAR-LB, worsens when both the number of switches and the number of controllers are increased. As a result, SAR-LB shows up to 20% better normalized standard deviation than ElastiCon and outperforms the other schemes in four of the five evaluation scenarios; however, it shows a 24-26% worse normalized standard deviation than ElastiCon and MARVEL in the scenario with 16 controllers and 80 switches.
The evaluation results with bursty network traffic are shown in Table 5, which shows results similar to Table 4. However, the differences in standard deviation become larger as the number of switches or controllers changes. SAR-LB shows a 1% better normalized standard deviation than ElastiCon in the case of four controllers and 20 switches, and this difference becomes much larger than under the network condition without bursty traffic. As a result, SAR-LB shows up to 34% better normalized standard deviation than ElastiCon.
As a result, SAR-LB outperforms MARVEL and ElastiCon when the number of controllers is less than eight, and this trend strengthens as the number of switches increases. However, SAR-LB balances load less efficiently than the other schemes when the number of controllers is increased. This is because the dimensions of the input state and action space grow with the number of controllers, which makes it harder for the RL agents in our scheme to learn the current state of the SDN environment and to select an action.
Figure 7. Comparison of workload distribution after load balancing in four controllers and 80 switches with bursty network traffic.

Comparison of Decision Time per Time-Step
Table 6 compares the decision-making time per time-step across the five scenarios. SAR-LB takes 53-104% of MARVEL's decision-making time. This result comes from optimizing the input pipeline, which merges the input states of every switch in a controller into one input state. SAR-LB and MARVEL show slower decision times than ElastiCon and RR because of the inference overhead of their DNN models. However, the inference overhead is relatively small compared to the total switch migration time, as shown in [9,18]; thus, we consider the inference overhead acceptable for selecting a more efficient switch migration case.
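The input-pipeline optimization described above, merging the per-switch input states of one controller into a single batch so one forward pass scores every switch, can be sketched as follows. `forward` stands in for the DQN's batched inference and is an illustrative assumption.

```python
def merged_decision(states, forward):
    """Batch all per-switch states of one controller and run a single
    inference call; return (switch index, action) with the highest
    Q-value over all switches."""
    batch = list(states)              # one merged input batch
    q_rows = forward(batch)           # single forward pass, one row per switch
    best_switch, best_action, best_q = None, None, float("-inf")
    for i, row in enumerate(q_rows):
        a = max(range(len(row)), key=lambda j: row[j])
        if row[a] > best_q:
            best_switch, best_action, best_q = i, a, row[a]
    return best_switch, best_action
```

Batching replaces one inference call per switch with a single call per controller, which is where the decision-time saving over per-switch inference comes from.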
Although SAR-LB shows a time cost similar to or lower than MARVEL's in Table 6, our algorithm is difficult to distribute among controllers, which is its main disadvantage. We discuss this scalability problem in Section 6.

Discussion
Our proposed scheme, SAR-LB, improves the normalized standard deviation among controllers by up to 34% compared to the other competing schemes. The improvement mainly comes from our effective design of the communication channels, the training and working phases, and the DQN model. In addition, the performance of SAR-LB improves as the number of switches increases. Based on these experimental results, we claim that SAR-LB can be used in common SDN environments in data center networks, because each controller in a data center generally manages dozens of switches [19][20][21][22].

However, SAR-LB still has room for improvement in its scalability. We suggest transforming the feature vectors into a fixed-size image format. This transformation is advantageous because the input size becomes independent of the numbers of controllers and switches, and using convolutional layers allows the neural network to be stacked deeper. Consequently, this transformation would increase scalability as well as produce more accurate decision-making.
Parallelizing our working-phase algorithm across multiple controllers to enhance scalability is hard to achieve. We therefore suggest data parallelism as used in distributed deep learning: the inference overhead of the working phase can be distributed across multiple controllers and the inference results gathered with a collective communication library (e.g., MPI [35]). Another way to reduce the inference overhead is to set a threshold for inference, i.e., to run inference only for the switches that occupy a large share of the controller's resource usage, as determined by a utilization threshold.
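The threshold-based filtering suggested above can be sketched in a few lines. The data layout and the threshold semantics (a switch's share of its controller's total load) are our illustrative assumptions.

```python
def switches_above_threshold(switch_loads, controller_load, threshold):
    """Return only the switches whose share of the controller's total
    load exceeds the threshold; only these are passed to the (expensive)
    inference step."""
    return [s for s, load in switch_loads.items()
            if load / controller_load > threshold]
```

With a threshold of, say, 0.25, lightly loaded switches are skipped entirely, shrinking the per-step inference batch at the cost of ignoring migrations that would move only small loads.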

Conclusions
In this paper, we presented SAR-LB, a switch migration scheme with RL-based decision-making that achieves a more evenly balanced load distribution among distributed SDN controllers than existing schemes. Unlike existing RL approaches, we include the utilization ratios of various resource types in the state information and make each switch an RL agent so that all switch migration cases are considered efficiently. As a consequence, our proposed scheme provides better load distribution because of its accurate switch migration decision-making.
We evaluated the performance of SAR-LB in simulated testbeds across five scenarios, varying the number of controllers or switches in each. The results show that SAR-LB distributes load better than the competing schemes in four of the five scenarios, improving load distribution among distributed SDN controllers by up to 34% over existing schemes.

Conflicts of Interest:
The authors declare no conflict of interest.