Smart Site Diversity for a High Throughput Satellite System with Software-Defined Networking and a Virtual Network Function

High Throughput Satellite (HTS) systems aim to push data rates to the order of Terabit/s, making use of Extremely High Frequencies (EHF) or free-space optical (FSO) in the feeder links. However, one challenge that needs to be addressed is that the use of such high frequencies makes the feeder links vulnerable to atmospheric conditions, which can effectively disable channels at times or temporarily increase the bit error rates. One way to cope with the problem is to introduce site diversity and to forward the data through the gateways not affected, or at least less constrained, by adverse conditions. In this paper, a virtual network function (VNF) based on reinforcement learning defines a smart routing service for an HTS system. Experiments were conducted on an emulated ground-satellite system in CloudLab, testing a VNF implementation of the approach with software-defined networking virtual switches; the results indicate the expected performance of the proposed method.


Introduction
The networking industry has been dominated by the use of proprietary hardware appliances built with application-specific integrated circuits (ASIC). Network hardware appliances each deliver one or more specific network functions, such as routing, intrusion detection, and traffic shaping. Network Function Virtualization (NFV) aims to replace this traditional approach to networking infrastructure by decoupling network functions from the hardware appliances, running them instead as modular software on commercial off-the-shelf (COTS) servers with virtualization. The modular approach allows implementing virtualized network functions (VNF) as connectable blocks that can be chained to define complex services. NFV brings several benefits such as elasticity, availability, reliability, shorter deployment times, and reduced capital and operational expenditures compared to the traditional methods [1].
While less than a decade old, the NFV concept has gained rapid acceptance to support diverse applications, although mostly in the terrestrial domain. Recent works have started the discussion of the possible benefits of NFV and Software-Defined Networking (SDN) for satellite communication networks, for example, to achieve high rates in satellite-5G applications with wide-scale coverage and high availability. It is expected that the use of Extremely High Frequencies (EHF) or free-space optical (FSO) in the feeder links will push the data rates to the order of Terabit/s in future High Throughput Satellite (HTS) systems. However, one limitation is that the resulting performance may not be consistent because, with such high frequencies, atmospheric attenuation due to clouds, fog, and rainfall can cause severe time-varying channel impairments [2]. Even with the use of Adaptive Coding and Modulation (ACM) or improved error correction codes, bad atmospheric conditions can disable or significantly reduce the performance of the satellite link [3].
This work investigates an intelligent mitigation approach, defined as a VNF, to deal with the channel impairments of an HTS system by exploiting spatial diversity for the feeder link. This approach allows forwarding the feeder link communications through selected gateways from a set of geographically distributed alternatives that are interconnected through high-speed terrestrial links, with the gateways separated by at least 20 km. While this possibility has been discussed in the literature [2,3], the prevailing approach to the problem is static, involving manual interventions to change the forwarding policy with knowledge of the system and transmission medium conditions. This work introduces a reinforcement learning (RL) approach that autonomously decides how to set the data forwarding policy without requiring knowledge of the weather conditions. The proposed method allows quick routing adaptation not only to precipitation fades in the Q/V bands but also to changes in the traffic carried by the different feeder links. We implemented the smart routing method as an SDN controller that runs as a VNF. The deployment of the proposed VNF for onboard controllers helps to improve the reliability and availability of the feeder links.

Background
Bandwidth scarcity is one of the technical challenges to be addressed in future HTS systems. To overcome this limitation, the entire Ka frequency band can be allocated to the user link, moving the feeder link to EHF (or FSO) with spatial diversity of the gateways to combat the channel impairments that may randomly appear due to atmospheric conditions [4]. Various approaches to gateway redundancy were proposed to prevent the need for doubling the infrastructure required for the ground segment. These approaches assume that all gateways are interconnected by high-speed terrestrial links, and the traffic of a given user can be redirected to another gateway. These approaches, denoted as Smart Gateway Diversity (SGD) [5,6], use Network Control Centers to make the routing decision of serving each user beam by: 1. carriers from all of the N gateways (N+0 site diversity); 2. a subset of the gateways using P redundant gateways (N+P site diversity), so that the traffic of a gateway experiencing an outage can be routed through an alternative gateway; or 3. the concurrent operation of all gateways, including the redundant ones (++N+P site diversity).
While most of the works about SDN/NFV integration focus on traffic steering towards VNF instances that implement network functions, only limited works exist on implementing an SDN controller as a VNF. A systematic literature review [7] lists possible controller functions implemented as VNFs in existing works. An SDN controller that functions as a traffic load balancer was implemented as a VNF [8]. The controller was implemented as a virtual network service to deploy additional controllers based on the network traffic to improve the overall network performance. An SDN controller that load balances traffic between two Snort (intrusion prevention system) VNF instances based on a control-theoretic approach was proposed [9]. A deep learning-based traffic classifier implemented as a VNF directs an SDN controller to route traffic with application awareness [10]. An elastic routing service implemented as a Ryu controller is deployed as a VNF to load balance the network traffic across dynamically provisioned switches in a framework called UNIFY ESCAPE [11].
On the other hand, a few works have explored the application of Software Defined Networking, Virtualization, and Network Function Virtualization to satellite networks. For example, Bertaux et al. [12] suggested likely broadband communication scenarios for these technologies, including an inter-hub handover with site diversity case, enhancements for virtual network operator (VNO) services, and the integration of satellite and terrestrial networks. In another work, Gardikis et al. [13] emphasized the use of NFV to ensure the competitiveness of the satellite communications sector. Several approaches are certainly possible for this integration. Li et al. [14] focused on the use of SDN/NFV to orchestrate the control, forwarding, access, service, and management planes operating at different orbits of the space segment. A system architecture for the Internet of Space Things/CubeSats (IoST) was proposed based on SDN and NFV to improve network resource utilization and simplify network management [15]. Prior works such as the one by Cai et al. [16] have formalized the optimization problem to be handled by the VNF. However, these approaches require global knowledge of the link states.

SDN/VNF HTS Architecture
In this work, the number of required gateways N is increased by P additional gateways, which provide redundant paths and capacity. As with the ++N+P site diversity scheme, the n = N + P gateways are concurrently active. Routing occurs at the payload level (i.e., via packet switching) as decided by a VNF. No particular constraints are assumed about the symmetry of the links, so the system model applies to diverse applications. The user beams and the gateway beams are assumed to occur in different bands, with the latter using either EHF (Q/V band) or FSO. The user beams are located at a lower frequency band (e.g., Ka-band), so they are less affected by atmospheric attenuation than the feeder links.
For each user, two routing decision elements are required for each communication direction (i.e., data from or towards the user). Examples of the services include high-definition video streaming (i.e., data flowing mainly towards the user) and a large userbase generating and reporting data to an Internet-of-things cloud server (i.e., data flowing from the user to the gateways). Figure 1 depicts the main components of the system. Each routing decision element consists of an intelligent agent that learns how to select the optimal routing decision just by observing the performance of the past transmissions. As a result, one distinct feature of the proposed approach compared to the existing routing methods considered for the HTS is that it does not require the knowledge of the current system state, including the atmospheric conditions at the gateway sites or the bit-error-rates of the channels. Another advantage is that the agent does not need to be aware of physical-layer changes, such as those handled by adaptive-coding modulation. Through the learning mechanism, the agent observes the outcome of those changes and modifies its forwarding policy. This functionality is achieved by defining agents as VNF that implement reinforcement learning.

Agent Learning Goal
User communications occur as a result of requests originated from the clients with data flowing in either or both directions. The routing goal is to minimize the completion time of these requests (i.e., the response time) and their loss ratio. The agents need not be aware of the actual path of the communications but can determine the resulting performance of each routing decision. This is achieved by testing the state of the feeder links at regular intervals. The problem is formulated as a Markov decision process (MDP), which models the feeder link selection process. There are two variants of reinforcement learning investigated for this task: Q-Learning and actor-critic (learning automata-LA).
Learning Automata (LA) is a decision-making agent working in a random environment that selects an optimal action from a set of actions based on their performance [17]. The environment reacts to the selected action with either a favorable or an unfavorable response. The output from the environment is fed back to the LA as an input to assist in deciding the next action. When an action is selected, it has a certain penalty probability c_i of producing an unfavorable response and a reward probability d_i = 1 − c_i of producing a favorable response. The reward and penalty probabilities vary over time and depend on the environment. Each action is selected based on a probability distribution. If an action resulted in a favorable response, its selection probability is increased; otherwise, it is decreased.
An LA is described by a six-tuple, {φ, X, λ, P, A, G}. Table 1 indicates the model parameters. The number of actions r is always less than or equal to the number of internal states s.

Table 1. Entities and descriptions.

The diverse types of LA are classified based on the kind of input provided to the agent. The S-model has been applied to network communication problems due to its adaptive learning nature (see for example [18][19][20]) and is the model adopted in this work.
With Q-Learning, each action is linked to a certain reward (or Q-value). The agent's goal is to maximize the long-term reward associated with its selected actions: at state s, the agent receives an immediate reward by choosing an action a, based on a policy π. The agent's long-term reward is then given by the total discounted reward from that state s [21]. Q-Learning has been applied to different routing problems in the past (see for example [22,23]).

Feeder Selection with Reinforcement Learning
Central to the concept of RL is the notion of action rewards, which needs to be formulated for the problem context. Considering that N links are available to forward packets, the VNF emits periodic requests at regular intervals (dt seconds) to the switch to evaluate those links through a request/reply packet exchange; the time difference between request and reply indicates the response time td_i^dt of each link i, i = 1, 2, . . . , N, at the interval dt. The difference between the number of request departures and reply arrivals indicates the packet loss rates on the links. The packet loss rate pkt_loss_rate_i^mt is measured on the links at an interval of mt seconds. Algorithms 1 and 2 indicate the proposed procedure for updating the delay and loss metrics by separate threads. The moving exponential average values of both metrics are combined to produce the expected link cost cost_i^ft at each flow modification interval ft by another thread, as described in Equation (1),
where k is an arbitrary penalty constant, e.g., it may include the retransmission time [24]. At each interval ft, the average cost a_cost_i^ft is updated via exponential smoothing with the hyperparameter α, 0 ≤ α ≤ 1, as given in Equation (2).
This cost (i.e., a negative reward) is used to modify the routing policy of the feeder links as explained in the next sections.
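Since the displayed forms of Equations (1) and (2) are not reproduced here, the cost combination and smoothing steps can be sketched as follows. This is a minimal sketch assuming a delay-plus-weighted-loss cost and standard exponential smoothing; the function and variable names are illustrative:

```python
# Sketch of the per-link cost update (illustrative names; assumes
# cost = avg_delay + k * loss_rate and exponential smoothing).

def link_cost(avg_delay, loss_rate, k):
    # Combine the smoothed delay and packet-loss metrics into one cost;
    # k penalizes losses (e.g., it may reflect the retransmission time).
    return avg_delay + k * loss_rate

def smooth_cost(prev_a_cost, cost, alpha):
    # Exponentially smoothed average cost, with 0 <= alpha <= 1.
    return alpha * cost + (1 - alpha) * prev_a_cost
```

A larger α weights recent measurements more heavily, while a smaller α smooths out transient fluctuations.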

Routing with SLA
At each flow modification interval, the expected cost (or negative reward) of the most recently selected feeder link is compared to that of the other links to update the routing policy. As a result, the selection probability of the most recently selected link either increases or decreases. The value of the reward or penalty is calculated through parameter β^ft as defined by Equation (3), where a_cost_v^ft is the most recent average cost obtained for feeder link v at time ft, normalized by the maximum average cost observed across the links.
Parameter β^ft represents the normalized reward or penalty awarded to the selected feeder link with respect to the observed performance of the other links. The routing policy is updated using [17]: where v is the index of the most recently selected link. The probabilities of the other links are updated as follows: where a and b are the reward and penalty parameters. With ergodic assumptions, both parameters are identical.
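The probability updates above (whose displayed equations are omitted here) can be sketched with the standard S-model linear reward-penalty scheme from the LA literature; the exact update used in this work may differ, and the function name is illustrative:

```python
def sla_update(probs, v, beta, a, b):
    # S-model linear reward-penalty update: v is the most recently
    # selected link, beta in [0, 1] is its normalized penalty, and
    # a, b are the reward and penalty parameters (identical under
    # ergodic assumptions). The vector remains a valid distribution.
    r = len(probs)
    updated = []
    for j, p in enumerate(probs):
        if j == v:
            updated.append(p + a * (1 - beta) * (1 - p) - b * beta * p)
        else:
            updated.append(p - a * (1 - beta) * p + b * beta * (1.0 / (r - 1) - p))
    return updated
```

With β = 0 (a fully favorable response) the selected link's probability grows toward 1; with β = 1 it shrinks while the other links gain.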
The feeder links are selected randomly according to the probability distribution Pr as described in Algorithm 3.

Routing with Q-Learning
The agent updates the Q-values based on the combined cost computed from the measured packet-loss ratio and round trip time (RTT) delays as given in (2). At time ft, the agent observes the state of the environment s^ft, selects an action (feeder link) a^ft based on the policy, receives an immediate payoff cost^ft, observes the subsequent state y^ft, and adjusts its Q-values, where γ is the discount factor and α is the learning rate. The feeder link selection uses an ε-greedy approach, where a feeder link is selected at random with probability ε. Otherwise, the feeder link with the lowest Q-value is selected, since the Q-values represent costs. By selecting the cost as the payoff, the agent tries to minimize both the average latency and the packet-loss ratio at every state. The Q-Learning approach is described in Algorithm 4.
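The update and selection rules above can be sketched as follows, assuming the standard Q-Learning update toward the discounted minimum next-state cost; the names and the state encoding are illustrative:

```python
import random

def q_update(Q, s, a, cost, s_next, alpha, gamma):
    # Move Q(s, a) toward the observed cost plus the discounted best
    # (lowest) cost achievable from the next state.
    Q[s][a] += alpha * (cost + gamma * min(Q[s_next]) - Q[s][a])

def select_link(Q, s, eps):
    # Epsilon-greedy: explore a random link with probability eps,
    # otherwise exploit the link with the lowest Q-value (a cost).
    if random.random() < eps:
        return random.randrange(len(Q[s]))
    return min(range(len(Q[s])), key=lambda a: Q[s][a])
```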

Routing Using Hierarchical SLA (HSLA)
We explored the use of hierarchical SLAs besides the above two methods as a means to improve the efficacy of the routing decisions. To illustrate, consider that four feeder links s1, s2, s3, and s4 are available to route packets. Two LAs, LA1 and LA2, located at level-1 of the hierarchy handle half of the links each, i.e., LA1 selects among s1 and s2, whereas LA2 handles the other two links s3 and s4. At level-2 of the hierarchy, LA3 selects the optimal link among the outcomes of LA1 and LA2. Figure 2 depicts the general concept behind hierarchical SLAs.
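The two-level selection can be sketched as follows, assuming four links split evenly between the level-1 automata as in the example above; all names are illustrative:

```python
import random

def la_select(probs):
    # Sample an action index from a selection-probability vector.
    x, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if x < acc:
            return i
    return len(probs) - 1

def hsla_select(level1_probs, level2_probs, links):
    # LA1 picks between links[0] and links[1]; LA2 picks between
    # links[2] and links[3]; LA3 then picks between the two winners.
    left = la_select(level1_probs[0])
    right = 2 + la_select(level1_probs[1])
    winner = la_select(level2_probs)
    return links[left] if winner == 0 else links[right]
```

Because each automaton chooses between only two actions, its selection probabilities converge faster than those of a single flat LA over all four links.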

Proof of Concept
A system prototype was developed and tested on an emulated satellite network in CloudLab [25]. The prototype uses Open vSwitch (OVS) to emulate the gateway handover, with the intelligent agent-based routing methods implemented as a VNF. The implementation is based on the Ryu SDN framework [26]. For testing purposes, the user requests are modeled as REpresentational State Transfer (REST) requests.

Flow Rules to Optimize Response Times
The reinforcement learning agent running on the controller optimizes the feeder link decisions by updating the flow rules on the OVS based on its adaptive policy. It makes decisions based on the monitored service delay and packet-loss ratios of the links, which are observations obtained by the monitor module.

Monitor Module
A separate thread from the Ryu controller sends specially crafted Internet Control Message Protocol (ICMP) echo packets at fixed intervals (dt seconds) through each of the feeder links using OFPPacketOut messages. It then receives the ICMP replies through OFPPacketIn messages, computes the delay, and stores it in a global data structure. The ICMP packet has a 44-byte payload that contains the following information: the switch ID, the target IP address of a given server associated with the feeder link, and the creation time stamp.
The delay is computed as the time difference between the time at which an ICMP response is received from a server and the time stamp in the payload of the response. The method used for delay computation is described in Algorithm 1. The number of echo requests sent to and received from each server is updated in the respective global data structures. Likewise, another thread handles the estimation of the packet losses based on the number of echo requests sent and received at a certain rate (every mt seconds) and computes the packet-loss ratio using Algorithm 2.
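The probe payload and delay computation can be sketched as follows. The 44-byte layout shown (8-byte switch ID, 4-byte IPv4 address, 8-byte timestamp, zero padding) is an assumption for illustration; the paper does not specify the exact field widths:

```python
import struct

# Assumed 44-byte payload layout: 8-byte switch ID, 4-byte IPv4
# address, 8-byte float timestamp, 24 bytes of zero padding.
PAYLOAD_FMT = "!Q4sd24x"  # struct.calcsize(PAYLOAD_FMT) == 44

def build_payload(switch_id, server_ip_packed, now):
    # Payload embedded in the crafted ICMP echo request.
    return struct.pack(PAYLOAD_FMT, switch_id, server_ip_packed, now)

def delay_from_reply(payload, now):
    # Delay = receive time minus the creation time stamp echoed back.
    _switch_id, _ip, sent = struct.unpack(PAYLOAD_FMT, payload)
    return now - sent
```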

TCP Packet Handling
The handover is implemented by defining a replicated service that is handled by back servers connected to the gateways. Clients send requests addressed to a Virtual IP address (VIP), which are forwarded to the servers according to the policy selected by the VNF controller. The IP addresses of the back servers are hidden from the outside world. In the beginning, the SDN controller receives the requests destined for the VIP and adds a flow rule, using a tuple of five match fields, toward one of the back servers, usually the server at index 0. While adding the flow rule, it sets the destination IP and destination MAC address fields of the packets to the IP address and MAC address of the chosen back server and sets the out_port (OFPActionOutput) to the switch port to which the back server is connected. It also adds a flow rule to forward the TCP packets in the reverse direction to send the TCP responses back to the clients. In the reverse flow rule, it sets the source IP address and source MAC address fields to the VIP and Virtual MAC (VMac) addresses and sets the out_port (OFPActionOutput) to the switch port to which the client is connected.
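The pair of flow rules can be sketched in a framework-agnostic way as match/action structures; the VIP, VMac, field names, and port numbers below are illustrative, and actual rules would be installed through the controller's OpenFlow API:

```python
# Illustrative VIP/VMac values; a back server is described by its
# IP, MAC, and the switch port it is attached to.
VIP, VMAC = "10.0.0.100", "00:00:00:00:01:00"

def forward_rule(client_ip, client_port, server):
    # Client -> VIP traffic is rewritten toward the chosen back server.
    return {
        "match": {"ip_proto": 6, "ipv4_src": client_ip, "ipv4_dst": VIP,
                  "tcp_src": client_port, "tcp_dst": 80},
        "actions": [("set_ipv4_dst", server["ip"]),
                    ("set_eth_dst", server["mac"]),
                    ("output", server["port"])],
    }

def reverse_rule(server, client_ip, client_switch_port):
    # Server -> client responses are rewritten to appear from VIP/VMac.
    return {
        "match": {"ip_proto": 6, "ipv4_src": server["ip"], "ipv4_dst": client_ip},
        "actions": [("set_ipv4_src", VIP),
                    ("set_eth_src", VMAC),
                    ("output", client_switch_port)],
    }
```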

Dynamic Flow Modification
A server is selected at the beginning of each flow modification interval (ft seconds) based on the chosen algorithm by a separate thread. The value of ft is chosen to be greater than the values of the intervals dt and mt. In addition to Q-Learning and the SLA algorithms, results obtained with Round-Robin (RR) have been included as reference performance. The agent performs the following steps to modify the flow entries of the switch at the beginning of each flow modification interval:
1. Selects a server based on a policy:
• SLA: the agent selects a server based on the performance of the servers in the previous flow modification interval, as in Algorithm 3.
• Q-Learning: the agent selects the server that has the minimum Q-value, as in Algorithm 4.
• RR: the destination server is selected in sequential order.
In the former two methods (i.e., the learning methods), the forward flow rule of the switch is not modified if the server selected in an interval ft is the same as in the previous interval ft − 1.
2. Looks up the flow table for a match with the VIP as the destination IP address and the client IP address as the source IP address. If it finds a matching entry, it modifies the flow rule using the OFPFC_MODIFY_STRICT command with actions to set the destination IP address and destination MAC address fields to those of the selected server, and the out_port to the switch port to which the server is connected.
3. Adds a new flow rule to handle the reverse traffic from the newly selected server if no matching rule exists in the flow table.
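The per-interval decision logic can be sketched as follows; here select_server stands in for any of the SLA/Q-Learning/RR policies and modify_flow for the OFPFC_MODIFY_STRICT update, both illustrative:

```python
def flow_modification_step(prev_server, select_server, modify_flow):
    # Run once per flow modification interval (ft seconds): pick a
    # server, and only touch the switch when the choice changed.
    server = select_server()
    if server != prev_server:
        modify_flow(server)  # rewrite dst IP/MAC and out_port; add the
                             # reverse rule if none matches
    return server
```

Skipping the modification when the choice is unchanged avoids the packet drops that flow-rule updates can cause in the switch.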

Testbed Setup
The experimental network built on Clemson's cluster of the CloudLab testbed is depicted in Figure 3. The topology consists of four back servers, one Open vSwitch (OVS), a client (emulating the aggregated traffic flow of multiple users), and the learning agent-based SDN controller as a VNF. The client is connected to the back servers via the OVS. The learning agent decides the forwarding of HTTP requests from the client to the servers based on their observed performance, as measured by the response times of the ICMP echo/reply packet exchange and the packet-loss ratios on the links connecting the OVS to the back servers.
The back servers s1, s2, s3, and s4 run on Xen Virtual Machines (VM), each configured with two CPU cores (Intel Xeon 2.0 GHz). Apache-2.4.18 was installed on the back servers to provide the simulated network service that users need to access to either retrieve or send information. Open vSwitch runs on the machine switch and forwards the HTTP requests and responses between the client and the back servers. This machine has 56 logical processors (Intel Xeon 2.40 GHz, 14 cores per socket). The SDN controller runs on the Xen VM controller, which has four CPU cores. Artificial delays and packet losses were introduced on the links connecting the back servers to the switch using Linux's NetEm tool (Traffic Control) to emulate propagation latencies and the impact of atmospheric channel impairments. A Poisson traffic generator was developed with various configurable sending rates (λ) to emulate aggregated traffic from clients. The client in Figure 3 emulates the users in Figure 1, the switch represents a router on the satellite, and the back servers represent the links connecting to the ground stations. Figure 3. Emulated HTS system with VNF providing reinforcement learning-based routing.

Results
Our experiments were conducted in a satellite-like environment by creating emulated delays and packet losses between the client and servers. The experiments were conducted with different assumptions for the packet loss of the links, emulating different weather conditions. A delay of 8 ms was configured on the links connecting the servers to the OVS, in addition to a 300 ms delay imposed by CloudLab while provisioning the resources. The packet-loss ratios used in our test scenarios are listed in Tables 2-8. The performance of each algorithm was measured by considering HTTP GET request transmissions for static files of sizes varying between 100 KB and 1 MB and by sending HTTP POST requests with a data payload of 10 KB from the client machine. Each experiment was run for 10 min and repeated six times to obtain statistical averages.

Scenario-1
In test Scenario-1, the links connecting the servers to the switch are configured with the packet-loss percentages given in Table 2. These packet losses are emulated using Linux's Traffic Control, which can drop packets at a selected rate before they reach the IP network stack. In this scenario, server s3 is the fastest one. However, if too many requests are sent to this server, its response times will increase, making other links (i.e., server s1) more appropriate. Figure 4a,b depict the performance of the forwarding algorithms while handling HTTP GET and POST requests sent from the client to the back servers. The average delay (i.e., the response time) in serving files is reported. The average delays with both learning algorithms are ≈80% lower than those achieved with RR for GET requests and ≈60% lower for POST requests.
The request-loss ratios are depicted in Figure 4c,d. The request-loss ratio is the ratio of requests not serviced by the system to the total number of requests sent by the client(s). The request-loss ratio is ≈90% better with both learning algorithms compared to RR for both GET and POST requests.
Both the SLA and Q-Learning algorithms show better average response times and request-loss ratios than RR. The variations observed between SLA and Q-Learning are due to the stochastic nature of the server selection. Tables 3 and 4 show the number of times each server was selected by the algorithms while serving GET and POST requests in Scenario-1. The learning algorithms selected the non-optimal servers s2 and s4 fewer times than RR and utilized server s3 more frequently, whereas the link usage was split equally with RR, as expected.

Scenario-2
In the second scenario, a dynamic environment is considered. To this end, the packet-loss percentages of the links are changed every 5 min as given in Table 5. The servers s1 and s3 are assigned the lowest (emulated) packet-loss percentages. Figure 5a,b show an increase in the average delays with SLA as a function of the load λ compared to Q-Learning, since the SLA algorithm selects servers using a probability distribution. It takes a certain amount of time for the SLA to adjust the policy and find the best link after a change in the packet-loss configuration. However, Q-Learning performed better than both RR and SLA. With HSLA, the learning occurred faster as each agent was limited to deciding between two choices, which translates into improved performance for both the average delay and the request-loss ratios in serving HTTP GET requests.
With GET requests, the average delays of HSLA, Q-Learning, and SLA were ≈20%-50%, ≈20%-40%, and ≈15%-35% better than RR. HSLA was ≈10%-40% better than Q-Learning, and Q-Learning was ≈2%-20% better than SLA. With POST requests, the average delays of HSLA and Q-Learning were ≈60% better than RR, whereas SLA was ≈50% better than RR. Q-Learning was ≈5%-20% better than HSLA and ≈20%-40% better than SLA.
Figure 5c,d show that the request-loss ratio percentages were better with all three learning algorithms than with RR. The request-loss ratios of HSLA, Q-Learning, and SLA were ≈80%, ≈90%, and ≈60% better than RR, and Q-Learning was ≈75% and ≈55% better than SLA and HSLA when serving HTTP GET requests. The request-loss ratios of HSLA, Q-Learning, and SLA were ≈80%, ≈90%, and ≈40%-75% better than RR, and Q-Learning was ≈50%-80% and ≈10%-80% better than SLA and HSLA when serving POST requests.
Tables 6 and 7 show the number of times each server was selected by the algorithms while serving GET and POST requests in Scenario-2. All of the learning algorithms selected the non-optimal servers s2 and s4 fewer times.
Q-Learning and HSLA used them fewer times than SLA, whereas RR used all of the servers equally. The learning algorithms chose the servers s1 and s3 an almost equal number of times, since the quality of the links connecting them alternates between good and bad every 5 min.

Scenario-3
In Scenario-3, the efficiency of the learning algorithms was tested in a dynamic environment where packet-loss changes occur at a higher rate, i.e., every 2 min. The packet-loss percentages on the links connecting the servers to the client are given in Table 8. Servers s1 and s3 were configured with the lowest packet losses. Figure 6a shows that the average delay obtained with SLA was higher than with the other three algorithms when serving GET requests; except for SLA, all of them showed equivalent performance. Figure 6b indicates that Q-Learning produced the lowest average delay when serving POST requests. With GET requests, the average delay of RR was ≈15% and ≈25%-40% better than Q-Learning and SLA, respectively. The average delay with HSLA was similar to RR, whereas Q-Learning was ≈20% better than SLA, and HSLA was ≈10% better than Q-Learning. When sending POST requests, the average delays of HSLA, Q-Learning, and SLA were ≈30%, ≈40%, and ≈20% better than RR, and Q-Learning was ≈20% better than HSLA and ≈30% better than SLA.
Figure 6c,d depict the request-loss ratios obtained with the algorithms. All learning algorithms show better performance, and Q-Learning shows the best performance. When sending GET requests, the request-loss ratios of HSLA, Q-Learning, and SLA were ≈60%, ≈90%, and ≈50% better than RR, and Q-Learning was ≈90% better than both SLA and HSLA. When sending POST requests, the request-loss ratios of HSLA, Q-Learning, and SLA were ≈60%, ≈90%, and ≈60% better than RR. Q-Learning was ≈75% better than both SLA and HSLA. Tables 9 and 10 show the number of times each server was selected by the algorithms in Scenario-3 while serving GET and POST requests. All of the learning algorithms chose the non-optimal servers s2 and s4 the fewest times, with Q-Learning selecting them the least, whereas RR used all the servers equally. Since the LA algorithms involve updating selection probabilities and making stochastic server selections, their learning was not as efficient as that of Q-Learning. When the quality of the links changes more rapidly, the LA algorithms find it harder to choose the optimal link due to the learning delay.

Discussion
All the learning algorithms showed markedly better performance in the static environment (Scenario-1). The reason is that limited link switching was needed, since s3 was consistently the fastest option. However, in a dynamic environment, where the fastest option alternates between s1 and s3, the link switching occurs at a higher rate, which is a desirable feature for an HTS to be able to adapt to weather condition changes. A drawback of this mechanism is linked to an implementation issue rather than the technique itself, because the dynamic flow modifications may produce packet drops while the flow rules are being modified in the switch. However, despite the additional delay involved in TCP retransmissions, the average response times were observed to improve. The performance penalty caused by the dynamic modification of flows in OpenFlow switches is a known issue [27][28][29]. It was experimentally verified that OpenFlow SDN hardware switches can incur 3 ms-30 ms of latency due to flow rule modifications [27]. Updating the selection probabilities with the SLA algorithm requires a certain amount of time to find the best link when the environmental conditions change. There is a possibility that non-optimal servers are selected during the transient period due to the delay in learning. If the frequency of the changes in the environment increases, the switching between servers is also expected to increase, causing additional packet losses at the switch ports. With HSLA, the learning happens faster as the selection probabilities are adjusted in a shorter time, since only two actions are available at each SLA. Another possible way to reduce these issues is through priority-based flow rules [30], which will be investigated in the future.

Conclusions
With the experimental results obtained with an emulated multi-site satellite network, it was shown that an SDN/VNF approach could dynamically switch transmissions among the feeder links of an HTS system to alleviate the performance degradation brought by the adverse atmospheric conditions, which can temporarily affect one or more of the gateway sites and cause network congestion. The use of reinforcement learning-based algorithms was explored for this task to reduce both packet losses and the average latency. Since reinforcement learning does not need prior training, the proposed mechanisms help to improve the autonomy of HTS systems.
Funding: This research was supported by the National Aeronautics and Space Administration (NASA) grant #80NSSC17K0525.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: