A Two-Hops State-Aware Routing Strategy Based on Deep Reinforcement Learning for LEO Satellite Networks

Abstract: Low Earth Orbit (LEO) satellite networks can provide complete connectivity and worldwide data transmission capability for the Internet of Things. However, arbitrary flow arrival and uneven traffic load among areas bring about an unbalanced traffic distribution over the LEO constellation. Therefore, the routing strategy in LEO networks should be able to adjust routing paths adaptively based on changes in network status. In this paper, we propose a Two-Hops State-Aware Routing Strategy Based on Deep Reinforcement Learning (DRL-THSA) for LEO satellite networks. In this strategy, each node only needs to obtain the link state within the range of its two-hop neighbors, and the optimal next-hop node can be output. The link state is divided into three levels, and a traffic forwarding strategy for each level is proposed, which allows DRL-THSA to cope with link outage or congestion. A Double-Deep Q-Network (DDQN) is proposed in DRL-THSA to determine the optimal next hop by inputting the two-hops link states. The DDQN is analyzed from three aspects: model setting, training process, and running process. The effectiveness of DRL-THSA, in terms of end-to-end delay, throughput, and packet drop rate, is verified via a set of simulations using Network Simulator 3 (NS3).


Introduction
As a powerful supplement to terrestrial networks, satellite networks are playing an increasingly significant role in the next-generation global communication system [1]. Satellite networks inherently offer many advantages, such as worldwide coverage and better multicast ability. Compared with geostationary earth orbit (GEO) and medium earth orbit (MEO) systems, low earth orbit (LEO) satellite systems have a shorter propagation delay, lower propagation loss, and globally seamless coverage [2,3]. However, since LEO satellite networks are usually composed of tens or hundreds of satellites, their routing problem is more complicated than that of terrestrial networks, mainly due to features such as dynamic link states and unbalanced traffic load caused by arbitrary flow arrival and communication hot spots [4]. Therefore, merely applying terrestrial routing algorithms to satellite networks is impracticable.
To transmit service data in LEO satellite networks efficiently, an effective routing strategy is essential. Previously, many proposed routing strategies for LEO satellite networks focused on minimizing the end-to-end propagation delay. With the explosive growth of satellite applications, however, two defects have appeared in traditional satellite routing strategies: the packet drop rate at the network layer becomes abnormally high, and the cumulative queuing delay during transmission becomes non-negligibly large. An ant colony optimization-based routing strategy is proposed in [5] for LEO networks; it can adjust the routing path when the network topology changes, but requires a long convergence time.

The main contributions of this paper are summarized as follows:

• A two-hops state-aware routing strategy based on deep reinforcement learning (DRL-THSA) is proposed for LEO satellite networks. In DRL-THSA, each node collects link state information within its two-hop neighbors and makes routing decisions based on this information. The link state information is exchanged between nodes through Hello packets; therefore, DRL-THSA can discover node failure events in time and change the next-hop node.

• A setup and update method for the link state is proposed. The link state is divided into three levels, and a traffic forwarding strategy for each level is presented, which allows DRL-THSA to cope with link congestion.

• A routing decision method based on the DDQN network is proposed, and the DDQN is analyzed from three aspects: model setting, training process, and running process.
The remainder of this paper has the following structure: In Section 2, the LEO satellite network model, which includes the satellite network topology, the setup and update of link states, and two-hops state aware updating, is described. In Section 3, the deep reinforcement learning model setting and the routing algorithm are presented in detail. The experimental results are discussed in Section 4, and Section 5 draws the conclusions.

LEO Satellite Networks Model
At the beginning of this section, we first define the symbols used later in Table 1.

Table 1. Definition of the symbols.

G                The directed graph of the LEO system
V                The set of satellites
E                The set of inter-satellite links
t_c              The queue check interval
I_avg(t)         The average input packet rate
O_avg(t)         The average output packet rate
Î_avg            The predicted average input packet rate
Ô_avg            The predicted average output packet rate
λ_I              The weight of the past average input rate
λ_O              The weight of the past average output rate
α_0, α_1, α_2    Parameters used to calculate λ_I and λ_O
L_max            The maximum length of the buffer queue
L(t)             The current length of the buffer queue
q                The queue occupancy rate
p                The predicted queue occupancy rate
T_1              The threshold between free state and busy state
T_2              The threshold between busy state and congested state
X                The traffic reduction ratio
t_s              The desired time for a satellite to reside in free state
N                The index of a satellite node
n                The direction of an inter-satellite link, n ∈ {1, 2, 3, 4}
N_s              The source node
N_d              The destination node
m                The number of neighbors of the current satellite node
s_i              The number of link states of neighbor satellite i
RAAN             The right ascension of the ascending node
ω                The mean anomaly
α                The weight of inter-plane ISLs
β                The weight of intra-plane ISLs
r_d              The reward for success
r_c              The punishment for a mistake
θ_online         The weights of the online DNN
θ_target         The weights of the target DNN
N_target         The number of iterations before resetting θ_target
ε                The greedy value for DDQN

Satellite Networks Topology
The LEO system is modeled as a directed graph G = (V, E), where V represents the set of satellites and E represents the set of inter-satellite links (ISLs). Each satellite has four ISLs, including two intra-plane ISLs and two inter-plane ISLs [19]. Due to the extreme variation of their angular velocity, inter-plane ISLs cannot be built within the cross-seam or in the north and south polar areas. To simplify the change of network topology, we adopt the Virtual Node (VN) strategy to set up the satellite network topology. In a VN-based topology, a virtual node stands for the current physical satellite above a specific area of the earth's surface [20]. A virtual node and a physical satellite correspond one-to-one at any time, and the correspondence changes if a physical satellite moves out of the coverage of the current VN or into the coverage of another VN [21]. This process of changing correspondence is called handoff. When a handoff happens, the state information is transferred from the former physical satellite to the latter. In this way, rotating physical satellites can be converted into fixed virtual nodes, and the dynamic topology is transformed into a correspondingly static one. As shown in Figure 1, we construct the LEO satellite network based on STK.
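As an illustration, the VN-based grid topology can be sketched as follows. This is a minimal sketch under simplifying assumptions: the plane/slot indexing is ours, the seam is handled by simply omitting inter-plane ISLs between the first and last plane, and the temporary switch-off of inter-plane ISLs over the polar areas is not modeled.

```python
# Sketch: VN-based grid topology of an Iridium-like LEO constellation
# (6 planes x 11 satellites, as used in the paper's simulations).
# Assumptions: cross-seam inter-plane ISLs are omitted entirely; the
# polar-area switch-off of inter-plane ISLs is not modeled here.

def build_topology(num_planes=6, sats_per_plane=11):
    """Return adjacency sets for virtual nodes keyed by (plane, slot)."""
    adj = {(p, s): set() for p in range(num_planes) for s in range(sats_per_plane)}
    for p in range(num_planes):
        for s in range(sats_per_plane):
            # Two intra-plane ISLs: predecessor and successor in the same ring.
            adj[(p, s)].add((p, (s - 1) % sats_per_plane))
            adj[(p, s)].add((p, (s + 1) % sats_per_plane))
            # Up to two inter-plane ISLs: left and right neighbor planes,
            # except across the seam (between plane 0 and the last plane).
            for q in (p - 1, p + 1):
                if 0 <= q < num_planes:
                    adj[(p, s)].add((q, s))
    return adj

topology = build_topology()
# Interior satellites keep 4 ISLs; seam satellites keep only 3.
print(len(topology[(2, 5)]), len(topology[(0, 5)]))  # 4 3
```

Because the VN strategy makes this grid static, such an adjacency structure only needs to be built once; handoffs swap the physical satellite behind each (plane, slot) key without changing the graph.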

Setup and Update of Link State
In satellite networks, when a satellite receives a packet, the packet is inserted in sequence into the buffer queue of the corresponding direction, waiting to be sent out. However, the buffer space is limited, and accumulated packets will fill up the whole queue if the traffic is too heavy; packets that subsequently arrive at a full queue will be dropped. Therefore, it is essential to monitor the queue to reduce unnecessary packet loss.

Let t_c denote the queue check interval. According to the average input packet rate I_avg(t − t_c) and the average output packet rate O_avg(t − t_c) in the past, we predict the average input packet rate Î_avg and the average output packet rate Ô_avg for the next t_c seconds with Equations (1) and (2):

Î_avg = λ_I · I_avg(t − t_c) + (1 − λ_I) · I_avg(t)  (1)

Ô_avg = λ_O · O_avg(t − t_c) + (1 − λ_O) · O_avg(t)  (2)

where λ_I and λ_O represent the weights of the past average input rate and the past average output rate, respectively, with 0 < λ_I < 1 and 0 < λ_O < 1. These weights act as filters. The average input and output packet rates are meant to represent the long-term average packet rate, which should be counted over a long period, so short-term light traffic load needs to be filtered out. Therefore, the selection of λ_I and λ_O is essential. If these weights are too small, the predicted rate nearly equals the instantaneous traffic load and fluctuates with it; if they are too large, the predicted rate responds too slowly to genuine changes in the traffic load. Both cases result in ineffective estimation and routing. Hence, λ_I and λ_O are calculated adaptively by Equations (3) and (4):

λ_I = α_1, if |I_avg(t) − Î_avg(t − t_c)| ≤ α_0 · Î_avg(t − t_c); α_2, otherwise  (3)

λ_O = α_1, if |O_avg(t) − Ô_avg(t − t_c)| ≤ α_0 · Ô_avg(t − t_c); α_2, otherwise  (4)

Thus, the estimated average rate does not change much when the instantaneous traffic load is close to the estimated traffic load of the last interval. The short-term light traffic load is filtered, and the average packet rate can be estimated effectively.
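The rate prediction above can be sketched as a small function. This is a minimal sketch, assuming the piecewise weight form described in the text; the concrete values of α_0, α_1, and α_2 are illustrative, not the paper's.

```python
# Sketch of the average-rate prediction of Equations (1)-(4): an exponentially
# weighted moving average whose weight switches between two values depending on
# how far the newly measured rate deviates from the previous estimate.
# The parameter values below are illustrative assumptions.

ALPHA_0 = 0.2   # relative deviation threshold
ALPHA_1 = 0.5   # weight used when the measurement is close to the estimate
ALPHA_2 = 0.9   # weight used when the measurement deviates (filters bursts)

def predict_rate(prev_estimate, measured_rate):
    """Predict the average packet rate for the next check interval t_c."""
    deviation = abs(measured_rate - prev_estimate)
    lam = ALPHA_1 if deviation <= ALPHA_0 * prev_estimate else ALPHA_2
    # `lam` multiplies the past average, so a large `lam` means the estimate
    # barely moves when a short-term burst or lull is measured.
    return lam * prev_estimate + (1 - lam) * measured_rate

est = 100.0
for sample in [102.0, 98.0, 400.0]:   # a short-term burst arrives last
    est = predict_rate(est, sample)
print(round(est, 1))
```

Note how the final burst of 400 packets/s moves the estimate only modestly: the large weight on the past average filters the short-term load, exactly the behavior Equations (3) and (4) aim for.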
In DRL-THSA, for a given direction, L_max denotes the maximum length of the buffer queue, and L(t) represents the current length of the buffer queue. Therefore, the queue occupancy rate is calculated by Equation (5):

q = L(t) / L_max  (5)

To avoid dropping packets, it is crucial to make sure that the queue is not full before the next check. The predicted queue occupancy rate is calculated by Equation (6):

p = ( L(t) + ( Î_avg − Ô_avg ) · t_c ) / L_max  (6)

In this way, two cases can be envisioned:
• p ≥ 1: packet drop may happen in the next t_c seconds. Therefore, the link state is set to congested regardless of the current queue occupancy rate.
• p < 1: the link state is determined by comparing the queue occupancy rate q with the two thresholds T_1 and T_2, which satisfy 0 < T_1 < T_2 < 1 and are adjusted to fit the predicted average input packet rate Î_avg and output packet rate Ô_avg.
The link state is marked as Free State (FS) when q is below T_1, is considered to be Busy State (BS) when q is between T_1 and T_2, and is defined as Congested State (CS) when q exceeds T_2.
To monitor and control the load effectively, if the link state is BS or CS, the satellite should send a notification including the traffic reduction ratio X to its neighbors and request them to decrease the input packet rate to Î_avg · X. When the satellite enters BS, assuming the desired time for the satellite to return to FS is t_s, the traffic reduction ratio is calculated by Equations (11) and (12):

X = ( Ô_avg · t_s + T_1 · L_max − L(t) ) / ( Î_avg · t_s )  (11)

X = min( 1, max( 0, X ) )  (12)
When the satellite enters CS, it should require its neighbors to stop transmitting packets immediately. Therefore, X is set to 0.
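The three-level link state and the throttling request can be sketched together. This is a minimal sketch: the threshold values are illustrative, and the formula for X is our reconstruction of the drain-back-to-FS condition described above.

```python
# Sketch of the three-level link state (FS/BS/CS) and the traffic reduction
# ratio X. The formula for X is a reconstruction: when the satellite enters
# the busy state, the neighbor's input rate is throttled so the queue can
# drain back below T1 * L_max within the desired time t_s.

T1, T2 = 0.3, 0.7          # occupancy thresholds (illustrative values)
L_MAX = 100                # queue capacity in packets

def link_state(p, q):
    """Classify the link given predicted (p) and current (q) occupancy."""
    if p >= 1.0 or q > T2:
        return "CS"        # congested: ask neighbors to stop (X = 0)
    if q > T1:
        return "BS"        # busy: ask neighbors to throttle to I_avg * X
    return "FS"            # free: no throttling needed

def reduction_ratio(state, L_t, out_rate, in_rate, t_s):
    """Traffic reduction ratio X requested from upstream neighbors."""
    if state == "FS":
        return 1.0
    if state == "CS":
        return 0.0
    # BS: drain from L(t) down to T1 * L_max within t_s seconds.
    x = (out_rate * t_s + T1 * L_MAX - L_t) / (in_rate * t_s)
    return max(0.0, min(1.0, x))

print(link_state(p=0.8, q=0.5))                               # BS
print(reduction_ratio("BS", L_t=60, out_rate=50.0,
                      in_rate=80.0, t_s=1.0))                 # 0.25
```

With 60 packets queued, an output rate of 50 packets/s, and an input rate of 80 packets/s, the neighbor is asked to reduce its sending rate to a quarter so the queue drains below the free-state threshold within one second.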

Two-Hops State Aware Updating
Taking a given satellite as the center, the two-hops states consist of the link states of all the ISLs within two hops, as shown in Figure 2.
In DRL-THSA, each satellite keeps both a link state table (LST) and neighbors' link state tables (NLST). The link states are stored in the format of Table 2.
To monitor the link connectedness, we adopt the HELLO packet strategy proposed in the open shortest path first routing scheme [22]. The satellite sends HELLO packets to its neighbors with period t_h. The connectedness of direction n is defined to be off if the current satellite does not receive an acknowledgment (ACK) message from that direction within t_d, and the state does not change back until HELLO packets are again received periodically. When such a change happens, the current satellite updates its LST and broadcasts the connectedness change messages to all the other neighbors.
To monitor the link state, satellites check the buffer queues of all directions with period t_c. If the current link state differs from the previous one, the current satellite updates its LST and sends link state change messages to its neighbor satellites.
When a satellite receives such a message from a neighbor, it updates its NLST according to the information contained in the message. The process of dynamic two-hops state aware updating is shown in Algorithm 1.
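The bookkeeping above can be sketched as a small class. This is a minimal sketch of the LST/NLST maintenance, not the paper's Algorithm 1; the timeout value and table layout are illustrative assumptions.

```python
# Sketch of two-hops state aware updating: each satellite keeps its own link
# state table (LST) and its neighbors' link state tables (NLST), refreshed by
# HELLO/ACK exchanges and by link-state change messages. The timeout T_D and
# the table fields are illustrative assumptions.

T_D = 3.0   # seconds without an ACK before a direction is declared "off"

class Satellite:
    def __init__(self, node_id, directions=(1, 2, 3, 4)):
        self.node_id = node_id
        self.lst = {n: {"state": "FS", "connected": True} for n in directions}
        self.nlst = {}                     # neighbor id -> that neighbor's LST
        self.last_ack = {n: 0.0 for n in directions}

    def on_ack(self, direction, now):
        """Record a HELLO acknowledgment from one direction."""
        self.last_ack[direction] = now
        self.lst[direction]["connected"] = True

    def check_connectedness(self, now):
        """Mark directions with no ACK within T_D as off; report changes."""
        changed = []
        for n, entry in self.lst.items():
            if entry["connected"] and now - self.last_ack[n] > T_D:
                entry["connected"] = False
                changed.append(n)
        return changed                     # broadcast these to all neighbors

    def on_state_message(self, neighbor_id, neighbor_lst):
        """Update the NLST from a neighbor's link-state change message."""
        self.nlst[neighbor_id] = dict(neighbor_lst)

sat = Satellite("S1")
sat.on_ack(1, now=4.0)
print(sat.check_connectedness(now=5.0))    # [2, 3, 4] time out; 1 stays on
```

Only the changed entries are broadcast, which is what keeps the two-hops update cheap compared with a global link-state flood.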

Deep Reinforcement Learning Model Setting
The satellite network topology is converted into a 2D plane, as shown in Figure 3. Reinforcement learning observes the information obtained from the satellite network topology, which functions as the environment.
In deep reinforcement learning, an agent is modeled as a four-tuple {S, A, P, R}, where S is a set of states, A is a set of actions, P is the state transition probability that represents the probability of switching from one state to another, and R is a reward function that represents the reward r received from the operating environment. Combining deep reinforcement learning with the satellite routing strategy, we construct the model {S, A, P, R} as shown in Table 3.

The state S is set to {N_s, N_d, LST}, which represents the routing source node, the destination node, and the current satellite's LST. The action A is set to N_next, which represents the decision of the next-hop satellite node. P is set to P_next, calculated by Equation (13), where m represents the number of neighbors of the current satellite node and s_i represents the number of link states of neighbor satellite i. When the packet is routed from N_s to N_d, the reward r is calculated by Equations (14) and (15), where RAAN represents the right ascension of the ascending node, ω represents the mean anomaly, and α and β are the weights of inter-plane and intra-plane ISLs. We define r_d as a high reward for success and −r_c as a punishment for a mistake.
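A hedged sketch of the reward can illustrate its structure. The exact Equations (14) and (15) are not reproduced here; this sketch assumes the common shaping implied by the text: a large reward r_d on success, a punishment −r_c for a mistake, and otherwise a per-hop cost weighted by α for inter-plane hops (which change RAAN) and β for intra-plane hops (which change the mean anomaly). All constants are illustrative.

```python
# Hedged sketch of the reward in Equations (14) and (15): r_d on reaching the
# destination, -r_c for an invalid move, and otherwise a hop cost weighted by
# alpha (inter-plane) or beta (intra-plane). Values are assumptions.

R_D, R_C = 100.0, 100.0     # success reward / mistake punishment (assumed)
ALPHA, BETA = 2.0, 1.0      # weights of inter-plane and intra-plane ISLs

def reward(next_node, dest, hop_kind, valid=True):
    """Per-step reward for routing a packet toward `dest`."""
    if not valid:
        return -R_C                        # e.g. sending into a failed link
    if next_node == dest:
        return R_D                         # packet delivered
    # Per-hop cost: inter-plane hops cost alpha, intra-plane hops cost beta,
    # so the agent prefers the cheaper ISL type when both make progress.
    return -(ALPHA if hop_kind == "inter" else BETA)

print(reward((1, 2), (1, 2), "intra"))     # 100.0
print(reward((1, 3), (4, 4), "inter"))     # -2.0
```

Negative per-hop rewards make shorter paths accumulate higher returns, which is what steers the learned policy toward low-delay routes.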

Routing Algorithm
The DDQN model observes the information of the satellite network topology through the training process. However, the training process of DDQN incurs a large overhead and takes a long time. Due to the limited resources and processing capacity on the satellite, we simulate the flows of the satellite networks and complete the DDQN training process on the ground. The offline training process enables the DDQN model to cope with all the link states that may be encountered. The trained DDQN models are then stored on the satellite and no longer updated during the satellite routing process.

Double-DQN Offline Training Process
Double-DQN uses Deep Neural Networks (DNNs) instead of a look-up table to represent all the states and actions. The inputs of the DNN are the current states, and the outputs are the Q-values of all the possible actions. We propose to use a DDQN composed of an online DNN with weights θ_online and a target DNN with weights θ_target. The DNN needs to be trained to reach a convergence state. The online DNN updates its weights θ_online at each iteration. The target DNN resets its weights θ_target to θ_online every N_target iterations and keeps θ_target fixed at the other iterations. The loss function at the current iteration is shown in Equation (16):

L_DDQN = E[ ( Y_DDQN − Q(s, a; θ_online) )² ]  (16)

where the target value Y_DDQN is defined as

Y_DDQN = r + γ · Q( s′, argmax_a′ Q(s′, a′; θ_online); θ_target )  (17)

To minimize the loss function, the weights are updated by using the experience <s, a, r, s′> to train the DNN. The DDQN executes action a by the ε-greedy policy to balance exploration and exploitation. Algorithm 2 shows the DDQN training algorithm, which uses the DDQN to find the optimal routing policy. Based on the experience e, the online DNN selects the action a′ = argmax_a′ Q(s′, a′; θ_online), and the target DNN evaluates it as Q(s′, a′; θ_target). Then, the target value Y_DDQN and the loss function L_DDQN are calculated according to Equations (17) and (16), respectively. The value of L_DDQN is used to update the weights θ_online. To ensure the stability of learning, DDQN uses an experience replay memory M to store the experiences e, and a mini-batch of N_b experiences is taken at each iteration to train the DNNs. Because the network topology environment changes with the destination node, one DDQN is trained per destination; for the whole LEO satellite network, the number of DDQNs equals the number of satellites.
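The decoupling in Equation (17) can be shown in a few lines. This is a minimal sketch: the online and target "networks" are stand-in lookup tables rather than the DNNs used in DRL-THSA, and the Q-values are made up to illustrate the update.

```python
# Sketch of the Double-DQN target of Equation (17). The online network
# selects the action; the target network evaluates it, which decouples
# selection from evaluation and reduces Q-value overestimation.
# The tables and values below are illustrative stand-ins for the DNNs.

GAMMA = 0.9

def ddqn_target(reward, next_state, q_online, q_target, actions):
    """Y_DDQN = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    best = max(actions, key=lambda a: q_online[(next_state, a)])
    return reward + GAMMA * q_target[(next_state, best)]

actions = ["up", "down", "left", "right"]
q_online = {("s1", a): v for a, v in zip(actions, [1.0, 5.0, 2.0, 0.0])}
q_target = {("s1", a): v for a, v in zip(actions, [2.0, 3.0, 9.0, 1.0])}

# Online argmax is "down" (5.0), but the target network evaluates "down" as
# 3.0 rather than taking its own (possibly overestimated) maximum of 9.0.
y = ddqn_target(reward=1.0, next_state="s1",
                q_online=q_online, q_target=q_target, actions=actions)
print(round(y, 1))   # 1.0 + 0.9 * 3.0 = 3.7
```

A plain DQN would have used the target network's own maximum (9.0) here, yielding an inflated target of 9.1; the double estimator is what stabilizes the routing Q-values during offline training.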

Double-DQN On-Board Running Process
The trained DDQN models are attached to the satellites. During the DDQN on-board running process, each satellite calculates the optimal next hop by inputting the two-hops link states into the corresponding DDQN model, and the next-hop node repeats the process until the packets arrive at the destination. In addition, DRL-THSA considers a total of four cases in LEO satellite networks: link failure, link recovery, link state change, and endless-loop routes. The routing strategy can competently cope with the dynamic changes of the satellite networks by effectively handling these cases, observing the two-hops state information, and training the DDQN. Algorithm 3 shows the workflow of the DRL-THSA routing strategy. To avoid endless loops, the routing decision is repeated until the two-hop node N_two is not equal to the current node N_c; then N_next is chosen to transmit the packet.
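The on-board decision step can be sketched as follows. This is a minimal sketch of the greedy next-hop selection with the loop-avoidance check described above; `q_model` is a stand-in for the trained DDQN's Q-value output, not the actual model interface.

```python
# Sketch of the on-board running process: the satellite queries the
# pre-trained Q-model for each neighbor and greedily picks the next hop,
# skipping a choice that would bounce the packet straight back (loop
# avoidance). `q_model` is an illustrative stand-in for the trained DDQN.

def choose_next_hop(q_model, state, neighbors, prev_node=None):
    """Return the neighbor with the highest Q-value, avoiding ping-pong."""
    ranked = sorted(neighbors, key=lambda n: q_model(state, n), reverse=True)
    for nxt in ranked:
        if nxt != prev_node:           # never route straight back
            return nxt
    return ranked[0]                   # degenerate case: single neighbor

q_model = lambda state, n: {"A": 0.9, "B": 0.7, "C": 0.1}[n]
print(choose_next_hop(q_model, state=(), neighbors=["A", "B", "C"],
                      prev_node="A"))   # "B"
```

Since the model is frozen on board, each forwarding decision is a single inference pass over the two-hops link states, with no routing-table recomputation.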
Traditional calculation strategies such as Dijkstra require too many computational resources because of global routing table updating, even though there may be only one link state change. DRL-THSA makes full use of the two-hops link state information, which is partially updated; this significantly reduces the updating overhead when only a few link states change. However, it is not applicable to networks where a destructive number of links are disconnected. In addition, there is no need for DRL-THSA to recalculate when link states change, owing to the fact that the DDQN has already learned how to route in all these cases.

Parameters Setup
To evaluate the proposed DRL-THSA, we use NS-3.29 (Network Simulator 3, Version 3.29) as the simulation tool to construct the simulations in an Iridium-like satellite network with 66 satellites distributed over six planes. Except for the satellites along the seam, where cross-seam ISLs cannot be built, each satellite maintains two inter-plane ISLs and two intra-plane ISLs. Intra-plane ISLs stay connected all the time, while inter-plane ISLs only work outside the polar areas [23]. The capacity of ISLs is set to 25 Mbps. The average packet size is set to 1 KB, and the queue length is set to 100 packets. We utilize 200 On-Off flows, and the On-Off period of each flow follows a Pareto distribution with a shape of 1.5. The average burst and idle times are both set to 500 ms. The traffic load can be controlled by adjusting the data transmission rate of the sources or the number of flows. The main system parameters are shown in Table 4. Based on experience, the greedy value ε is set to 0.9. All the simulations are run for 60 s, equivalent to that in [9]. All scenarios are run 100 times, and the average values are taken as the final results. We evaluate our routing model, the ELB [9], the TLR [10], and the Extreme Learning Machine-based distributed routing (ELMDR) [12] under the same scenario for comparison from three aspects: average end-to-end delay, packet drop rate, and system throughput.

End-to-End Delay
The total delay of packets that arrived at their destination is recorded. To measure the performance of DRL-THSA under different traffic conditions, different individual data transmission rates and numbers of flows are used in the simulation. When the individual data transmission rate varies from 2.5 Mbps to 3.5 Mbps, the number of flows is fixed at 200, to evaluate the system performance when the number of flows is unchanged and the transmission rate changes. When the number of flows increases from 200 to 300, the individual data transmission rate is fixed at 3.5 Mbps, to evaluate the system performance when the transmission rate is unchanged and the number of flows changes. The average end-to-end delay for different transmission rates is shown in Figure 4, and the average end-to-end delay for different numbers of flows is shown in Figure 5.
It can be seen that the average end-to-end delay of DRL-THSA is smaller than that of ELB, TLR and ELMDR. One reason is that DRL-THSA filters out the impact of short-term light traffic load while counting the average input and output rates. ELMDR needs to discover the routing path and pass it back to the source node through a mobile agent; therefore, when the transmission rate increases, the link states along the routing path may change from idle to congested, so the end-to-end delay of the data packets increases. Since TLR and ELB do not consider the impact of short-term traffic fluctuations on routing computation, a short-term light traffic load will lead them to choose such a link; the packets may then be sent to the short-term light-load nodes and increase the congestion level, which brings a longer average end-to-end delay. Another reason is that DRL-THSA alternates paths according to the route state within two hops, which avoids additional queuing delay. The routing strategy is determined by the pre-trained DDQN model: the more accurate the estimation of the average traffic rate, the more accurate the state input to the DDQN, which provides a more optimal route than TLR and ELB. In Figure 5, the average end-to-end delay of DRL-THSA increases with the number of flows: since node congestion is avoided where possible, packets are transmitted along alternative routing paths, which increases the average end-to-end delay.

Packet Drop Rate
The performance of DRL-THSA is evaluated by the total packet drop rate. The total packets are recorded to obtain the drop rate. Figures 6 and 7 show the total packet drop rate for different transmission rates and different numbers of flows. It should be noted that the links between satellites are assumed to be error-free; thus, packets are dropped only when the satellite's queue buffer is full, in other words, when the satellite is congested. Furthermore, when the time to live (TTL) of a packet decreases to zero, the packet is dropped. It can be seen that the packet drop rate of DRL-THSA is lower than that of ELB, TLR and ELMDR. ELMDR has the highest packet drop rate, because the routing information it passes back under a high traffic load may be outdated. At the same time, the packet drop rate of ELB is higher than that of TLR, because the congestion at the current hop is not considered in ELB, so packets might be dropped before sending. In DRL-THSA, the DDQN considers the node state within two hops and routes the packets along the optimal path; thus, DRL-THSA can be seen as a dynamic optimal routing strategy that avoids congestion before it occurs. TLR only considers the node state within one hop, and congestion might occur at the next hop when that hop cannot find a suitable node to route the packets. In Figure 7, the packet drop rate increases with the number of flows, because more packets are routed to sub-optimal next hops, so the TTL of packets is more likely to decrease to zero.

Throughput
Figures 8 and 9 show that DRL-THSA has the highest throughput among the compared routing strategies. The traffic load is balanced among all the satellites in the DRL-THSA strategy, resulting in higher throughput than that of ELB, TLR and ELMDR.

Traffic Distribution Index
The traffic distribution index in [10] is used to investigate how well the traffic is distributed over the entire constellation, which can be expressed as

$$\frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n \sum_{i=1}^{n} x_i^2} \qquad (18)$$

where $n$ is the number of ISLs and $x_i$ represents the actual number of packets that traversed the $i$-th ISL. The higher the value of the traffic distribution index, the better the traffic is distributed over the entire constellation. Figures 10 and 11 show the traffic distribution index performance of the four routing strategies. It can be seen that the DRL-THSA outperforms the ELB, the TLR, and the ELMDR.

In order to verify that DRL-THSA can alleviate congestion and reduce queueing delays, the average queue occupancy of each satellite is shown in Figure 12. The simulations are performed for the case where the individual data transmission rate is fixed at 3.5 Mbps and the number of flows is fixed at 300. It can be seen that the DRL-THSA obtains the lowest average queue occupancy, which means that congestion is alleviated throughout the network. There are two reasons why DRL-THSA achieves a more uniform traffic distribution than ELB, TLR, and ELMDR. The first reason is that the DRL-THSA filters out short-term light traffic load. The second reason is that the ε-greedy value of the DDQN is set to 0.9, so DRL-THSA has a chance to explore the next-hop node autonomously, thus further diverting the traffic flow.
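Equation (18) has the form of the Jain fairness index, which reaches its maximum of 1 when every ISL carries the same number of packets. As a minimal sketch (assuming the index in [10] is exactly this form; the function name is our own), it can be computed as follows:

```python
def traffic_distribution_index(x):
    """Traffic distribution index over per-ISL packet counts.

    x: list of packet counts x_i for the n ISLs.
    Returns a value in (0, 1]; 1 means perfectly uniform traffic.
    """
    n = len(x)
    total = sum(x)
    sum_sq = sum(v * v for v in x)
    if sum_sq == 0:
        return 1.0  # no traffic at all: trivially uniform
    return (total * total) / (n * sum_sq)

# Perfectly balanced traffic gives the maximum index of 1.0
print(traffic_distribution_index([100, 100, 100, 100]))  # 1.0
# Traffic concentrated on a single ISL lowers the index to 1/n
print(traffic_distribution_index([400, 0, 0, 0]))        # 0.25
```

The two extremes illustrate why a higher index corresponds to a better-balanced constellation: uniform loads maximize the numerator relative to the denominator, while concentrating all traffic on one ISL drives the index toward $1/n$.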

Conclusions
In this paper, a Two-Hops State-Aware Routing Strategy Based on Deep Reinforcement Learning (DRL-THSA) for LEO satellite networks is presented. In DRL-THSA, we propose a mechanism to evaluate link states for adapting to dynamic traffic in satellite networks and put forward a two-hops state-aware strategy to update the real-time link states. When a node's link state changes, it broadcasts the change to its neighbors so that they can update their LSTs. To capture the information contained in the LEO satellite network topology, we train DDQN models in a simulated routing environment. Based on the training efficiency and the convergence degree of the network, the ε-greedy value of the DDQN is chosen as 0.9. Combined with the two-hops state-aware strategy, our models can figure out the optimal next hop for routing packets, and they can also handle situations including link failure, link recovery, link state change, and endless-loop routes. Simulation results demonstrate that DRL-THSA performs well in terms of end-to-end delay, throughput, packet drop rate, and traffic distribution index. In future research, we will study the impact of the deep learning network structure and parameter settings on routing strategy performance.
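The ε-greedy value of 0.9 mentioned above means the agent acts greedily most of the time while occasionally exploring an alternative neighbor. A minimal sketch of this selection rule (the function name and Q-value layout are illustrative, not from the paper; here 0.9 is interpreted as the probability of acting greedily):

```python
import random

def select_next_hop(q_values, greedy_prob=0.9):
    """Epsilon-greedy next-hop selection for a routing agent.

    With probability greedy_prob, the candidate neighbor with the
    highest Q-value is chosen (exploit); otherwise a neighbor is
    drawn uniformly at random (explore), which diverts some flows
    away from the currently preferred path.

    q_values: dict mapping candidate next-hop node id -> Q-value.
    """
    if random.random() < greedy_prob:
        return max(q_values, key=q_values.get)  # exploit best-known hop
    return random.choice(list(q_values))        # explore another hop

# Example: three candidate next-hop satellites with estimated Q-values
q = {"sat_up": 0.2, "sat_down": 0.8, "sat_left": 0.5}
print(select_next_hop(q, greedy_prob=1.0))  # sat_down
```

The occasional random choice is what gives DRL-THSA "a chance to explore the next-hop node autonomously": even when one neighbor dominates the Q-values, roughly 10% of decisions sample other neighbors, spreading load across the constellation.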