Research on Resource Allocation Method of Space Information Networks Based on Deep Reinforcement Learning

Space information networks (SINs) are strongly heterogeneous, comprise multiple types of resources, and are difficult to manage. To address the problem of resource allocation in the SIN, this paper first establishes a hierarchical and domain-controlled SIN architecture based on software-defined networking (SDN). On this basis, the transmission, caching, and computing resources of the whole network are managed uniformly. The Asynchronous Advantage Actor-Critic (A3C) algorithm from deep reinforcement learning is introduced to model the process of resource allocation. The simulation results show that the proposed scheme can effectively improve the expected benefit per unit resource and the resource utilization efficiency of the SIN.


Introduction
At present, with the deepening of space science exploration and the continuous development of space information technology, the construction of space information systems is growing explosively. However, the various space information systems are still built separately, which leads to repeated construction and "chimney-like development". Navigation, communication, remote-sensing, and other satellites occupy a large amount of orbital resources. Once a single satellite system has completed a given task, it is often idle, which wastes space resources [1]. The space information network (SIN) was proposed as a solution to these problems, and it has become a research hotspot worldwide [2].
The SIN is a network system that acquires, transmits, and processes spatial information in real time on various space platforms (such as synchronous satellites or mid-orbit satellites, stratospheric balloons, and manned or unmanned aerial vehicles) [3]. Compared with the ground network, the SIN plays an irreplaceable role in earth observation, emergency communication, air transportation, space TT&C, and the expansion of national strategic interests [4]. Compared with the traditional satellite network, the SIN has a series of characteristics such as complex structure, dynamic topology change, large cross-domain spatial scale, and so on. Therefore, we need to build an efficient SIN architecture to realize the effective allocation and management of multi-dimensional resources in the SIN, which is of great significance for the construction of the SIN [5].
Software-defined networking (SDN) is a new network architecture in which control is separated from data forwarding and the network is software-programmable. SDN adopts a centralized control plane and a distributed forwarding plane. The control plane uses the developed control and forwarding
In conclusion, the space-net-Earth network architecture can make full use of the wide-area coverage and the abundant transmission and processing capability of the space-based network, and it reduces the complexity and cost of the system technology; it is therefore a more appropriate reference for the construction of the SIN.
For the resource scheduling problem of the SIN, current research mainly focuses on the resource scheduling of GEO data relay satellites. Adinolfi used a backtracking heuristic algorithm to solve the resource scheduling problem of the GEO data relay satellite for the European Space Station [16]. Rojanasoonthon studied the scheduling problem with two visual time windows for the tracking and data relay satellite system (TDRSS) of the United States [17]. Gu analyzed the resource and task constraints in the scheduling process of the GEO data relay satellite, and established a scheduling model for it [18]. Existing work does not yet address an overall resource allocation method for the different types of nodes in the SIN.

SDN-Based Space Information Networks
Based on the advantages of SDN technology, some scholars and research institutes proposed its application in the SIN. The related research is still in its infancy, mainly focusing on the research of architecture and routing algorithms. Researchers at the Centre National de la Recherche Scientifique (CNRS) and Université de Toulouse studied handover decision-making algorithms in satellite networks through SDN's programmability [19]. Joint researchers from the Polytechnic University of Catalonia and the Greek National Research Center are exploring the introduction of SDN technology into satellite networks. SDN technology is used to improve the satellite network infrastructure, so as to improve the joint service capability of ground and satellite networks and hybrid access service capability [20]. Researchers from Hughes Network Systems Inc. of the United States directly proposed a software-defined satellite network (SDSN) architecture, and applied it to their SPACEWAY system (a new generation of broadband satellite communication system). By establishing modules and performance objects, the extended allocation of inter-satellite packet routing addresses and the resource management control in the controller were realized [21]. Reference [22] proposes a networking architecture for on-board switching systems based on SDN. SDN on-board switching system can effectively reduce the load of traditional on-board switching systems, optimize the utilization of satellite channel resources, and improve the quality of service support capability of satellite communication networks. References [23,24] analyzed the routing algorithm of SDN-based SIN.
SIN includes a large number of heterogeneous nodes such as satellites, caches, mobile edge computing (MEC) servers, and so on. The allocation of the SIN resources involves the allocation of multi-dimensional resources such as transmission, caching, and computing. It is necessary to make overall considerations to achieve the maximum effective use of the SIN resources.

System Model
In this section, we firstly establish an SDN-based SIN architecture. On this basis, this article takes the use of LEO communication satellites and GEO data relay satellites for information transmission as an example; the network model, satellite coverage and transmission model, communication link model, caching model, and computing model of the SIN are analyzed.

Overall Networking Architecture
Based on the core idea of SDN, this paper establishes a hierarchical and domain-controlled SIN architecture, whose overall network architecture is shown in Figure 1. From a hierarchical point of view, the SIN architecture is divided into three parts: space-based, air-based, and ground-based. The space-based network is mainly composed of satellites with different orbits, which are geostationary orbit satellites (GEO), medium orbit satellites (MEO), and low orbit satellites (LEO) from far to near. Air-based networks include stratospheric airships, balloons, manned or unmanned aerial vehicles, etc. Ground-based networks are mainly composed of gateway base stations, caches, and MEC servers, as well as space-based network controllers, ground-based network controllers, and an SIN resource management and scheduling center. Because the resource management and scheduling center of the SIN plays an important role, a backup center should be set up.
From the point of view of sub-domain, the ground-based network, space-based network, and air-based network are divided into several domains according to the region; each domain is controlled by a network controller. Among them, there are three kinds of network controllers, space-based network controllers, air-based network controllers, and ground-based network controllers, which control space-based networks, air-based networks, and ground-based networks, respectively. In order to make full use of the global coverage capability of GEO satellites and the high-speed computing capability of ground controllers, space-based network controllers are divided into space-based network controllers on the ground and space-based network controllers on the GEO satellite. Space-based network controllers on the ground are responsible for computing and storing large amounts of data and other complex functions. Space-based network controllers on the GEO satellite are responsible for collecting global views, completing simple routing storage, distributing flow tables, and other functions. Each network controller constitutes a single-domain controller, and multiple single-domain controllers are uniformly controlled by the SIN resource management and scheduling center [25].

Network Control Architecture
Based on the structure of SDN, the control architecture of the SIN is divided into three layers: application layer, control layer, and infrastructure layer [26]. The top layer is the application layer, which refers to a series of space tasks such as emergency communication and deep space exploration completed by the SIN. At the bottom is the infrastructure layer, which refers to satellites in different orbits, stratospheric vehicles, gateway base stations, and so on. In the middle is the control layer, which is composed of network controllers and the SIN resource management and scheduling center. The hierarchical and domain-based control structure of the SIN is shown in Figure 2.
In the control layer, the single-domain controller collects the topological information of each node in the domain. When the intra-domain traffic arrives, the single-domain controller calculates the intra-domain links, and controls the nodes by downloading the flow table, so as to realize path building and service processing. The SIN resource management and scheduling center is responsible for the control and allocation of the whole-network resources. It obtains the domain topology resources from the single-domain controller and establishes the whole-network topology. When cross-domain service arrives, it is responsible for cross-domain path calculation to realize cross-domain service transmission. In addition, due to the heterogeneity of different inter-domain networks, the SIN resource management and scheduling center is also responsible for the unification of heterogeneous device interfaces to achieve cross-domain interconnection of heterogeneous devices.
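The division of labour described above can be illustrated with a minimal sketch. It assumes that each single-domain controller reports its intra-domain links and that the scheduling center merges them with the inter-domain links before computing a cross-domain path. The class names, node names, and the use of breadth-first search are hypothetical choices for illustration, not the paper's implementation.

```python
from collections import deque

class DomainController:
    """Single-domain controller: knows only the links inside its own domain."""
    def __init__(self, name, links):
        self.name = name
        self.links = list(links)  # undirected (node_a, node_b) pairs

class SchedulingCenter:
    """Hypothetical stand-in for the SIN resource management and scheduling center."""
    def __init__(self, controllers, inter_domain_links):
        self.graph = {}
        for ctrl in controllers:          # collect each domain's topology
            for a, b in ctrl.links:
                self._add(a, b)
        for a, b in inter_domain_links:   # stitch the domains together
            self._add(a, b)

    def _add(self, a, b):
        self.graph.setdefault(a, set()).add(b)
        self.graph.setdefault(b, set()).add(a)

    def path(self, src, dst):
        """Breadth-first search: fewest-hop path, or None if unreachable."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                hops = []
                while node is not None:
                    hops.append(node)
                    node = parent[node]
                return hops[::-1]
            for nxt in sorted(self.graph.get(node, ())):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None

space = DomainController("space", [("leo1", "leo2"), ("leo2", "geo1")])
ground = DomainController("ground", [("gw1", "mec1")])
center = SchedulingCenter([space, ground], [("geo1", "gw1")])
route = center.path("leo1", "mec1")
```

A real controller would weigh links by capacity or delay rather than hop count; the sketch only shows why a whole-network view is needed for cross-domain services.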
In the SDN-based SIN management architecture, the northbound and southbound interfaces play an important role. The northbound interface is a series of interfaces between the application layer and the control layer. There is no unified standard for its interface protocol. Therefore, the control layer provides many extensible application programming interfaces (APIs) for different users in the application layer, and each API corresponds to an application; thus, the control architecture can implement a variety of application services. A typical southbound protocol is OpenFlow [27], which is responsible for the interaction between the control layer and the underlying switches to complete the forwarding of infrastructure layer data. In the OpenFlow protocol, an OpenFlow switch can connect to multiple network controllers; however, at any given time, only one controller has control over it, and the other controllers have read-only access. In the SDN-based SIN management architecture, all switches in a single domain can only be managed by the single-domain controller of that domain.
Remote Sens. 2019, 11, 448

Network Model
The SIN resource management and scheduling center and the single-domain controllers realize the dispatching of various resources. This paper takes an LEO communication satellite and GEO data relay satellite as examples to analyze. Let la, lga, ca, ma, and ua represent the LEO communication satellite, GEO data relay satellite, cache device, MEC server, and user in the underlying physical resources, respectively. Let la = {1, . . . , L}, lga = {1, . . . , Lg}, ca = {1, . . . , C}, ma = {1, . . . , M} and ua = {1, . . . , U}, where L, Lg, C, M, and U represent the number of LEO satellites, GEO data relay satellites, caches, MEC servers, and users, respectively [28].

LEO Satellite Coverage Model
LEO satellite can only cover users in a certain time and space range to complete the transmission of information. The geometric relationship between LEO satellite and user is shown in Figure 3.
In Figure 3, $O$ is the center of the Earth, $R_e$ is the Earth's radius, $h$ is the orbit altitude of the LEO satellite, and $P$ represents the ground user. At time $t_0$, the elevation of the ground user reaches its maximum $\theta_{max}$, and the LEO satellite position and its sub-satellite point are $S$ and $M$. At time $t$, the LEO satellite position and its sub-satellite point are $S'$ and $N$. Furthermore, $\gamma(t_0)$, $\gamma(t)$, and $\psi(t)$ represent the geocentric angles between $P$ and $M$, $P$ and $N$, and $M$ and $N$, respectively. In addition, $\theta(t)$ represents the elevation of the ground user at time $t$; $\theta_{min}$ is the minimum elevation of the ground user, and the corresponding maximum geocentric angle is $\gamma_{max}$.

According to the spherical triangle $PMN$ and the triangle $OPS$ shown in Figure 3, we can obtain the following:

$$\cos\gamma(t) = \cos\gamma(t_0)\cos\psi(t), \tag{1}$$

$$\gamma(t) = \arccos\left(\frac{R_e}{R_e + h}\cos\theta(t)\right) - \theta(t). \tag{2}$$

The effective coverage time $t_c$ of the LEO satellite to ground users is

$$t_c = \frac{2}{\omega}\arccos\left(\frac{\cos\gamma_{max}}{\cos\gamma(t_0)}\right), \tag{3}$$

where $\omega = \omega_s - \omega_e\cos i_0$ is the angular velocity of the satellite in the Earth-centered, Earth-fixed (ECEF) coordinate system, $\omega_s$ is the angular velocity of the satellite in the Earth-centered inertial (ECI) coordinate system, $\omega_e$ is the angular velocity of the Earth's rotation in ECI, and $i_0$ is the orbital inclination.
Ground users are randomly distributed. We assume that the distance from the ground user to the sub-satellite point obeys a uniform distribution (Equation (4)). According to Equations (3) and (4), the cumulative distribution function of the coverage time $t_c$ is obtained as Equation (5), where $T_m$ represents the maximum effective coverage time of the satellite to ground users. When $\gamma(t_0) = 0$, according to Equation (3), we get

$$T_m = \frac{2\gamma_{max}}{\omega}. \tag{6}$$

According to Equation (5), the probability density function of the coverage time is obtained as Equation (7). According to Equation (7), the average coverage time $E(t_c)$ of the LEO satellite to ground users follows as Equation (8) [29].
The elevation $\theta^l_u$ between user $u$ and LEO satellite $l$ is given by Equation (9), where

$$\cos\Theta = \cos(u_{lo} - l_{lo})\cos u_{la}\cos l_{la} + \sin u_{la}\sin l_{la}. \tag{10}$$

In Equation (10), $u_{lo}$ and $u_{la}$ represent the longitude and latitude of the user, respectively, while $l_{lo}$ and $l_{la}$ represent the longitude and latitude of the LEO satellite, respectively.
When the LEO satellite flies around the equator, $l_{la} = 0$. When the user and the satellite also have the same longitude, $u_{lo} = l_{lo}$, the elevation is at its maximum, and Equation (10) simplifies to $\cos\Theta = \cos u_{la}$. From this, the maximum value $\theta^l_{u,max}$ of $\theta^l_u$ within the average coverage time $E(t_c)$ is obtained. To ensure that the elevation increases monotonically, we set $\Omega$ as the elevation of the LEO satellite measured from the horizon to the user, which is a function of $\theta^l_u$. The maximum value of $\Omega$ is $\Omega_{max} = 2\theta^l_{u,max}$. In this model, the smaller $\Omega$ is, the longer the remaining LEO satellite coverage time, and the more time the LEO satellite has to transmit, cache, and compute information with users. The larger $\Omega$ is, the shorter the remaining coverage time, and the less time the LEO satellite has to transmit, cache, and compute information with users.
Because there are many LEO satellites in the SIN, we cannot determine in advance which LEO satellite is connected to the user, nor the elevation between user $u$ and satellite $l$ at the next moment. Therefore, we model the elevation angle between user $u$ and satellite $l$ as a random variable $\Omega^l_u$. The value range of $\Omega^l_u$ is divided into $Y$ segments, each of which corresponds to one state of a Markov chain with state set $y = \{y_0, y_1, \ldots, y_{Y-1}\}$. The elevation state of user $u$ and LEO satellite $l$ at time $t$ is expressed as $w^l_u(t)$, $t \in \{0, 1, 2, \ldots, T-1\}$. There are $T$ time slots in total, representing the time from the user's request to the user's receiving and processing of the information. Based on a certain transition probability, $w^l_u(t)$ transfers from one state to another. The probability of transition from state $S11$ to state $S12$ is expressed as $\kappa_{S11S12}(t)$. We obtain a $Y \times Y$ elevation state transition probability matrix between user $u$ and LEO satellite $l$ whose entries are $\kappa_{S11S12}(t) = \Pr\left(w^l_u(t+1) = S12 \mid w^l_u(t) = S11\right)$, $S11, S12 \in y$.
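As a rough numerical illustration of the coverage model, the sketch below evaluates the coverage time $t_c = (2/\omega)\arccos(\cos\gamma_{max}/\cos\gamma(t_0))$ and averages it over passes. It assumes that $\gamma(t_0)$ is uniformly distributed on $[0, \gamma_{max}]$, which is one reading of the uniform-distance assumption, and it uses illustrative orbit values (780 km altitude, 86.4 degree inclination, 10 degree minimum elevation) that are not taken from the paper. The function names are hypothetical.

```python
import math

def max_geocentric_angle(theta_min, r_e, h):
    """Geocentric angle at the edge of coverage, from triangle O-P-S."""
    return math.acos(r_e / (r_e + h) * math.cos(theta_min)) - theta_min

def coverage_time(gamma_0, gamma_max, omega):
    """Visibility duration of a pass whose closest geocentric angle is gamma_0."""
    if gamma_0 >= gamma_max:
        return 0.0  # the user never enters the coverage area
    return 2.0 / omega * math.acos(math.cos(gamma_max) / math.cos(gamma_0))

def mean_coverage_time(gamma_max, omega, samples=10_000):
    """Midpoint-rule average over gamma_0 ~ Uniform(0, gamma_max) (assumed)."""
    step = gamma_max / samples
    total = sum(coverage_time((k + 0.5) * step, gamma_max, omega)
                for k in range(samples))
    return total / samples

R_E = 6371e3            # Earth radius in metres
H = 780e3               # illustrative LEO altitude in metres
MU = 3.986004418e14     # Earth's gravitational parameter, m^3/s^2
OMEGA_E = 7.2921159e-5  # Earth rotation rate, rad/s

omega_s = math.sqrt(MU / (R_E + H) ** 3)                   # circular-orbit rate (ECI)
omega = omega_s - OMEGA_E * math.cos(math.radians(86.4))   # rate in ECEF
gamma_max = max_geocentric_angle(math.radians(10.0), R_E, H)
t_max = coverage_time(0.0, gamma_max, omega)               # overhead pass
t_mean = mean_coverage_time(gamma_max, omega)
```

For an overhead pass the result reduces to $2\gamma_{max}/\omega$, and the average over all passes is necessarily shorter than that maximum.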

GEO Data Relay Satellite Transmission Model
Due to the limited transmission capacity of the LEO communication satellite, it cannot meet the user's all-weather real-time transmission requirements. Therefore, the relay transmission mode of the GEO data relay satellite and LEO satellite will become an important part of the SIN [30].
We assume that the LEO satellite contains $I$ tasks, arranged in descending order of importance, so that task $i$ is the $i$-th most important task. The request rate of task $i$ at time $t$ is given by Equation (15). The arrival process of task $i$ obeys a Poisson distribution with parameter $\varpi$. The content of task requests satisfies a Zipf-like distribution: the probability of task $i$ is $1/(\rho i^{\alpha})$, where $\rho = \sum_{i=1}^{I} 1/i^{\alpha}$ and $\alpha$ is the Zipf slope. We do not know in advance whether task $i$ requires transmission via a GEO data relay satellite. Therefore, we model whether task $i$ is transmitted by the GEO relay satellite as a random variable $\wp_i$. If task $i$ does not require relay satellite transmission, then $\wp_i = 0$; otherwise, $\wp_i = 1$, which constitutes a Markov chain with the two states $\{0, 1\}$. The transmission state at time $t$ can be expressed as $\wp_i(t)$, $t \in \{0, 1, 2, \ldots, T-1\}$. According to a certain transition probability, the transmission state $\wp_i(t)$ transfers from one state to another. Let $J_{S21S22}(t)$ denote the probability of transition from state $S21$ to state $S22$; then, the transition probability matrix of $\wp_i(t)$ is the $2 \times 2$ matrix with entries $J_{S21S22}(t) = \Pr\left(\wp_i(t+1) = S22 \mid \wp_i(t) = S21\right)$, $S21, S22 \in \{0, 1\}$.
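The Zipf-like task popularity can be made concrete with a short sketch. It uses the form $1/(\rho i^{\alpha})$ with $\rho = \sum_{i=1}^{I} 1/i^{\alpha}$ that the caching model states; the function name and the example values of $I$ and $\alpha$ are assumptions for illustration.

```python
def zipf_probabilities(num_tasks, alpha):
    """Return [1 / (rho * i**alpha) for i = 1..I], with rho = sum_i 1 / i**alpha."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha is expected in (0, 1]")
    weights = [1.0 / i ** alpha for i in range(1, num_tasks + 1)]
    rho = sum(weights)                     # normalising constant
    return [w / rho for w in weights]

probs = zipf_probabilities(num_tasks=5, alpha=0.8)
```

Because the tasks are ordered by importance, the probabilities are decreasing in $i$ and sum to one.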

Communication Link Model
According to Reference [33], the main models of the satellite communication channel are the C. Loo model, the Corazza model, and the Lutz model. The C. Loo model is mainly suitable for rural environments. The received signals are mainly composed of shadowed direct signal components and multi-path signal components that are not shadowed. The Corazza model is applicable to all environments (roads, villages, cities, etc.). The signals received by users are affected by shadowing. The Lutz model divides the channel environment between satellite and user into good and bad states. In the good state, there is no shadowing effect. In the bad state, there is no direct signal component. The above three models are denoted model X, model Y, and model Z, respectively. The three main propagation models of satellite communication links are shown in Figure 4.
We assume that the probabilities of the satellite link transmission models X, Y, and Z are $p_X$, $p_Y$, and $p_Z$, respectively. From this, we get a three-element model space $S = \{S_X, S_Y, S_Z\}$. The state transition probability matrix $\Lambda$ between the three models is

$$\Lambda = \begin{pmatrix} P_{XX} & P_{XY} & P_{XZ} \\ P_{YX} & P_{YY} & P_{YZ} \\ P_{ZX} & P_{ZY} & P_{ZZ} \end{pmatrix},$$

where the entries are determined by $\Delta t$, the smallest unit of time for a state transition between two transmission models, and by $\langle\Gamma_X\rangle$, $\langle\Gamma_Y\rangle$, and $\langle\Gamma_Z\rangle$, the average durations of model states X, Y, and Z, respectively.
We assume that the transmission link between satellite and user is time-varying and can be modeled as a finite-state Markov chain. In this model, the quality of the channel is expressed as the signal-to-noise ratio (SNR) of the signal received by the user. We model the SNR of the signal received by user $u$ from LEO satellite $l$ as the random variable $h^l_u$. The value range of $h^l_u$ is divided into $L$ segments, each of which corresponds to one state of the Markov chain, with state set $H = \{H_0, H_1, \ldots, H_{L-1}\}$.
At time $t$, the SNR of the signal received by user $u$ from LEO satellite $l$ is $h^l_u(t)$, where $t \in \{0, 1, 2, \ldots, T-1\}$. According to a certain transition probability, the SNR $h^l_u(t)$ transfers from one state to another. Let $\gamma_{S31S32}(t)$ denote the probability of transition from state $S31$ to state $S32$. The state transition probability matrix of the transmission channel between user $u$ and LEO satellite $l$ can be expressed as an $L \times L$ matrix with entries $\gamma_{S31S32}(t) = \Pr\left(h^l_u(t+1) = S32 \mid h^l_u(t) = S31\right)$, $S31, S32 \in H$.
We assume that the available spectrum bandwidth of LEO satellite $l$ is $B^l$ Hz, of which $B^l_u$ Hz is allocated to user $u$. The available return capacity of satellite $l$ is $Z^l$ bps. User $u$'s spectrum utilization at time $t$ is $v^l_u(t)$. Then, the communication rate between user $u$ and LEO satellite $l$ is $ComR^l_u(t) = a^l_u(t) B^l_u v^l_u(t)$, subject to $\sum_{u \in ua} ComR^l_u(t) \le Z^l$, $\forall l \in la$, where $a^l_u(t)$ indicates whether user $u$ is connected to LEO satellite $l$: $a^l_u(t) = 1$ indicates that user $u$ is connected to LEO satellite $l$; otherwise, $a^l_u(t) = 0$.
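A finite-state Markov channel of this kind can be simulated in a few lines. The sketch below is only an illustration: the three SNR levels, the transition matrix, the spectral efficiency per level, and the rate form (connection indicator times allocated bandwidth times spectrum utilization) are all assumed values and names, and the transition probabilities are taken as constant over time for simplicity.

```python
import random

def next_state(current, transition, rng):
    """Sample the next state index from row `current` of a row-stochastic matrix."""
    return rng.choices(range(len(transition)), weights=transition[current], k=1)[0]

def simulate_chain(start, transition, steps, rng):
    """Return a state trajectory of length `steps`, beginning at `start`."""
    states = [start]
    for _ in range(steps - 1):
        states.append(next_state(states[-1], transition, rng))
    return states

def communication_rate(connected, bandwidth_hz, spectral_efficiency):
    """Assumed form: rate in bit/s = a * B * v."""
    return connected * bandwidth_hz * spectral_efficiency

# Hypothetical 3-level SNR chain (poor, fair, good); only neighbouring
# levels are reachable in one slot.
P = [[0.7, 0.3, 0.0],
     [0.2, 0.6, 0.2],
     [0.0, 0.3, 0.7]]
EFFICIENCY = [0.5, 1.0, 2.0]   # assumed bit/s/Hz for each SNR level

rng = random.Random(0)
snr_states = simulate_chain(start=1, transition=P, steps=10, rng=rng)
rates = [communication_rate(1, 1e6, EFFICIENCY[s]) for s in snr_states]
```

Summing such rates over the users of one satellite and comparing the total with the return capacity gives the feasibility check of the capacity constraint.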

Caching Model
Based on the analysis of Section 3.3.2, users in the SIN have $I$ tasks, arranged in descending order of importance, so that task $i$ is the $i$-th most important task. The request rate of task $i$ at time $t$ is given by Equation (15). The arrival process of task $i$ obeys a Poisson distribution with parameter $\varpi$. The content of task requests satisfies a Zipf-like distribution: the probability of task $i$ is $1/(\rho i^{\alpha})$, where $\rho = \sum_{i=1}^{I} 1/i^{\alpha}$, $\alpha$ is the Zipf slope, and $0 < \alpha \le 1$ [34]. We cannot determine in advance whether task $i$ is cached. Therefore, we model whether task $i$ is cached as a random variable $\varsigma_i$. If task $i$ is not cached, then $\varsigma_i = 0$; otherwise, $\varsigma_i = 1$, which constitutes a Markov chain with the two states $\{0, 1\}$. The cache state at time $t$ can be expressed as $\varsigma_i(t)$, $t \in \{0, 1, 2, \ldots, T-1\}$. According to a certain transition probability, the cache state $\varsigma_i(t)$ transfers from one state to another. Let $J_{S41S42}(t)$ denote the probability of transition from state $S41$ to state $S42$; then, the transition probability matrix $\Phi_i(t)$ is the $2 \times 2$ matrix with entries $J_{S41S42}(t) = \Pr\left(\varsigma_i(t+1) = S42 \mid \varsigma_i(t) = S41\right)$, $S41, S42 \in \{0, 1\}$.
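To see what the two-state cache chain implies in the long run, the sketch below iterates the state distribution and compares it with the closed-form stationary probability of the cached state. It assumes transition probabilities that do not change with time, which is a simplification of the time-dependent probabilities in the model; the parameter names and values are hypothetical.

```python
def stationary_cached_fraction(p_cache, p_evict):
    """Long-run probability of state 1 (cached) for a two-state chain.

    p_cache: Pr(state 1 at t+1 | state 0 at t)
    p_evict: Pr(state 0 at t+1 | state 1 at t)
    """
    if p_cache + p_evict == 0:
        raise ValueError("chain never changes state; no unique stationary distribution")
    return p_cache / (p_cache + p_evict)

def step_distribution(dist, p_cache, p_evict):
    """Apply one transition to the distribution (Pr[state 0], Pr[state 1])."""
    p0, p1 = dist
    return (p0 * (1 - p_cache) + p1 * p_evict,
            p0 * p_cache + p1 * (1 - p_evict))

dist = (1.0, 0.0)                      # the task starts uncached
for _ in range(200):
    dist = step_distribution(dist, p_cache=0.3, p_evict=0.1)
limit = stationary_cached_fraction(0.3, 0.1)
```

With these example values the task is cached three quarters of the time in steady state, regardless of the initial state.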

Computing Model
Let user $u$ have computing task $T_u = \{o_u, n_u\}$, where $o_u$ represents the size of the task content, and $n_u$ represents the number of central processing unit (CPU) cycles required to complete the task. Because there are multiple users and MEC servers, the computing power allocated to user $u$ cannot be known in advance. Therefore, a random variable $\Xi^m_u$ is established to represent the computing power that MEC server $m$ assigns to user $u$. $\Xi^m_u$ is divided into $M$ discrete intervals, $\Pi = \{\Pi_0, \Pi_1, \ldots, \Pi_{M-1}\}$. The computing state at time $t$ can be expressed as $\Xi^m_u(t)$, $t \in \{0, 1, 2, \ldots, T-1\}$. According to a certain transition probability, the computing state $\Xi^m_u(t)$ transfers from one state to another. Let $\varepsilon_{S51S52}(t)$ denote the probability of transition from state $S51$ to state $S52$. The state transition probability matrix $E^m_u(t)$ is the $M \times M$ matrix with entries $\varepsilon_{S51S52}(t) = \Pr\left(\Xi^m_u(t+1) = S52 \mid \Xi^m_u(t) = S51\right)$, $S51, S52 \in \Pi$. The execution time of task $T_u$ on MEC server $m$ is $n_u / \Xi^m_u(t)$. The computing rate then follows from the task size $o_u$ and this execution time, subject to a summation constraint.
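A small worked example helps fix the units in the computing model. The sketch assumes that the allocated computing power is measured in CPU cycles per second, so that the execution time is the cycle count divided by the allocated capacity, and that the computing rate is the task size divided by that time. Both function names and the numbers are illustrative assumptions.

```python
def execution_time(cpu_cycles, capacity_hz):
    """Seconds needed to run n_u CPU cycles at the allocated capacity (cycles/s)."""
    if capacity_hz <= 0:
        raise ValueError("allocated capacity must be positive")
    return cpu_cycles / capacity_hz

def computing_rate(task_bits, cpu_cycles, capacity_hz, assigned=1):
    """Assumed form: bits processed per second = a * o_u / execution time."""
    return assigned * task_bits / execution_time(cpu_cycles, capacity_hz)

# A task of 8 Mbit needing 2e9 cycles on a server share of 4e9 cycles/s.
t_exec = execution_time(cpu_cycles=2e9, capacity_hz=4e9)
rate = computing_rate(task_bits=8e6, cpu_cycles=2e9, capacity_hz=4e9)
```

Here the task finishes in half a second, which corresponds to a computing rate of 16 Mbit/s for this user.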

Problem Equation
Based on the satellite coverage and transmission model, communication link model, caching model, and computing model established in Section 3, this section models the allocation of multi-dimensional resources in the SIN as a deep reinforcement learning process. Next, the state set, action set, reward function, and A3C algorithm flow in the process of deep reinforcement learning are analyzed.

State Set
The state set of the SIN includes the elevation state between user and satellite, transmission state of GEO data relay satellite, communication link state, caching state, and computing state. Therefore, the state set S (t) of time t can be expressed as where Γ c

Action Set
In the dynamically changing SIN, we use a deep reinforcement learning algorithm to decide which LEO satellite is connected to user $u$, whether the tasks of user $u$ need a GEO data relay satellite for transmission, whether the tasks of user $u$ are cached, and which MEC server is used to compute the tasks of user $u$. Therefore, the set of actions at time $t$ consists of the four components $ComA^l_u(t)$, $ComA^{lg}_l(t)$, $CaA^c_u(t)$, and $CompA^m_u(t)$, where the following apply:
(1) $ComA^l_u(t)$: when $ComA^l_u(t) = 0$, user $u$ is not connected to LEO satellite $l$ at time $t$; otherwise, $ComA^l_u(t) = 1$. In this paper, it is assumed that only one LEO satellite is connected to user $u$ at any time; thus, $\sum_{l \in la} ComA^l_u(t) = 1$, $\forall u \in ua$.
(2) $ComA^{lg}_l(t)$: when $ComA^{lg}_l(t) = 0$, the task is not transmitted by GEO data relay satellite $lg$; otherwise, $ComA^{lg}_l(t) = 1$. In this paper, it is assumed that only one GEO data relay satellite is connected to the LEO satellite at any time; thus, $\sum_{lg \in lga} ComA^{lg}_l(t) = 1$, $\forall l \in la$.
(3) $CaA^c_u(t)$: when $CaA^c_u(t) = 0$, the task is not cached by cache $c$; otherwise, $CaA^c_u(t) = 1$. In this paper, it is assumed that only one cache stores a specified task at any time; thus, $\sum_{c \in ca} CaA^c_u(t) = 1$, $\forall u \in ua$.
(4) $CompA^m_u(t)$: when $CompA^m_u(t) = 0$, the task is not handed over to MEC server $m$ for computing; otherwise, $CompA^m_u(t) = 1$. In this paper, it is assumed that only one MEC server computes a specified task at any time; thus, $\sum_{m \in ma} CompA^m_u(t) = 1$, $\forall u \in ua$.
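The four action components and their sum-to-one constraints amount to a one-hot encoding of each decision. The following sketch shows one possible encoding for a single user; the dictionary keys, function names, and resource counts are hypothetical and only mirror the constraints stated above.

```python
def one_hot(index, size):
    """Indicator vector with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if k == index else 0 for k in range(size)]

def build_action(leo, geo, cache, mec, sizes):
    """Encode one user's joint decision as four indicator vectors."""
    n_leo, n_geo, n_cache, n_mec = sizes
    return {
        "ComA_leo": one_hot(leo, n_leo),    # which LEO satellite serves the user
        "ComA_geo": one_hot(geo, n_geo),    # which GEO relay serves that LEO satellite
        "CaA": one_hot(cache, n_cache),     # which cache stores the task
        "CompA": one_hot(mec, n_mec),       # which MEC server computes the task
    }

def is_feasible(action):
    """Check that every indicator vector is binary and sums to exactly 1."""
    return all(set(v) <= {0, 1} and sum(v) == 1 for v in action.values())

action = build_action(leo=2, geo=0, cache=1, mec=3, sizes=(4, 2, 3, 5))
```

A policy network can equivalently output one categorical choice per component, which satisfies the constraints by construction.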

Reward Function
According to Reference [37], the SDN managers of the SIN need to pay for the use of LEO satellite $l$, GEO data relay satellite $lg$, cache $c$, and MEC server $m$. It is assumed that they pay $\delta_l$ per Hz to the LEO satellite, $\delta_{lg}$ per Hz to the GEO data relay satellite, $\varsigma_c$ per unit of storage space to the cache, and $\eta_m$ per joule to the MEC server.
In addition, the SIN managers charge users for information transmission, caching, and computing. Suppose $\tau_u$ is charged per bit of transmitted information, $\kappa_u$ per bit of cached information, and $\varphi_u$ per bit of computed information. In the reward function, $e_m$ represents the energy consumed by one CPU cycle. We define the reward function $R_u(t)$ as the expected benefit per unit resource at time $t$, that is, the ratio of the fee charged to the user to the fee paid to obtain the resources. The higher the value of $R_u(t)$ is, the higher the utilization rate of resources will be.
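The ratio structure of the reward can be illustrated with a deliberately simple sketch. It assumes that revenue is the sum of per-bit charges times the bits transmitted, cached, and computed, and that cost is the sum of per-unit prices times the bandwidth, storage, and energy consumed (with energy taken as energy per cycle times the number of cycles). The exact terms of the paper's reward function may differ, and every number below is made up.

```python
def reward(revenue_terms, cost_terms):
    """Ratio of the fees charged to the user over the fees paid for resources."""
    revenue = sum(price * amount for price, amount in revenue_terms)
    cost = sum(price * amount for price, amount in cost_terms)
    if cost <= 0:
        raise ValueError("resource cost must be positive")
    return revenue / cost

# (unit price, quantity) pairs; all values are illustrative assumptions.
revenue_terms = [
    (2e-6, 1e6),   # tau_u per bit   * bits transmitted
    (1e-6, 1e6),   # kappa_u per bit * bits cached
    (3e-6, 1e6),   # phi_u per bit   * bits computed
]
cost_terms = [
    (1e-6, 1e6),        # delta_l per Hz   * LEO bandwidth in Hz
    (1e-6, 1e6),        # delta_lg per Hz  * GEO relay bandwidth in Hz
    (1e-6, 1e6),        # varsigma_c per storage unit * storage used
    (1.0, 1e-9 * 1e9),  # eta_m per joule  * (e_m joules per cycle * n_u cycles)
]
r = reward(revenue_terms, cost_terms)
```

A value above 1 means the allocation earns more than it costs, which is the sense in which a higher reward indicates better resource utilization.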

A3C Algorithm
In this paper, we need to consider the coverage of the LEO satellite, the transmission status of the GEO data relay satellite, the communication link status, the cache status, and the computing power of the MEC server. Moreover, the SIN is a dynamic network system that changes constantly. Therefore, this paper adopts the A3C algorithm, a deep reinforcement learning algorithm that combines a value function with a strategy (policy) gradient. The actor part dynamically changes the strategy according to the learned value function. The critic part estimates the current state (action) value function and evaluates the actor's strategy [38]. The basic framework of the A3C algorithm based on the SIN is shown in Figure 5.

In the A3C algorithm, we first define the learning strategy as $\iota$. The value function $V^{\iota}(s)$ and the action value function $Q^{\iota}(s, a)$ are used to judge the learning strategy. The value function $V^{\iota}(s)$ of the initial state $s$ is the expected discounted sum of rewards, where $E_{\iota}[\cdot]$ represents the mathematical expectation under given state transition probabilities and learning strategy, $R_u(t)$ represents the reward function, and the discount factor, which lies in $[0, 1]$, measures the role of the reward function in the value function: the farther a reward is from the current state, the smaller its weight. Each strategy represents a mapping from the state space to the action space, i.e., $a = \iota(s)$. The action value function $Q^{\iota}(s, a)$ is defined analogously for a given state and action.

The Actor network can be described in three parts. Assuming that the network parameter of the Actor part is $\Theta$, we have (1) the revenue function $J(\Theta)$; (2) the derivative of the strategy function, $\nabla_{\Theta} \iota_{\Theta}(s, a) = \iota_{\Theta}(s, a) \nabla_{\Theta} \log \iota_{\Theta}(s, a)$; and (3) the update of the revenue function through its gradient. For the Critic part, let the network parameter be $\Theta_c$. When the Actor network and the Critic network are finally determined, $V_{\Theta_c}(s) \approx V^{\iota_{\Theta}}(s)$.

The optimal strategy obtained through the Actor and Critic networks is the same. Therefore, the gradients of the two should be equal, i.e., $\nabla_{\Theta_c} V_{\Theta_c}(s) = \nabla_{\Theta} \log \iota_{\Theta}(s, a)$.
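For reference, the textbook forms of the value function, the action value function, and the strategy-derivative identity can be written as below. This is a sketch under assumptions rather than a reproduction of the paper's own equations: the symbol $\gamma$ for the discount factor, the infinite-horizon sum, and the initial state and action $s_0$, $a_0$ are assumed, while $\iota$, $R_u(t)$, and $\Theta$ follow the text.

```latex
% Assumed notation: \gamma is the discount factor; s_0 and a_0 are the initial state and action.
\begin{align}
V^{\iota}(s) &= E_{\iota}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_u(t) \,\middle|\, s_0 = s \right],
  \qquad \gamma \in [0, 1], \\
Q^{\iota}(s, a) &= E_{\iota}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_u(t) \,\middle|\, s_0 = s,\ a_0 = a \right], \\
\nabla_{\Theta} \log \iota_{\Theta}(s, a) &= \frac{\nabla_{\Theta} \iota_{\Theta}(s, a)}{\iota_{\Theta}(s, a)}
  \;\Longrightarrow\;
  \nabla_{\Theta} \iota_{\Theta}(s, a) = \iota_{\Theta}(s, a)\, \nabla_{\Theta} \log \iota_{\Theta}(s, a).
\end{align}
```

The last line is the chain rule applied to $\log \iota_{\Theta}$, which is why the gradient of the strategy can be expressed through the gradient of its logarithm.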

After the above deduction, we define a loss function $\varepsilon$. The loss function attains its minimum where its derivative is 0, so $\nabla_{\Theta_c} \varepsilon = 0$. Further derivation from this condition yields the gradient of the revenue (income) function $J(\Theta)$.

The network parameter of the Actor part is $\Theta$ and that of the Critic part is $\Theta_c$. Since there are multiple threads in the A3C algorithm, each thread holds its own copies of the two parameters, $\Theta'$ and $\Theta'_c$. The global counter is set to $T = 0$, and each thread has its own counter $t$. The procedure of the A3C algorithm is summarized in the listing below.
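The split between global parameters, thread-specific copies, and the two counters can be sketched as follows. This is an illustrative toy, not the paper's implementation: `GlobalState`, `worker`, `run`, `T_MAX`, the scalar parameters, and the placeholder "gradients" (which simply pull each parameter toward 1.0) are all assumptions made to keep the example runnable without a neural network.

```python
import threading

T_MAX = 40  # hypothetical global step budget


class GlobalState:
    """Shared parameters Theta, Theta_c and the global counter T."""

    def __init__(self):
        self.theta = 0.0
        self.theta_c = 0.0
        self.T = 0
        self.lock = threading.Lock()


def worker(shared, steps_per_rollout=5, lr=0.1):
    t = 1  # thread-local step counter
    while True:
        with shared.lock:
            if shared.T >= T_MAX:
                return
            # Synchronize the thread-specific copies Theta' and Theta_c'.
            theta_local, theta_c_local = shared.theta, shared.theta_c
        d_theta = d_theta_c = 0.0  # reset accumulated gradients
        for _ in range(steps_per_rollout):
            # Placeholder gradients (assumption): pull parameters toward 1.0.
            d_theta += 1.0 - theta_local
            d_theta_c += 1.0 - theta_c_local
            t += 1
        with shared.lock:
            # Asynchronous update of the global parameters and counter.
            shared.theta += lr * d_theta / steps_per_rollout
            shared.theta_c += lr * d_theta_c / steps_per_rollout
            shared.T += steps_per_rollout


def run(num_threads=4):
    shared = GlobalState()
    threads = [threading.Thread(target=worker, args=(shared,))
               for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared
```

Because several threads may pass the budget check before any of them finishes its rollout, the final value of `T` can exceed `T_MAX` by a few rollouts; this mirrors the asynchronous nature of the updates.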

Algorithm: Asynchronous Advantage Actor-Critic
    Initialize thread step counter t ← 1
    repeat
        Reset gradients: dΘ ← 0 and dΘ_c ← 0
        Synchronize thread-specific parameters Θ′ = Θ and Θ′_c = Θ_c
        t_start = t
        Get state S_t
        repeat
            Perform a_u(t) according to policy ι(a_u(t) | S_t; Θ′)
            Receive reward R_u(t) and new state S_{t+1}
            t ← t + 1
        until a termination state is reached or t − t_start = t_max
        Perform asynchronous update of Θ using dΘ and of Θ_c using dΘ_c
    until T > T_max

(1) Thread counters are initialized to $t = 1$. The global network parameters $\Theta$ and $\Theta_c$ are used to initialize the thread-specific parameters $\Theta'$ and $\Theta'_c$.
(2) Iterate sequentially until the maximum number of executions $t_{max}$ is reached or a termination state is encountered. In each iteration, the action $a_u(t)$ is obtained from the strategy function $\iota(a_u(t) \mid S_t; \Theta')$. Executing this action gives the next state $S_{t+1}$ and the corresponding reward value $R_u(t)$. The value function of each state is then solved by the Critic network: it is 0 in case of termination and $V(S_t; \Theta'_c)$ in the general situation. Update the counters: $t = t + 1$, $T = T + 1$.
(3) Sampling may run for the full $t_{max}$ steps or end early. The Bellman equation is used to calculate the value function for each sampling result, and the network parameters of the Actor and the Critic are updated by gradient.
(4) After the number of iterations is reached, the thread-specific parameters $\Theta'$ and $\Theta'_c$ are used to update the global network parameters $\Theta$ and $\Theta_c$ of the whole Actor and Critic parts.
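Steps (2) and (3) can be illustrated with a small sketch of the backward Bellman recursion used in standard A3C. The function names, the symbol `gamma` for the discount factor, and the squared-error critic loss are assumptions taken from the commonly published form of the algorithm, not necessarily the exact expressions used in this paper.

```python
from typing import List, Tuple


def n_step_returns(rewards: List[float], bootstrap_value: float,
                   gamma: float) -> List[float]:
    """Discounted returns for one rollout, computed backward in time.

    bootstrap_value is 0 if the rollout ended in a termination state,
    otherwise the Critic's estimate V(S_t; Theta_c') of the last state.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for i in reversed(range(len(rewards))):
        # Bellman recursion: R_i = r_i + gamma * R_{i+1}
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def advantages_and_critic_loss(rewards: List[float], values: List[float],
                               bootstrap_value: float,
                               gamma: float) -> Tuple[List[float], float]:
    """Advantage (return minus Critic estimate) per step, plus the
    summed squared error that the Critic would minimize."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    advantages = [r - v for r, v in zip(returns, values)]
    critic_loss = sum(a * a for a in advantages)
    return advantages, critic_loss
```

In a full implementation the advantages would weight the Actor's log-probability gradients, while the squared error would drive the Critic update; both gradients are accumulated in the thread before the asynchronous update in step (4).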

Simulation Parameter Setting
In the experiment, the hardware environment was an Intel Core i7-8750 CPU with 8 GB of memory and 1 TB of hard disk space. The software environment was Python 3.6.1 with TensorFlow 1.4.0 and MATLAB R2014a [39].
We assumed that there were three GEO data relay satellites, five LEO communication satellites, seven MEC servers, and seven caches. The altitudes of the five LEO satellites were 500 km, 780 km, 1000 km, 1200 km, and 1400 km.

The elevation angle between user $u$ and LEO satellite $l$ follows a Markov chain model with five states: excellent ($w_u^l = 10$), good ($w_u^l = 8$), medium ($w_u^l = 6$), low ($w_u^l = 4$), and extremely bad ($w_u^l = 2$). Transitions between these states are governed by an assumed elevation state transition probability matrix. Similarly, the communication efficiency between user $u$ and satellite $l$ has five states, with spectrum utilization ratio $v_u^l(t) = 10$ (excellent), $v_u^l(t) = 8$ (good), $v_u^l(t) = 5$ (medium), $v_u^l(t) = 1$ (low), and $v_u^l(t) = 0.2$ (extremely bad), together with its own state transition probability matrix.

For a given space task, whether it requires GEO relay satellite transmission also follows a Markov chain model with its own state transition probability matrix, as does the cache state of the space task. For the MEC servers, the computing rate is $\Xi_u^m(t) = 50$ when the computing state is excellent, $\Xi_u^m(t) = 30$ when good, $\Xi_u^m(t) = 10$ when medium, $\Xi_u^m(t) = 3$ when low, and $\Xi_u^m(t) = 0.5$ when extremely bad, again with a corresponding state transition probability matrix. The remaining parameters in the simulation are shown in Table 2.

Table 2. Simulation parameter setting. LEO: low Earth orbit; GEO: geostationary orbit; CPU: central processing unit.

In this experiment, we simulated the expected benefits of unit resources in the following six situations:
(1) Unified consideration of the LEO satellite elevation state, communication link state, GEO data relay satellite transmission state, caching state, and computing state, denoted as the A3C-based all scheme.
(2) Unified consideration of the GEO data relay satellite transmission state, caching state, and computing state, without the LEO satellite elevation state and communication link state, denoted as the A3C-based without coverage communication scheme.
(3) Unified consideration of the LEO satellite elevation state, communication link state, caching state, and computing state, without the GEO data relay satellite transmission state, denoted as the A3C-based without GEO communication scheme.
(4) Unified consideration of the LEO satellite elevation state, communication link state, GEO data relay satellite transmission state, and computing state, without the caching state, denoted as the A3C-based without caching scheme.
(5) Unified consideration of the LEO satellite elevation state, communication link state, GEO data relay satellite transmission state, and caching state, without the computing state, denoted as the A3C-based without computing scheme.
(6) Direct allocation of resources under static network conditions, denoted as the A3C-based no scheme [40].
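The Markov-chain state models and the six compared schemes described above can be sketched as follows. The state values are those quoted in the text; everything else is an assumption made for illustration: the transition matrix `P_EXAMPLE` is a hypothetical placeholder (not the paper's matrix), the component names in `ALL_COMPONENTS` are invented labels, and treating the "no" scheme as observing no dynamic state is one possible reading of "static network conditions".

```python
import random

# State values quoted in the text, ordered from "excellent" to "extremely bad".
LEVELS = {
    "elevation": [10, 8, 6, 4, 2],      # weight w of the user-LEO elevation angle
    "link": [10, 8, 5, 1, 0.2],         # spectrum utilization ratio v
    "computing": [50, 30, 10, 3, 0.5],  # MEC computing rate Xi
}

# HYPOTHETICAL 5x5 transition matrix (each row sums to 1); placeholder values only.
P_EXAMPLE = [
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.20, 0.50, 0.20, 0.05, 0.05],
    [0.10, 0.20, 0.40, 0.20, 0.10],
    [0.05, 0.05, 0.20, 0.50, 0.20],
    [0.05, 0.05, 0.10, 0.20, 0.60],
]


def simulate_chain(levels, matrix, steps, start=0, seed=0):
    """Sample a path of state values from a discrete-time Markov chain."""
    rng = random.Random(seed)
    index = start
    path = [levels[index]]
    for _ in range(steps):
        # Row `index` holds the probabilities of moving to each next state.
        index = rng.choices(range(len(levels)), weights=matrix[index])[0]
        path.append(levels[index])
    return path


ALL_COMPONENTS = ("elevation", "link", "geo_relay", "caching", "computing")

# Dynamic states taken into account by each compared scheme.
SCHEMES = {
    "A3C-based all": set(ALL_COMPONENTS),
    "A3C-based without coverage communication":
        set(ALL_COMPONENTS) - {"elevation", "link"},
    "A3C-based without GEO communication": set(ALL_COMPONENTS) - {"geo_relay"},
    "A3C-based without caching": set(ALL_COMPONENTS) - {"caching"},
    "A3C-based without computing": set(ALL_COMPONENTS) - {"computing"},
    "A3C-based no": set(),  # assumption: no dynamic state is observed
}
```

For example, `simulate_chain(LEVELS["elevation"], P_EXAMPLE, steps=10)` returns a list of eleven elevation weights starting from the excellent state; substituting the actual transition matrices would reproduce the state dynamics assumed in the simulation.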

Simulation Result
The simulation results are discussed below. Figure 6 shows the convergence performance under the different schemes. At the beginning of deep reinforcement learning, the expected benefit per unit resource is low. As training proceeds, the expected benefit per unit resource tends to be stable. The proposed A3C-based all scheme takes into account the coverage area of the LEO satellite, the communication link state between users and the LEO satellite, the transmission state of the GEO data relay satellite, the caching state of caches, and the computing state of the MEC server, and therefore achieves better resource utilization efficiency.

Figure 7 shows that, as the elevation angle between users and LEO satellites increases, the expected benefit per unit resource of the SIN increases gradually. Because it considers all of the states listed above, the proposed A3C-based all scheme again achieves better resource utilization efficiency.

Figure 8 shows that, as the task content increases, the cost of caching charged to users increases gradually; thus, the expected benefit per unit resource of the SIN decreases gradually. The proposed A3C-based all scheme takes into account the coverage area of the LEO satellite, the communication link state between users and the LEO satellite, the transmission state of the GEO data relay satellite, the caching state of caches, and the computing state of the MEC server, and therefore achieves better expected benefits per unit resource.

Figure 8. Expected benefits of unit resources under different task content. [Axis label: the expected utility per resource (utility/resource).]

Figure 9 shows the relationship between the unit charging price for using transmission resources and the expected benefit of the unit resource. As the unit charging price for using transmission resources increases, the expected benefit of the unit resource of the SIN increases gradually. By jointly considering the five states listed above, the A3C-based all scheme effectively improves the efficiency of unit resource utilization and has more advantages than the other schemes.

Remote Sens. 2019, 11, x FOR PEER REVIEW 20 of 23

Figure 9. The relationship between the unit charging price for using transmission resources and the expected benefit of unit resources. [Axis label: the expected utility per resource (utility/resource).]

Figure 10 shows the relationship between the unit charging price for using caching resources and the expected benefit of the unit resource. As the unit charging price for using caching resources increases, the expected benefit of the unit resource of the SIN increases gradually. Here too, the A3C-based all scheme effectively improves the efficiency of unit resource utilization and has more advantages than the other schemes.

Figure 10. The relationship between the unit charging price for using caching resources and the expected benefit of unit resources. [Axis label: the expected utility per resource (utility/resource).]

Figure 11 shows the relationship between the unit charging price for using computing resources and the expected benefit of the unit resource. As the unit charging price for using computing resources increases, the expected benefit of the unit resource of the SIN increases gradually. The proposed A3C-based all scheme again achieves better resource utilization efficiency than the other schemes.

Figure 11. The relationship between the unit charging price for using computing resources and the expected benefit of unit resources.

Conclusions
To improve the resource management and utilization efficiency of the SIN, this paper first established a hierarchical and domain-controlled SIN architecture based on the core idea of SDN, and designed the overall networking architecture and the network control architecture. On this basis, the transmission, caching, and computing resources of the SIN were managed in a unified way. Next, the satellite coverage and transmission model, communication link model, caching model, and computing model of the SIN were established and analyzed. Finally, the A3C algorithm of deep reinforcement learning was introduced to model and simulate the multi-dimensional resource allocation problem of the SIN. The simulation results show that the proposed scheme can effectively improve the expected benefits of unit resources and the utilization efficiency of SIN resources. In this paper, LEO communication satellites and several GEO data relay satellites were taken as examples for analysis. However, the SIN is a huge system. In practical applications, the scheduling of remote-sensing satellites, navigation satellites, and other resources may involve different situations that need specific analysis. In a follow-up study, we will further analyze other SIN resources, such as energy resources and sensor resources.
