Joint Drone Access and LEO Satellite Backhaul for a Space–Air–Ground Integrated Network: A Multi-Agent Deep Reinforcement Learning-Based Approach

Abstract: The space–air–ground integrated network can provide services to ground users in remote areas by utilizing high-altitude platform (HAP) drones to support stable user access and using low earth orbit (LEO) satellites to provide large-scale traffic backhaul. However, the rapid movement of LEO satellites requires dynamic maintenance of the matching relationship between LEO satellites and HAP drones. Additionally, different traffic types generated at HAP drones hold varying levels of value. Therefore, a tripartite matching problem among LEO satellites, HAP drones, and traffic types, jointly considering multi-dimensional characteristics such as remaining visible time, channel condition, and handover latency, is investigated in this paper.


Introduction
Densely deployed ground communication infrastructures can provide access services for mobile and Internet of Things (IoT) users in urban areas, with the advantages of high data rates and small propagation delay. However, deploying infrastructures in remote areas such as the ocean and desert is challenging and expensive. Various applications in remote areas, such as forest monitoring, desert communication, and maritime logistics, are difficult to serve [1,2]. There are still approximately three billion people all over the world living without Internet access, presenting an obstacle for 6G in realizing seamless connectivity and ubiquitous access [3,4]. How to achieve user access and traffic backhaul for mobile and IoT users in remote areas has become crucial [5].
Satellite communication makes up for the shortage of terrestrial networks and provides users with large-scale access services with wide coverage. The utilization of low earth orbit (LEO) satellites for global Internet access and traffic backhaul has garnered attention due to their lower development and launch cost and transmission latency compared with geostationary earth orbit (GEO) and medium earth orbit (MEO) satellites [6]. The use of inter-satellite links (ISLs) enables the traffic generated by ground IoT users to be relayed among LEO satellites and transmitted back to the terrestrial traffic center [7]. However, the severe path loss between LEO satellites and ground IoT users makes it difficult for users to directly access LEO satellites due to limited transmission power.
In order to reduce the demand for user-side transmission power, the space-air-ground integrated network has attracted a lot of attention from academia and industry in 6G [8,9]. Compared to the orbital altitude of hundreds or thousands of kilometers of LEO satellites, the altitude of drones is much lower, thus needing lower transmission power from ground IoT users [10]. In the space-air-ground integrated network, drones in the air are utilized to support user access with lower transmission energy costs, and satellites in space are used to provide traffic backhaul with global coverage [11]. They work together with communication infrastructures on the ground to provide users with various application services. In recent years, a category of drone that can provide users with more stable access, namely the high-altitude platform (HAP) drone, has become a research hotspot. Different from traditional drones, HAP drones hover at an altitude of about 20 km in the stratosphere, with base stations deployed on them to provide users with ubiquitous and stable access. HAP drones can extend communication capabilities across the space, air, and ground domains. Specifically, aerial networks composed of HAP drones are utilized to support user access and collect traffic generated by users in remote or inaccessible areas lacking communication infrastructures. Then, LEO satellites are used to support traffic backhaul to the terrestrial traffic center, thus supplying stable access and traffic backhaul [12].
Due to the advantages of low deployment cost, flexible on-demand deployment, and reliable line-of-sight communication links, HAP drones have been employed in the satellite-ground network for user access, traffic backhaul, and task execution. However, practical issues in the space-air-ground integrated network have been overlooked in existing research. For instance, due to the high mobility of LEO satellites, HAP drones need to be switched between different LEO satellites. Therefore, the calculation of the available traffic transmission time of HAP drones must jointly consider the remaining visible time and handover latency. Furthermore, different traffic types generated at HAP drones hold varying values, suggesting a preference for establishing matching for high-value traffic types first. Lastly, the assumption of a specific constant traffic generation state at HAP drones in existing research does not align with the mixed stochastic and deterministic nature of traffic generation in practice, rendering conventional static matching algorithms inapplicable [13].
Therefore, in order to address the issues mentioned above, a tripartite matching problem among LEO satellites, HAP drones, and traffic types is investigated for the space-air-ground integrated network in this paper. Specifically, the main contributions of this paper are as follows:

• First, the network architecture and working mechanism of the space-air-ground integrated network are introduced, which aims at achieving user access and traffic backhaul in remote areas. Different from the conventional static traffic generation state with deterministic variables, the traffic generation state at HAP drones is modeled as a mixture of stochasticity and determinism, which aligns with real-world scenarios.

• Then, different from the conventional schemes that treat all traffic types as equally important, we develop a tripartite matching problem among LEO satellites, HAP drones, and traffic types based on the different values of different traffic types. The problem can be decoupled into two sub-problems: traffic-drone matching and LEO-drone matching. Traffic-drone matching is simplified into multiple separate sub-subproblems through mathematical analysis, which can be addressed independently. LEO-drone matching cannot be solved by conventional optimization solvers since the traffic generation state at drones is a mixture of stochasticity and determinism; thus, reinforcement learning is adopted. Moreover, due to the significant propagation latency between the terrestrial traffic center and LEO satellites, a conventional centralized scheme cannot obtain the latest status of the network and therefore cannot devise LEO-drone matching strategies in a timely manner. In addition, the state space of the LEO-drone matching sub-problem is continuous. Therefore, a multi-agent deep reinforcement learning approach with centralized training and decentralized execution is proposed, in which the value network is centrally trained at the terrestrial traffic center and the LEO-drone matching strategy is devised in a timely manner at LEO satellites in a decentralized way.

• Finally, the convergence performance of the proposed matching approach is discussed and analyzed through simulations. In addition, the proposed algorithm is compared with state-of-the-art algorithms under different network parameters to validate its effectiveness.
The rest of the paper is organized as follows. The related works are discussed in Section 2. The system model and working mechanism are illustrated in Section 3. Section 4 formulates and simplifies the tripartite matching problem. In Section 5, the formulated problem is solved by the multi-agent deep reinforcement learning algorithm. Simulation results are presented and discussed in Section 6. Future work is summarized in Section 7. Finally, conclusions are drawn in Section 8.

Related Works
Abbasi et al. first presented the potential use cases, open challenges, and possible solutions of HAP drones for next-generation networks [14]. The main communication links between HAP drones and other non-terrestrial network (NTN) platforms, along with their advantages and challenges, are presented in [15]. Due to the rapid movement of LEO satellites, the matching relationship between HAP drones and LEO satellites is not fixed, so efficient matching and association strategies need to be developed. In [16], the matching relationship between user equipment (UE), HAP drones, and terrestrial base stations (BSs) is formulated as a mixed discrete-continuous optimization problem under the HAP drone payload connectivity constraints, HAP drone and BS power constraints, and backhaul constraints to maximize the network throughput. The formulated problem is solved using a combination of integer linear programming and generalized assignment problems. A deep Q-learning (DQL) approach is proposed in [17] to perform the user association between a terrestrial base station and a HAP drone based on the channel state information of the previous time slot. In addition to the above-mentioned UE selection between terrestrial and non-terrestrial networks, there has been relevant research on the three-party matching problem among users, HAP drones, and satellites in remote areas without terrestrial network coverage. In [18], the matching problem among users, HAP drones, and satellites is formulated to maximize the total revenue, and it is solved by a satellite-oriented restricted three-sided matching algorithm. In [19], a throughput maximization problem is formulated for ground users in an integrated satellite-aerial-ground network by comprehensively optimizing user association, transmission power, and unmanned aerial vehicle (UAV) trajectory. In [20], a UAV-LEO integrated traffic collection network is proposed to maximize the uploaded traffic volume while ensuring the energy consumption constraint by comprehensively considering bandwidth allocation, UAV trajectory design, power allocation, and LEO satellite selection. The maximum computation delay among terminals is minimized in [21] by jointly considering the matching relationship, resource allocation, and deployment location optimization; an alternating optimization algorithm based on block coordinate descent and successive convex approximation is proposed to solve it. A joint association and power allocation approach is proposed for the space-air-ground network in [22] to maximize the transmitted traffic amount while minimizing the transmit power under the constraints of the power budget and quality of service (QoS) requirements of HAP drones and the data storage and visibility time of LEO satellites. The association problem and power allocation problem are alternately addressed by the GUROBI optimizer and the whale optimization algorithm, respectively.
It is worth mentioning that reinforcement learning (RL) algorithms are widely used for HAP drone problems. HAP drones form a distributed network, and with multi-agent RL, the space-air-ground integrated network can effectively become self-organizing. In [23], a multi-agent Q-learning approach is proposed to tackle the service function chain placement problem for LEO satellite networks in a discrete-time stochastic control framework, thus optimizing the long-term system performance. In [24], a multi-agent deep reinforcement learning algorithm with global rewards is proposed to optimize the transmit power, CPU frequency, bit allocation, offloading decision, and bandwidth allocation via a decentralized method, thus achieving computation offloading and resource allocation for the LEO satellite edge computing network. In [25], the utility of HAP drones is maximized by jointly optimizing association and resource allocation, which is formulated as a Stackelberg game. The formulated problem is transformed into a stochastic game model, and a multi-agent deep RL algorithm is adopted to solve it.

System Model and Working Mechanism
In order to provide services for mobile users and IoT users in remote areas, the space-air-ground integrated network is investigated in this paper, and its network architecture is shown in Figure 1. It utilizes an aerial network composed of HAP drones to collect traffic generated by various IoT users, thus providing stable and large-scale access services for areas without ground communication infrastructures. Via a drone-LEO link and multiple LEO-LEO links, the collected traffic is then relayed to the LEO satellite connected to a ground station to achieve traffic backhaul. Finally, the ground station downloads the traffic via the LEO-ground link and transmits it back to the terrestrial traffic center for processing via optical fibers. Ground devices access HAP drones through the C-band. HAP drones are directly connected to LEO satellites through the Ka-band to achieve high-rate traffic backhaul [26].

Traffic Generation Model for HAP Drones
For the space-air-ground integrated network, the drone-LEO link needs to transmit the traffic generated by the HAP drone itself and the traffic collected from various mobile and IoT users on the ground. This traffic can be divided into traffic types generated at a determined rate, which mainly include HAP drone health status and UE location, and traffic types generated abruptly with random probability, such as malfunction diagnosis and signaling execution. Therefore, the traffic generation state at HAP drones is modeled as a mixture of stochasticity and determinism. Markov chains can be used to uniformly describe the traffic generation models of the various types, as shown in Figure 2. Specifically, the generation of each traffic type at HAP drones is modeled as a Markov chain with two states: on and off. In the on state, traffic is generated at a constant rate, whereas traffic generation ceases in the off state. Denote the self-transition probabilities of the q-th traffic type from on to on as p_{1,q} and from off to off as p_{2,q}, where q ∈ {1, 2, · · · , Q} and Q is the total number of traffic types. For traffic types generated at a constant rate, there are p_{1,q} = 1 and p_{2,q} = 0.
For traffic types generated abruptly with random probability, there are 0 < p_{1,q} < 1 and 0 < p_{2,q} < 1, which means that the state switches randomly between on and off. In addition, different traffic types have varying levels of importance in practical scenarios. For instance, the traffic carrying the remaining power of HAP drones is more valuable than other traffic types. To account for this, we introduce a value factor µ_q to represent the value of the q-th traffic type. The optimization objective is to maximize the average transmitted traffic value of the network in each time slot. Unlike the conventional approach, which treats all traffic types equally, we prioritize the transmission of high-value traffic when system resources are restricted, which aligns better with actual transmission requirements.
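As a quick illustration, the two-state on/off generation model can be simulated per traffic type. This is a minimal sketch under the stated model; the function and parameter names (`simulate_traffic`, `p_on_on`, `p_off_off`) are ours, not the paper's.

```python
import random

def simulate_traffic(p_on_on, p_off_off, rate, num_slots, start_on=True, seed=0):
    """Simulate the two-state (on/off) Markov generation model of one traffic type.

    In the on state, traffic is generated at a constant `rate` per slot; in the
    off state, no traffic is generated. p_on_on and p_off_off correspond to the
    self-transition probabilities p_{1,q} and p_{2,q} in the text.
    """
    rng = random.Random(seed)
    on = start_on
    generated = []
    for _ in range(num_slots):
        generated.append(rate if on else 0.0)
        if on:
            on = rng.random() < p_on_on      # stay on with probability p_{1,q}
        else:
            on = rng.random() >= p_off_off   # leave off with probability 1 - p_{2,q}
    return generated

# Deterministic type: p_{1,q} = 1, p_{2,q} = 0 -> stays on once started.
deterministic = simulate_traffic(1.0, 0.0, rate=5.0, num_slots=4)

# Bursty type: 0 < p_{1,q}, p_{2,q} < 1 -> random switching between on and off.
bursty = simulate_traffic(0.7, 0.6, rate=5.0, num_slots=100)
```

Note how the deterministic special case p_{1,q} = 1, p_{2,q} = 0 falls out of the same machinery, which is why the paper can treat both traffic classes uniformly.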

Traffic Transmission Model between LEO Satellites and HAP Drones
Suppose that there are I HAP drones at an altitude of h_1 and J LEO satellites at an altitude of h_2 in the space-air-ground integrated network. The LEO satellite set is denoted as J = {1, 2, · · · , J}, and the HAP drone set is denoted as I = {1, 2, · · · , I}. Each HAP drone is equipped with an omnidirectional antenna, and each LEO satellite is equipped with L steerable beams. The time interval is divided into M time slots with a length of M_0, and the time slot set is denoted as M = {1, 2, · · · , M}. When M_0 is sufficiently small, the matching between LEO satellites and HAP drones in each time slot can be treated as quasi-static. In each time slot, one LEO satellite beam can provide services for no more than one HAP drone, and one HAP drone can establish a connection with at most one LEO satellite. We define a LEO-drone matching matrix X_{I×J}[m] to describe the matching relationship between LEO satellites and HAP drones in the m-th time slot. If the i-th HAP drone is served by the j-th LEO satellite in the m-th time slot, there is x_{i,j}[m] = 1; otherwise, x_{i,j}[m] = 0.

This work focuses on mobile users and IoT users in depopulated regions with almost no obstacles. Therefore, small-scale fading due to multi-path effects can be neglected. The channel gain from the i-th HAP drone to the j-th LEO satellite in the m-th time slot can be expressed as follows [27]:

h_{i,j}[m] = ( c / (4π f_c d_{i,j}[m]) )²,

where c and f_c represent the speed of light and the carrier frequency, respectively, and d_{i,j}[m] represents the distance between the i-th HAP drone and the j-th LEO satellite in the m-th time slot. Based on this, the traffic transmission rate between the i-th HAP drone and the j-th LEO satellite can be expressed as follows:

R_{i,j}[m] = W log₂( 1 + P_h G_i G_j h_{i,j}[m] / (k_B T_b W) ),

where W is the bandwidth of LEO beams, P_h is the transmit power of HAP drones, and G_i and G_j represent the antenna gains of the transmitter of the HAP drone and the receiver of the LEO satellite, respectively [28]. k_B is Boltzmann's constant, and T_b is the system noise temperature. When the channel gain between the HAP drone and the j-th LEO
satellite exceeds a given threshold h_0, this HAP drone is considered to be within the visible range of the j-th LEO satellite. In the m-th time slot, the set of HAP drones within the visible range of the j-th LEO satellite can be expressed as I_j[m] = { i ∈ I | h_{i,j}[m] ≥ h_0 }. As a result of the high-speed movement of LEO satellites, handover is required when a HAP drone moves outside the visible range of its LEO satellite. HAP drones are unable to send traffic to LEO satellites during the handover duration T_h, which can be approximated from the signaling volume κ that needs to be transmitted between the HAP drone and the LEO satellite during handover. The available traffic transmission time T_{i,j}[m] in the m-th time slot can then be obtained from T_h and the remaining visible time T^remain_{i,j} between the j-th LEO satellite and the i-th HAP drone. In each time slot, a HAP drone can only choose one of the Q traffic types for transmission. We define a traffic-drone matching matrix Y_{I×Q}[m] to describe the transmission status of different traffic types at each HAP drone in the m-th time slot. If the q-th traffic type of the i-th HAP drone is sent in the m-th time slot, there is y_{i,q}[m] = 1; otherwise, y_{i,q}[m] = 0.
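Assuming the free-space path-loss gain and the Shannon-capacity rate that the symbol definitions above suggest (with noise power k_B T_b W), the drone-LEO link budget can be sketched as follows. All numeric values in the usage line are hypothetical placeholders, not parameters from the paper.

```python
import math

C = 3.0e8           # speed of light c (m/s)
K_B = 1.380649e-23  # Boltzmann's constant k_B (J/K)

def channel_gain(d, f_c):
    """Free-space path gain h between a HAP drone and an LEO satellite,
    for distance d (m) and carrier frequency f_c (Hz)."""
    return (C / (4 * math.pi * f_c * d)) ** 2

def transmission_rate(d, f_c, w, p_h, g_i, g_j, t_b):
    """Shannon-capacity rate R (bit/s) of the drone-LEO link.

    w: beam bandwidth W (Hz), p_h: drone transmit power P_h (W),
    g_i/g_j: transmit/receive antenna gains, t_b: noise temperature T_b (K).
    """
    snr = p_h * g_i * g_j * channel_gain(d, f_c) / (K_B * t_b * w)
    return w * math.log2(1 + snr)

# Hypothetical Ka-band numbers: 600 km slant range, 20 GHz carrier, 100 MHz
# beam bandwidth, 10 W transmit power, 40 dBi (1e4) antennas, 300 K noise.
rate = transmission_rate(600e3, 20e9, 100e6, 10.0, 1e4, 1e4, 300.0)
```

Doubling the distance quarters the gain, which is why the visibility threshold h_0 effectively bounds the usable slant range.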
Thus, in the m-th time slot, the maximum traffic volume from the q-th traffic type of the i-th HAP drone to the j-th LEO satellite can be obtained from the transmission rate R_{i,j}[m] and the available transmission time T_{i,j}[m]. The transmitted traffic value of the q-th traffic type of the i-th HAP drone in the m-th time slot is then determined by the value factor µ_q and S_{i,q}[m], where S_{i,q}[m] is the traffic volume of the q-th traffic type stored at the i-th HAP drone in the m-th time slot. S_{i,q}[m] evolves with G_{i,q}[m − 1], the traffic volume of the q-th traffic type newly generated at the i-th HAP drone in the (m − 1)-th time slot, which is a random variable that follows the traffic generation model defined in Section 3.1. The total transmitted traffic value of the space-air-ground integrated network is then the sum of the transmitted traffic values over all HAP drones and traffic types.
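A minimal sketch of the per-type transmitted value: the drone can send at most the link's maximum volume R_{i,j}[m] · T_{i,j}[m] and at most what it has stored. The `min(...)` capping is our reading of the model, and the function name is illustrative.

```python
def transmitted_value(mu_q, stored, rate, avail_time):
    """Transmitted traffic value of one traffic type in one slot.

    mu_q: value factor of the type; stored: stored volume S_{i,q}[m];
    rate: link rate R_{i,j}[m]; avail_time: available time T_{i,j}[m].
    The sent volume is capped by both the stored volume and the link's
    maximum volume rate * avail_time (our assumption).
    """
    max_volume = rate * avail_time
    return mu_q * min(stored, max_volume)

# A drone with 40 Mbit stored on a 10 Mbit/s link available for 3 s can send
# 30 Mbit; with value factor 2.0 the transmitted value is 60.0.
v = transmitted_value(mu_q=2.0, stored=40.0, rate=10.0, avail_time=3.0)
```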

Problem Formulation and Transformation
The optimization objective is to establish tripartite matching among LEO satellites, HAP drones, and traffic types by choosing the most suitable LEO-drone matching matrix X_{I×J}[m] and traffic-drone matching matrix Y_{I×Q}[m] in each time slot, so as to maximize the average transmitted traffic value of the network. The objective function is formulated in (10a), subject to constraints (10b)-(10f). Constraint (10b) specifies that each HAP drone can connect to a maximum of one LEO satellite in each time slot. Constraint (10c) specifies that the number of HAP drones served by each LEO satellite is equal to the beam number L. Note that although each LEO satellite could serve fewer than L HAP drones, this would lead to inefficient use of satellite beams; thus, in order to maximize the average transmitted traffic value of the network in each time slot, all beams of each satellite are utilized. Constraint (10d) specifies that each HAP drone can transmit a maximum of one traffic type in each time slot. Constraints (10e) and (10f) are binary restrictions on the elements of the LEO-drone matching matrix and the traffic-drone matching matrix, respectively.
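A minimal feasibility check for constraints (10b)-(10f), using plain list-of-lists matrices; the function name and interface are ours, for illustration only.

```python
def check_matching(X, Y, L):
    """Verify constraints (10b)-(10f) for a candidate matching.

    X: I x J LEO-drone matching matrix, Y: I x Q traffic-drone matching
    matrix, L: number of beams per LEO satellite.
    """
    I, J = len(X), len(X[0])
    binary = (all(v in (0, 1) for row in X for v in row) and
              all(v in (0, 1) for row in Y for v in row))                 # (10e), (10f)
    one_leo = all(sum(X[i][j] for j in range(J)) <= 1 for i in range(I))  # (10b)
    full_beams = all(sum(X[i][j] for i in range(I)) == L for j in range(J))  # (10c)
    one_type = all(sum(row) <= 1 for row in Y)                            # (10d)
    return binary and one_leo and full_beams and one_type

# Two drones, one satellite with L = 2 beams: both drones connect, and each
# transmits at most one traffic type -> feasible.
ok = check_matching([[1], [1]], [[1, 0], [0, 0]], L=2)
```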
The formulated problem (10a) is a mixed integer nonlinear programming problem. In the following content, we analyze and simplify it. Given a specific X_{I×J}[m] and substituting (9) into (10a), the original problem reduces to (11a). Through analysis, it becomes evident that each term of (11a) is solely dependent on the matching {y_{i,q}[m] | q ∈ Q} between the i-th HAP drone and all traffic types in the m-th time slot, and is independent of the matching between other HAP drones and traffic types in other time slots. Consequently, maximizing (11a) can be achieved by maximizing each term within the brackets of (11a). Thus, (11a) can be rephrased as (12a). Formulation (12a) is equivalent to optimizing I × M independent sub-subproblems. For ∀i ∈ I, ∀m ∈ M, the sub-subproblem can be formulated as (13a), and its feasible region can be expressed as (14) or (15). Regarding the former, the optimal value of (13a) is 0, whereas for the latter, the optimal value is greater than or equal to 0. Hence, the optimal solution of (13a) must adhere to (15), so as to maximize the objective function. By substituting (15) into (13a), it is equivalent to addressing (16), whose optimal solution is given in (17). Based on this, the optimal solution of (11a) can be expressed as (18). At this point, we have successfully decomposed the optimization sub-problem (11a) into I × M independent sub-subproblems through mathematical analysis. The optimal traffic-drone matching matrix Y_{I×Q}[m] can be obtained according to (18). Intuitively, once the LEO-drone matching of each time slot is determined, the maximum average transmitted traffic value of the network can be achieved by choosing the traffic type with the highest value for each HAP drone to transmit.
Substituting the optimal solution (18) into the objective function (10a) yields sub-problem (19a), which is solely associated with the LEO-drone matching matrix X_{I×J}[m].
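Once the LEO-drone matching is fixed, the optimal traffic-drone matching per drone and per slot is just an argmax over transmitted traffic values, as the decomposition above shows. A sketch (the dict-based interface is illustrative):

```python
def best_traffic_type(values):
    """Given the transmitted traffic value of each type at one drone in one
    slot, return the 0/1 row of Y selecting the single most valuable type."""
    q_star = max(values, key=values.get)
    if values[q_star] <= 0:
        # Nothing worth transmitting: leave the whole row at 0.
        return {q: 0 for q in values}
    return {q: int(q == q_star) for q in values}

row = best_traffic_type({0: 1.5, 1: 4.0, 2: 0.7})
# Exactly one type is selected, satisfying constraint (10d).
```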

Problem Solving and Algorithm Designing
Typically, conventional optimization solvers are employed to solve problems with deterministic variables [29]. Problems with random variables are difficult to solve using these solvers. Nevertheless, the tripartite matching problem that this paper focuses on is a mixture of stochasticity and determinism. Therefore, we adopt reinforcement learning to dynamically solve the LEO-drone matching sub-problem (19a). Specifically, the matching between each LEO satellite and HAP drones is modeled as a Markov decision process [30], where each LEO satellite is treated as an agent. The state, action, and reward of the j-th LEO satellite are defined as follows:

• State: s_j[m] = { T_{i,j}[m], R_{i,j}[m], S_{i,q}[m − 1], G_{i,q}[m − 1] | i ∈ I, q ∈ Q }. In the m-th time slot, the j-th LEO satellite obtains the state of each HAP drone within its visible range, which includes the available traffic transmission time T_{i,j}[m] and the traffic transmission rate R_{i,j}[m] of the current m-th time slot, as well as the stored traffic volume S_{i,q}[m − 1] and the traffic generation rate G_{i,q}[m − 1] of each traffic type in the previous (m − 1)-th time slot. For a HAP drone i_0 that is not within the visible range of the j-th LEO satellite, i.e., i_0 ∉ I_j[m], the corresponding state entries are set to 0.

• Action: In the m-th time slot, the action of the j-th LEO satellite is to determine which L HAP drones to provide services for. If multiple LEO satellites decide to provide services to the same HAP drone, this HAP drone will actively choose to connect to the LEO satellite with the highest transmitted traffic value.

• Reward: In the m-th time slot, the reward obtained by the j-th LEO satellite after taking action a_j[m] in state s_j[m] is defined as the total transmitted traffic value of the j-th LEO satellite in the current time slot.
Then, reinforcement learning is employed to solve (19a) based on the above definitions. The discounted return of the j-th LEO satellite in the m-th time slot is defined with a discount rate γ ∈ [0, 1), which is used to balance the impact of short-term and long-term rewards. If γ is close to 0, the discounted return mainly depends on recent rewards. Conversely, if γ approaches 1, the discounted return primarily depends on future rewards. Q-values can be used to evaluate the expectation of the return that the j-th LEO satellite can achieve by taking action a_j based on policy π_j in state s_j. In conventional Q-learning, the Q-values of the optimal policy π*_j can be continuously updated through iterations. Generate an episode of length T_max. For the t-th iteration, the Q-value of the state-action pair (s_j^t, a_j^t) can be obtained as follows [31]:

q_j^{t+1}(s_j^t, a_j^t) = q_j^t(s_j^t, a_j^t) − ϑ_j^t [ q_j^t(s_j^t, a_j^t) − ( r_j^t(s_j^t, a_j^t) + γ max_{a∈A_j} q_j^t(s_j^{t+1}, a) ) ],

where t ∈ {1, 2, · · · , T_max}.
Here, s_j^t represents the state at the t-th step of the episode, and a_j^t denotes the action taken in state s_j^t. ϑ_j^t represents the learning rate, and A_j denotes the action space of the j-th LEO satellite. r_j^t(s_j^t, a_j^t) denotes the average one-step immediate reward acquired after taking action a_j^t in state s_j^t. Supposing that the proposed approach converges after C iterations, the optimal policy can then be expressed accordingly [32]. The aforementioned conventional Q-learning algorithm stores the calculated Q-values q_j^t(s_j^t, a_j^t) in the form of tables, known as a Q-table, which has the advantages of being intuitive and easy to analyze. However, due to the continuous state space of (19a), using the conventional tabular Q-learning algorithm requires storing a large volume of data, thereby increasing storage costs. Furthermore, the generalization ability of the conventional Q-learning algorithm is poor. To address these issues, a deep Q-learning algorithm is employed in this paper, which is one of the earliest and most successful algorithms that introduces deep neural networks into reinforcement learning [32]. In deep Q-learning, the high-dimensional Q-table can be approximated by a deep Q network with low-dimensional parameters, thereby significantly reducing the storage cost. In addition, the Q-values of unvisited state-action pairs can be calculated through value function approximation, giving it strong generalization ability.
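The tabular update rule above can be sketched in a few lines; the state and action encodings here are placeholders, not the paper's MDP.

```python
def q_update(q_table, s, a, r, s_next, actions, lr, gamma):
    """One tabular Q-learning step, mirroring the update rule in the text:
    q <- q - lr * (q - (r + gamma * max_a' q(s', a')))."""
    target = r + gamma * max(q_table.get((s_next, a2), 0.0) for a2 in actions)
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old - lr * (old - target)
    return q_table[(s, a)]

q = {}
# Starting from q = 0 everywhere, a reward of 2.0 with lr = 0.5 moves the
# estimate halfway toward the TD target of 2.0, i.e., to 1.0.
v = q_update(q, s=0, a=1, r=2.0, s_next=1, actions=[0, 1], lr=0.5, gamma=0.9)
```

The dict-keyed table is exactly what becomes untenable with a continuous state space, motivating the deep Q network.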
In addition, the aforementioned algorithm is fully decentralized, in which each satellite calculates its Q-values according to its own local states, local actions, and local rewards. However, LEO satellites are not completely independent but influence each other. For example, if the i-th HAP drone is connected to the j-th LEO satellite at the current moment, other LEO satellites cannot provide service for this HAP drone. Therefore, the aforementioned fully decentralized reinforcement learning algorithm cannot obtain high performance and may not even converge in some cases. An alternative solution is to use a fully centralized reinforcement learning algorithm. In each time slot, each LEO satellite sends its experience obtained from its interaction with the environment to the terrestrial traffic center. Then, both value network training and strategy making are performed at the center based on global experiences. Nevertheless, the experience of each satellite must pass through multiple ISLs, an LEO-ground link, and an optical fiber link to be transmitted back to the terrestrial traffic center, facing high propagation latency. The terrestrial traffic center is unable to obtain the latest status of the space-air-ground integrated network, so it is unable to make timely LEO-drone matching strategies. To address these issues, we employ multi-agent deep reinforcement learning with centralized training and decentralized execution. The value network of each LEO satellite is trained in a centralized manner at the terrestrial traffic center. Then, the trained value networks are distributed to the corresponding LEO satellites [33]. Each satellite distributively trains its policy network based on the received value network and the latest local observations, so it can devise LEO-drone matching strategies in a timely manner.
Specifically, when training the value network, each LEO satellite sends its local experience (s_j, a_j, r_j, s′_j) obtained from its interaction with the environment to the terrestrial traffic center, where s′_j is the state reached after taking action a_j in state s_j. Based on the collected local experiences of the various LEO satellites, the terrestrial traffic center forms the global experience, including the global state s = (s_1, · · · , s_J), the global action a = (a_1, · · · , a_J), and the global reached state s′ = (s′_1, · · · , s′_J), and stores (s, a, r_j, s′) in the replay buffer D_j. Afterwards, the terrestrial traffic center trains the value network of the j-th LEO satellite based on (s, a, r_j, s′) to evaluate the quality of the matching approach. As previously mentioned, the deep Q-learning algorithm is adopted, where the true Q-values of the optimal strategy are approximated by the Q-values calculated by the trained value network, which can be obtained through the quasi-static target network scheme [34]. Specifically, two networks need to be defined: the target network q_j(S, A, ω_{j,target}) and the main network q_j(S, A, ω_{j,main}), described by parameters ω_{j,target} and ω_{j,main}, respectively, where S and A are global states and global actions collected by the terrestrial traffic center in the form of random variables. The objective of parameter iteration is to minimize the mean square error between the Q-values calculated by the target network and the main network. This can be achieved by minimizing the loss function (25), where S′ and R_j represent the reached state and the acquired reward after taking action A in state S, respectively. The gradient-descent algorithm is then adopted to minimize the objective function. The gradient of (25) can be calculated as follows:

∇_{ω_{j,main}} J_j(ω_{j,main}) = E[ ( R_j + γ max_a q_j(S′, a, ω_{j,target}) − q_j(S, A, ω_{j,main}) ) × ∇_{ω_{j,main}} q_j(S, A, ω_{j,main}) ],

where ∇_{ω_{j,main}} q_j(S, A, ω_{j,main}) can be obtained through the gradient back-propagation algorithm [35]. In each iteration, an experience batch D_j^batch is randomly sampled from the replay buffer D_j to train the value network. For each sample (s, a, r_j, s′) in D_j^batch, the parameter ω_{j,main} of the main network is updated accordingly, where β is the learning rate. After ∆ iterations, the parameter ω_{j,target} of the target network is updated as ω_{j,target} ← ω_{j,main}.
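The main/target-network training loop can be sketched with a toy linear value-function approximation. This is only an illustrative stand-in: a real deep Q network would replace `q_value` with a neural network, and the batch contents, feature vectors, and ∆ = 2 sync period below are hypothetical.

```python
def q_value(w, phi):
    """Linear value-function approximation: q(s, a) = w . phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, phi))

def train_step(w_main, w_target, batch, beta, gamma):
    """One pass over a sampled batch. Each sample is (phi, r, phis_next),
    where phi encodes (s, a) and phis_next lists phi(s', a') for every a'.
    The TD target uses the quasi-static target network; the main parameters
    follow the semi-gradient of the squared TD error with step size beta."""
    for phi, r, phis_next in batch:
        target = r + gamma * max(q_value(w_target, f) for f in phis_next)
        td_err = target - q_value(w_main, phi)
        w_main = [wi + beta * td_err * fi for wi, fi in zip(w_main, phi)]
    return w_main

w_main = [0.0, 0.0]
w_target = list(w_main)
batch = [([1.0, 0.0], 1.0, [[1.0, 0.0], [0.0, 1.0]])]
for step in range(1, 5):
    w_main = train_step(w_main, w_target, batch, beta=0.1, gamma=0.9)
    if step % 2 == 0:          # every Delta = 2 iterations, sync the target
        w_target = list(w_main)
```

Freezing the target between syncs is what keeps the TD target from chasing the parameters it is training, which is the point of the quasi-static scheme.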
Algorithm 1 presents the matching algorithm based on multi-agent deep reinforcement learning, in which ϵ-greedy is used to balance exploitation and exploration. The value network of each LEO satellite is centrally trained at the terrestrial traffic center based on the global states, global actions, and the local reward of each LEO satellite. Then, the trained value network is sent to the corresponding LEO satellite. At the j-th LEO satellite, its policy network can be trained in a decentralized manner based on its received value network with parameter ω_{j,target} and its local observations. Afterwards, each LEO satellite develops its own optimal strategy based on its trained policy network to maximize the long-term return. Finally, each LEO satellite broadcasts the matching strategy to all HAP drones within its visible range.
Algorithm 1 Matching approach based on multi-agent deep reinforcement learning.
Input: Episode length T_max, learning rate β, greedy factor ϵ, discount factor γ, and iteration number ∆; randomly initialize the parameters ω_j,main and the states s_j^1; let ω_j,target = ω_j,main, δ = 0, D_j = ∅, and D_j^batch = ∅;
Output: Optimal strategy for each LEO satellite
1: for t = 1 to T_max do
2:   for j = 1 to J do
3:     The j-th LEO satellite takes action a_j^t according to the ϵ-greedy strategy, where the optimal action is arg max_{a∈A_j} q_j(s_j^t, a, ω_j,target);
4:     Interact with the environment to obtain the reward r_j^t and the reached state s_j^{t+1};
5:   end for
6:   Form the global state s^t and the global action a^t;
7:   for j = 1 to J do
8:     Store (s^t, a^t, r_j^t, s^{t+1}) into the replay buffer D_j;
9:     Randomly sample an experience batch D_j^batch from D_j;
10:    For each sample (s, a, r_j, s') in D_j^batch, update the main-network parameter ω_j,main by the gradient-descent rule;
11:    δ ← δ + 1;
12:    if mod(δ, ∆) = 0 then
13:      Update the target network: ω_j,target ← ω_j,main;
14:      δ ← 0;
15:    end if
16:  end for
17:  for j = 1 to J do
18:    Send the trained value network q_j(s_j, a_j, ω_j,target) to the j-th LEO satellite;
19:    The j-th LEO satellite trains its own policy network based on s_j and q_j(s_j, a_j, ω_j,target);
20:    Develop the optimal strategy of the j-th LEO satellite based on its trained policy network;
21:  end for
22: end for
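Two building blocks of Algorithm 1, the ϵ-greedy action selection in step 3 and the replay-buffer sampling used during centralized training, can be sketched as follows. All names and sizes are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque

import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Explore a uniformly random action with probability epsilon;
    otherwise exploit arg max_a q_j(s_j, a, w_j_target)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

# Replay buffer D_j storing (s, a, r_j, s') tuples, and a random
# mini-batch D_j^batch drawn from it for one training iteration.
replay = deque(maxlen=10_000)
replay.append(("s0", 2, 1.0, "s1"))
batch = random.sample(list(replay), k=min(32, len(replay)))
```

With a decaying ϵ, early episodes favor exploration of the matching space while later episodes converge to exploiting the learned Q-values.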

Simulation Results
In order to verify the effectiveness of the proposed matching algorithm, preliminary simulations are conducted. The main simulation parameters are listed in Table 1. We compare the proposed approach with some state-of-the-art algorithms, including deep deterministic policy gradient (DDPG), deep Q-network (DQN), and two greedy methods.

• For the first greedy method (abbreviated as Greedy 1), each LEO satellite chooses the L HAP drones with the highest channel gains within its visible range to establish connections.
• For the second greedy method (abbreviated as Greedy 2), each LEO satellite chooses the L HAP drones with the longest remaining visible time within its visible range to establish connections.
• For both Greedy 1 and Greedy 2, each HAP drone that has established a connection with an LEO satellite chooses the traffic type with the largest transmitted traffic value for transmission.

Figure 3 illustrates the transmitted traffic values of the proposed matching approach in one time slot under episode lengths of 500, 1000, 1500, 2000, 2500, and 3000. When the episode length does not exceed 2000, the transmitted traffic value in one time slot increases significantly with the episode length. However, once the episode length exceeds 2000, the transmitted traffic values in one time slot are essentially the same across episode lengths. Thus, the episode length is set to 2000 in subsequent simulations, saving computational resources while ensuring performance. Furthermore, for any episode length, the transmitted traffic value first increases and then remains essentially stable, which validates the convergence of the proposed matching algorithm.

Figure 4 illustrates the variation of the relative mean square error of the Q-values obtained by the target network and the main network under learning rates of 0.15, 0.1, 0.08, and 0.05. As the learning rate β increases from 0.05 to 0.1, the relative mean square error decreases faster. However, as the learning rate continues to increase from 0.1 to 0.15, the rate of decrease remains almost unchanged while the fluctuations grow. Therefore, in order to balance convergence speed and stability, we set the learning rate β to 0.1 in subsequent simulations.

Figure 5 illustrates the total transmitted traffic values of the different algorithms under varying HAP drone transmission powers. With increasing transmission power, the total transmitted traffic values of all algorithms increase. This is because, according to (2), increasing the transmission power of HAP drones improves the traffic transmission rates, thereby increasing the total transmitted traffic value of the space–air–ground integrated network. From Figure 5 we can also see that the proposed multi-agent deep RL algorithm performs best: since it uses centralized training and decentralized execution to reduce the interference of non-stationary environments among agents, it achieves a higher transmitted traffic value than DDPG and DQN. Furthermore, all three RL-based algorithms outperform the greedy methods for the following two reasons.

• Greedy 1 aims to improve the transmission rate between LEO satellites and HAP drones by choosing HAP drones with higher channel gains, thereby increasing the total transmitted traffic value. Similarly, Greedy 2 focuses on reducing the handover latency by choosing HAP drones with long remaining visible time, thereby extending the available traffic transmission time of HAP drones. In contrast, the RL-based algorithms take a more comprehensive perspective by jointly considering multi-dimensional characteristics such as remaining visible time, channel condition, handover latency, and traffic storage capacity. Thus, the RL-based algorithms improve the total transmitted traffic value of the network from a global perspective, surpassing Greedy 1 and Greedy 2.
• Both Greedy 1 and Greedy 2 rely on static matching rules, which fail to account for the randomness of traffic generation at HAP drones. In contrast, the RL-based algorithms can learn the randomness of the traffic generation at HAP drones and adapt the matching strategy accordingly.
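For concreteness, the top-L selection shared by both greedy baselines can be sketched as follows; the function and variable names are illustrative assumptions, and only the ranking metric differs between the two methods.

```python
import numpy as np

def greedy_match(metric_per_drone, L):
    """Return the indices of the L visible HAP drones with the largest
    metric: channel gain for Greedy 1, remaining visible time for Greedy 2."""
    order = np.argsort(np.asarray(metric_per_drone))[::-1]  # descending
    return sorted(order[:L].tolist())

# Greedy 1 would rank by per-drone channel gains, e.g.:
channel_gains = [0.2, 0.9, 0.5, 0.1]
# Greedy 2 would rank by per-drone remaining visible times, e.g.:
visible_times = [30.0, 5.0, 20.0, 40.0]
```

Because each baseline optimizes a single static metric, it cannot trade off channel quality against visibility or traffic randomness, which is exactly what the RL-based approaches learn to do.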
Figure 6 illustrates the total transmitted traffic values of the different algorithms with respect to the LEO satellite beam number L. As the number of LEO satellite beams increases, the total transmitted traffic values of all algorithms also increase. This is because increasing the number of LEO satellite beams relaxes constraint (10c), allowing more HAP drones to transmit traffic to LEO satellites simultaneously and thus increasing the total transmitted traffic value of the space–air–ground integrated network. From Figure 6, we can see that the proposed multi-agent deep reinforcement learning algorithm performs best, since each LEO satellite can learn from the experience of the other LEO satellites. Furthermore, all three RL-based algorithms outperform the greedy methods for the same reasons discussed for Figure 5.

Future Work
Although the proposed approach can effectively address the tripartite matching problem among LEO satellites, HAP drones, and traffic types, there are some limitations.

Matching among Various Network Nodes
In this paper, only the matching problem between HAP drones and LEO satellites is considered. However, the space–air–ground integrated network also contains a variety of other network nodes, such as ground users, gateway stations, and geostationary earth orbit satellites. In the future, it is necessary to investigate the matching relationships among these different nodes to improve the topology of the space–air–ground integrated network. For example, the matching problem between ground users and HAP drones should be addressed by comprehensively considering multiple factors such as the locations, movement speeds, and service requirements of ground users as well as the payloads of HAP drones.

Computing Task Assignment and Resource Allocation
Our research only considers user access and traffic backhaul in remote areas where ground base stations are difficult to deploy. However, in addition to serving remote areas, HAP drones can also provide low-latency edge computing services for IoT devices in urban areas with ground base station coverage. In the future, the great pressure that computation-intensive applications place on resource-constrained IoT devices with limited computing capability and energy storage can be alleviated by offloading latency-sensitive computing tasks to nearby edge nodes. A matching strategy for ground users, HAP drones, and ground base stations should be developed by jointly optimizing computing task assignment and resource allocation, thereby improving the performance of the space–air–ground integrated network, for example by minimizing the maximum task execution latency among IoT devices or maximizing the amount of transmitted traffic per unit time.

HAP Drone Localization
The positions of HAP drones are assumed to be stationary and known in our paper. In practice, however, the positions of HAP drones constantly change due to jitter. Only with the exact locations of HAP drones can we accurately calculate the distance between a HAP drone and an LEO satellite, the remaining visible time, and the channel capacity. Therefore, the exact locations of HAP drones are essential for making the user access and traffic backhaul strategy of the space–air–ground integrated network. In the future, the HAP drone localization problem needs to be solved. Additional positioning systems can be employed to estimate the exact locations of HAP drones. For example, reinforcement learning-based algorithms can be used to regularly predict the exact location of a HAP drone from atmospheric data such as wind speed.

Conclusions
In this paper, the matching problem between HAP drones and LEO satellites in the space–air–ground integrated network has been investigated. First, we introduced the network architecture and working mechanism, including the traffic generation model and the traffic transmission model. Then, a tripartite matching problem that comprehensively considers multi-dimensional characteristics was formulated to maximize the average transmitted traffic value of the network. Through mathematical simplification, the optimization problem was decomposed into two independent sub-problems: traffic–drone matching and LEO–drone matching. The former can be further decoupled into multiple independent and easily solvable sub-subproblems. Considering the mixed stochastic and deterministic traffic generation model, the long propagation latency between LEO satellites and HAP drones, and the continuous state space, we proposed a multi-agent deep reinforcement learning approach with centralized training and decentralized execution to solve the LEO–drone matching problem. In this approach, the value network is trained in a centralized manner at the terrestrial traffic center, and the matching strategy is formulated in a timely, decentralized manner at the LEO satellites. Finally, the proposed approach was compared with multiple state-of-the-art algorithms through simulations, and the results demonstrate the effectiveness and efficiency of the proposed algorithm.

Figure 1. Network architecture of the space–air–ground integrated network.

Figure 2. Traffic generation model for each HAP drone.

Figure 3. Transmitted traffic values in one time slot with different episode lengths.

Figure 4. Mean square error of the Q-values obtained by the target network and the main network.

Figure 5. Total transmitted traffic value under different HAP drone transmission powers.

Figure 6. Total transmitted traffic value under different beam numbers.