1. Introduction
Densely deployed ground communication infrastructures can provide access services for mobile and Internet of Things (IoT) users in urban areas, with the advantages of high data rates and low propagation delay. However, deploying infrastructure in remote areas such as the ocean and the desert is challenging and expensive, so various applications in these areas, such as forest monitoring, desert communication, and maritime logistics, are difficult to serve [1,2]. There are still approximately three billion people around the world living without Internet access, which presents an obstacle for 6G in realizing seamless connectivity and ubiquitous access [3,4]. How to achieve user access and traffic backhaul for mobile and IoT users in remote areas has therefore become a crucial question [5].
Satellite communication makes up for the shortcomings of terrestrial networks and provides users with large-scale access services and wide coverage. The utilization of low earth orbit (LEO) satellites for global Internet access and traffic backhaul has garnered attention due to their lower development and launch costs and lower transmission latency compared with geostationary earth orbit (GEO) and medium earth orbit (MEO) satellites [6]. Inter-satellite links (ISLs) enable the traffic generated by ground IoT users to be relayed among LEO satellites and transmitted back to the terrestrial traffic center [7]. However, the severe path loss between LEO satellites and ground IoT users makes it difficult for users with limited transmission power to access LEO satellites directly.
In order to reduce the demand for user-side transmission power, the space–air–ground integrated network has attracted considerable attention from academia and industry in 6G [8,9]. Compared with the orbital altitudes of hundreds or thousands of kilometers of LEO satellites, the altitude of drones is much lower, so they require less transmission power from ground IoT users [10]. In the space–air–ground integrated network, drones in the air are utilized to support user access with lower transmission energy costs, and satellites in space are used to provide traffic backhaul with global coverage [11]. They work together with communication infrastructures on the ground to provide users with various application services. In recent years, a category of drones that can provide users with more stable access, namely high altitude platform (HAP) drones, has become a research hotspot. Different from traditional drones, HAP drones hover at an altitude of about 20 km in the stratosphere, with base stations deployed on them to provide users with ubiquitous and stable access. HAP drones can extend communication capabilities across the space, air, and ground domains. Specifically, aerial networks composed of HAP drones are utilized to support user access and collect traffic generated by users in remote or inaccessible areas lacking communication infrastructure. LEO satellites are then used to support traffic backhaul to the terrestrial traffic center, thus supplying stable access and traffic backhaul [12].
Owing to their low deployment cost, flexible on-demand deployment, and reliable line-of-sight communication links, HAP drones have been employed in satellite–ground networks for user access, traffic backhaul, and task execution. However, practical issues in the space–air–ground integrated network have been overlooked in existing research. First, due to the high mobility of LEO satellites, HAP drones need to be handed over between different LEO satellites, so the calculation of the available traffic transmission time of an HAP drone must jointly consider the remaining visible time and the handover latency. Furthermore, different traffic types generated at HAP drones hold different values, suggesting that matchings for high-value traffic types should be established first. Lastly, the assumption in existing research of a constant traffic generation state at HAP drones does not align with practice, where traffic generation mixes stochastic and deterministic components, rendering conventional static matching algorithms inapplicable [13].
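To make the first point concrete, a minimal illustrative relation (the notation here is ours, not taken from the paper) is that the time actually available for transmission within a visibility window is the remaining visible time reduced by the handover latency:

$$ T_{\mathrm{avail}} = \max\!\left(0,\; T_{\mathrm{visible}} - T_{\mathrm{handover}}\right), $$

so a drone that must be handed over soon contributes little usable transmission time even if its channel is good.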
Therefore, in order to address the issues mentioned above, a tripartite matching problem among LEO satellites, HAP drones, and traffic types is investigated for the space–air–ground integrated network in this paper. Specifically, the main contributions of this paper are as follows:
First, the network architecture and working mechanism of the space–air–ground integrated network are introduced, which aim at achieving user access and traffic backhaul in remote areas. Different from the conventional static traffic generation state with deterministic variables, the traffic generation state at HAP drones is modeled as a mixture of stochasticity and determinism, which aligns with real-world scenarios.
Then, different from conventional schemes that treat all traffic types as equally important, we formulate a tripartite matching problem among LEO satellites, HAP drones, and traffic types based on the different values of the traffic types. The problem can be decoupled into two sub-problems: traffic–drone matching and LEO–drone matching. Traffic–drone matching is simplified into multiple separate sub-subproblems through mathematical analysis, which can be addressed independently. LEO–drone matching cannot be solved by conventional optimization solvers since the traffic generation state at drones is a mixture of stochasticity and determinism; thus, reinforcement learning is adopted. Moreover, due to the significant propagation latency between the terrestrial traffic center and LEO satellites, a conventional centralized scheme cannot obtain the latest status of the network and therefore cannot devise LEO–drone matching strategies in a timely manner. In addition, the state space of the LEO–drone matching sub-problem is continuous. Therefore, a multi-agent deep reinforcement learning approach with centralized training and decentralized execution is proposed, in which the value network is trained centrally at the terrestrial traffic center and the LEO–drone matching strategy is devised at the LEO satellites in a timely, decentralized manner.
Finally, the convergence performance of the proposed matching approach is discussed and analyzed through simulations. In addition, the proposed algorithm is compared with state-of-the-art algorithms under different network parameters to validate its effectiveness.
The rest of the paper is organized as follows. Related works are discussed in Section 2. The system model and working mechanism are illustrated in Section 3. Section 4 formulates and simplifies the tripartite matching problem. In Section 5, the formulated problem is solved by the multi-agent deep reinforcement learning algorithm. Simulation results are presented and discussed in Section 6. Future work is summarized in Section 7. Finally, conclusions are drawn in Section 8.
2. Related Works
Abbasi et al. first presented the potential use cases, open challenges, and possible solutions of HAP drones for next-generation networks [14]. The main communication links between HAP drones and other non-terrestrial network (NTN) platforms, along with their advantages and challenges, are presented in [15]. Due to the rapid movement of LEO satellites, the matching relationship between HAP drones and LEO satellites is not fixed, so efficient matching and association strategies need to be developed. In [16], the matching relationship among user equipment (UE), HAP drones, and terrestrial base stations (BSs) is formulated as a mixed discrete–continuous optimization problem under HAP drone payload connectivity constraints, HAP drone and BS power constraints, and backhaul constraints to maximize the network throughput; the formulated problem is solved using a combination of integer linear programming and generalized assignment problems. A deep Q-learning (DQL) approach is proposed in [17] to perform user association between a terrestrial base station and an HAP drone based on the channel state information of the previous time slot. In addition to the above-mentioned UE selection between terrestrial and non-terrestrial networks, there has been relevant research on the three-party matching problem among users, HAP drones, and satellites in remote areas without terrestrial network coverage. In [18], the matching problem among users, HAP drones, and satellites is formulated to maximize the total revenue and is solved by a satellite-oriented restricted three-sided matching algorithm. In [19], a throughput maximization problem is formulated for ground users in an integrated satellite–aerial–ground network by jointly optimizing user association, transmission power, and unmanned aerial vehicle (UAV) trajectory. In [20], a UAV–LEO integrated traffic collection network is proposed to maximize the uploaded traffic volume while constraining the energy consumption by jointly considering bandwidth allocation, UAV trajectory design, power allocation, and LEO satellite selection. The maximum computation delay among terminals is minimized in [21] by jointly optimizing the matching relationship, resource allocation, and deployment location; an alternating optimization algorithm based on block coordinate descent and successive convex approximation is proposed to solve it. A joint association and power allocation approach is proposed for the space–air–ground network in [22] to maximize the transmitted traffic amount while minimizing the transmit power under the constraints of the power budget and quality of service (QoS) requirements of HAP drones and the data storage and visibility time of LEO satellites; the association problem and the power allocation problem are alternately addressed by the GUROBI optimizer and the whale optimization algorithm, respectively.
It is worth mentioning that reinforcement learning (RL) algorithms are widely used for HAP drone problems. HAP drones form a distributed network, and with multi-agent RL, the space–air–ground integrated network can effectively become self-organizing. In [23], a multi-agent Q-learning approach is proposed to tackle the service function chain placement problem for LEO satellite networks in a discrete-time stochastic control framework, thus optimizing the long-term system performance. In [24], a multi-agent deep reinforcement learning algorithm with global rewards is proposed to optimize the transmit power, CPU frequency, bit allocation, offloading decision, and bandwidth allocation in a decentralized manner, thus achieving computation offloading and resource allocation for the LEO satellite edge computing network. In [25], the utility of HAP drones is maximized by jointly optimizing association and resource allocation, which is formulated as a Stackelberg game; the formulated problem is transformed into a stochastic game model, and a multi-agent deep RL algorithm is adopted to solve it.
4. Problem Formulation and Transformation
The optimization objective is to establish tripartite matching among LEO satellites, HAP drones, and traffic types by choosing the most suitable LEO–drone matching matrix and traffic–drone matching matrix in each time slot, so as to maximize the average transmitted traffic value of the network. The objective function and constraints can be formulated as follows:
Constraint (10b) specifies that each HAP drone can connect to at most one LEO satellite in each time slot. Constraint (10c) specifies that the number of HAP drones served by each LEO satellite equals the beam number L. Note that although each LEO satellite could serve fewer than L HAP drones, this would lead to inefficient use of satellite beams; thus, in order to maximize the average transmitted traffic value of the network in each time slot, all beams of each satellite are utilized. Constraint (10d) specifies that each HAP drone can transmit at most one traffic type in each time slot. Constraints (10e) and (10f) restrict the elements of the LEO–drone matching matrix and the traffic–drone matching matrix, respectively.
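For readability, one plausible algebraic form of these constraints, written in our own notation with $x_{i,j}^{m}$ denoting the LEO–drone matching variable and $y_{i,k}^{m}$ the traffic–drone matching variable (an assumption, since the paper's exact symbols are not reproduced here), is:

$$ \sum_{j} x_{i,j}^{m} \le 1 \quad \forall i,m, \qquad \sum_{i} x_{i,j}^{m} = L \quad \forall j,m, $$
$$ \sum_{k} y_{i,k}^{m} \le 1 \quad \forall i,m, \qquad x_{i,j}^{m} \in \{0,1\}, \qquad y_{i,k}^{m} \in \{0,1\}, $$

corresponding to (10b)-(10f) in order.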
The formulated problem (10a) is a mixed integer nonlinear programming problem. In the following, we analyze and simplify it. Given a specific LEO–drone matching matrix and substituting (9) into (10a), the original problem becomes the following:
Through analysis, it becomes evident that each bracketed term in (11a) depends solely on the matching between the i-th HAP drone and all traffic types in the m-th time slot, and is independent of the matching between other HAP drones and traffic types in other time slots. Consequently, maximizing (11a) can be achieved by maximizing each term within the brackets of (11a). Thus, (11a) can be rephrased as follows:
Formulation (12a) is equivalent to optimizing a set of independent sub-subproblems, each of which can be formulated as follows:
Its feasible region can be expressed as follows:
or
Regarding the former, the optimal value of (13a) is 0, whereas for the latter, the optimal value is greater than or equal to 0. Hence, the optimal solution of (13a) must adhere to (15), so as to maximize the objective function. By substituting (15) into (13a), it is equivalent to addressing the following:
Its optimal solution can be expressed as follows:
Based on this, the optimal solution of (11a) can be expressed as follows:
At this point, we have decomposed the optimization sub-problem (11a) into independent sub-subproblems through mathematical analysis. The optimal traffic–drone matching matrix can be obtained according to (18). Intuitively, once the LEO–drone matching of each time slot is determined, the maximum average transmitted traffic value of the network is achieved by choosing, for each HAP drone, the traffic type with the highest value to transmit; a minimal code sketch of this selection rule is given below.
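The following Python sketch illustrates that per-drone selection rule. It is not the paper's implementation: the variable names and the simple value model (value per type times stored volume) are our assumptions.

```python
import numpy as np

def select_traffic_types(type_values, stored_volume, connected):
    """For each HAP drone that is connected to an LEO satellite, pick the
    traffic type with the highest transmittable value (illustrative form of
    the rule in (18)); names and the value model are assumptions."""
    num_drones, num_types = stored_volume.shape
    y = np.zeros((num_drones, num_types), dtype=int)  # traffic-drone matching
    for i in range(num_drones):
        if not connected[i]:
            continue  # a drone without an LEO link transmits nothing
        candidate = type_values * stored_volume[i]     # value of each traffic type
        k_best = int(np.argmax(candidate))
        if candidate[k_best] > 0:
            y[i, k_best] = 1  # at most one traffic type per drone, cf. (10d)
    return y
```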
Substituting the optimal solution (18) into the objective function (10a) yields the following:
which is solely associated with the LEO–drone matching matrix.
5. Problem Solving and Algorithm Design
Typically, conventional optimization solvers are employed to solve problems with deterministic variables [29]; problems with random variables are difficult to solve using such solvers. Nevertheless, the tripartite matching problem that this paper focuses on is a mixture of stochasticity and determinism. Therefore, we adopt reinforcement learning to dynamically solve the LEO–drone matching sub-problem (19a). Specifically, the matching between each LEO satellite and the HAP drones is modeled as a Markov decision process [30], where each LEO satellite is treated as an agent. The state, action, and reward of the j-th LEO satellite are defined as follows:
State:
In the m-th time slot, the j-th LEO satellite obtains the state of each HAP drone within its visible range, which includes the available traffic transmission time and the traffic transmission rate of the current m-th time slot, as well as the stored traffic volume and the traffic generation rate of each traffic type in the previous (m-1)-th time slot. For an HAP drone that is not within the visible range of the j-th LEO satellite, the corresponding state components are set to fixed values.
Action:
In the m-th time slot, the action of the j-th LEO satellite is to determine which L HAP drones to serve. If multiple LEO satellites decide to provide services to the same HAP drone, this HAP drone actively chooses to connect to the LEO satellite to which it can transmit the highest traffic value.
Reward:
In the m-th time slot, the reward obtained by the j-th LEO satellite after taking action in state is defined as the total transmitted traffic value of the j-th LEO satellite in the current time slot.
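As a minimal sketch of how such a per-satellite state could be assembled in practice, the following Python fragment stacks the per-drone quantities listed above into one observation vector. The field names, the zero-filling of drones outside the visible range, and the flat-vector encoding are our assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DroneObservation:
    """Per-drone quantities observed by one LEO satellite (names are ours)."""
    available_tx_time: float     # remaining visible time minus handover latency
    tx_rate: float               # achievable drone-to-LEO rate in this slot
    stored_volume: np.ndarray    # stored traffic volume per type (previous slot)
    generation_rate: np.ndarray  # traffic generation rate per type (previous slot)

def build_state(observations, visible_mask):
    """Stack per-drone observations into the satellite's state vector,
    zero-filling entries for drones outside the visible range (assumption)."""
    rows = []
    for obs, visible in zip(observations, visible_mask):
        if visible:
            rows.append(np.concatenate(([obs.available_tx_time, obs.tx_rate],
                                        obs.stored_volume, obs.generation_rate)))
        else:
            rows.append(np.zeros(2 + 2 * len(obs.stored_volume)))
    return np.concatenate(rows)
```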
Then, reinforcement learning is employed to solve (19a) based on the above definitions. The discounted return of the j-th LEO satellite in the m-th time slot is defined as follows:
where the discount rate is used to balance the impact of short-term and long-term rewards. If the discount rate is close to 0, the discounted return mainly depends on recent rewards; conversely, if it approaches 1, the discounted return primarily depends on future rewards. Q-values can be used to evaluate the expected return that the j-th LEO satellite can achieve by taking a given action under a given policy in a given state, which can be expressed as follows:
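For reference, the standard forms of these two quantities, written in our own notation (the paper's exact symbols are not reproduced here), are:

$$ G_{m}^{j} = \sum_{k=0}^{\infty} \gamma^{k}\, r_{m+k}^{j}, \qquad
Q_{\pi}^{j}(s,a) = \mathbb{E}_{\pi}\!\left[\, G_{m}^{j} \,\middle|\, s_{m}^{j}=s,\ a_{m}^{j}=a \,\right], $$

where $\gamma \in [0,1)$ is the discount rate and $r_{m+k}^{j}$ is the reward of the j-th LEO satellite k slots ahead.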
In conventional Q-learning, the Q-values of the optimal policy can be continuously updated through iterations. Generate an episode of a given length; for the t-th iteration, the Q-value of the state-action pair can be obtained as follows [31]:
where the update involves the state at the t-th step of the episode, the action taken in that state, the learning rate, and the action space of the j-th LEO satellite. The average one-step immediate reward acquired after taking the action in the state can be represented as follows:
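As a point of reference, the classical tabular Q-learning update takes the following standard form (our symbols; the paper's exact expression may differ, for instance by using the average one-step reward defined above):

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in \mathcal{A}_j} Q(s_{t+1}, a') - Q(s_t, a_t) \right], $$

where $\alpha$ is the learning rate and $\mathcal{A}_j$ is the action space of the j-th LEO satellite.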
Supposing that the proposed approach converges after C iterations, the optimal policy can be expressed as follows [32]:
The aforementioned conventional Q-learning algorithm stores the calculated Q-values in the form of a table, known as a Q-table, which has the advantage of being intuitive and easy to analyze. However, due to the continuous state space of (19a), using the conventional tabular Q-learning algorithm requires storing a large volume of data, thereby increasing storage costs. Furthermore, the generalization ability of the conventional Q-learning algorithm is poor. To address these issues, a deep Q-learning algorithm is employed in this paper, which is one of the earliest and most successful algorithms to introduce deep neural networks into reinforcement learning [32]. In deep Q-learning, the high-dimensional Q-table is approximated by a deep Q network with low-dimensional parameters, thereby significantly reducing the storage cost. In addition, the Q-values of unvisited state-action pairs can be calculated through value function approximation, giving the algorithm strong generalization ability.
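A minimal sketch of such a value network is given below in PyTorch. The layer sizes, the flat state encoding, and the choice of one output per candidate action are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Value network approximating the Q-values of one LEO satellite agent.
    Hidden sizes and the action encoding are illustrative assumptions."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per candidate action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```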
In addition, the aforementioned algorithm is fully decentralized: each satellite calculates its Q-values according to its own local states, local actions, and local rewards. However, LEO satellites are not completely independent but influence each other. For example, if the i-th HAP drone is connected to the j-th LEO satellite at the current moment, other LEO satellites cannot provide service for this HAP drone. Therefore, a fully decentralized reinforcement learning algorithm cannot achieve high performance and may not even converge in some cases. An alternative is a fully centralized reinforcement learning algorithm: in each time slot, each LEO satellite sends the experience obtained from its interaction with the environment to the terrestrial traffic center, and both value network training and strategy making are performed at the center based on the global experiences. Nevertheless, the experience of each satellite must pass through multiple ISLs, an LEO-ground link, and an optical fiber link to reach the terrestrial traffic center, incurring high propagation latency. The terrestrial traffic center is therefore unable to obtain the latest status of the space–air–ground integrated network and cannot make timely LEO–drone matching strategies. To address these issues, we employ multi-agent deep reinforcement learning with centralized training and decentralized execution. The value network of each LEO satellite is trained in a centralized manner at the terrestrial traffic center, and the trained value networks are then distributed to the corresponding LEO satellites [33]. Each satellite trains its policy network in a distributed manner based on the received value network and the latest local observations, so it can devise LEO–drone matching strategies in a timely manner.
Specifically, when training the value network, each LEO satellite sends the local experience obtained from its interaction with the environment to the terrestrial traffic center, where the final element of the experience is the state reached after taking the chosen action in the current state. Based on the collected local experiences of the various LEO satellites, the terrestrial traffic center forms the global experience, including the global state, the global action, and the global reached state, and stores it in the replay buffer. Afterwards, the terrestrial traffic center trains the value network of the j-th LEO satellite based on the replay buffer to evaluate the quality of the matching approach. As previously mentioned, the deep Q-learning algorithm is adopted, in which the true Q-values of the optimal strategy are approximated by the Q-values calculated by the trained value network; this is achieved through the quasi-static target network scheme [34]. Specifically, two networks need to be defined: the target network and the main network, described by their respective parameters, where S and A are the global states and global actions collected by the terrestrial traffic center in the form of random variables. The objective of the parameter iteration is to minimize the mean square error between the Q-values calculated by the target network and those calculated by the main network. This can be achieved by minimizing the loss function, which can be expressed as follows:
where the remaining symbols represent the reached state and the acquired reward after taking action A in state S, respectively. The gradient-descent algorithm is then adopted to minimize the objective function. The gradient of (25) can be calculated as follows:
where the inner gradient term can be obtained through the gradient back-propagation algorithm [35]. In each iteration, an experience batch is randomly sampled from the replay buffer to train the value network. For each sample in the batch, the parameter of the main network is updated as follows:
where the step size is the learning rate of the value network update. After a given number of iterations, the parameter of the target network is updated by being set equal to the parameter of the main network.
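The training step described above can be sketched in PyTorch as follows. This is our own minimal illustration of the quasi-static target network scheme, not the paper's code; the batch size, discount factor, and buffer layout are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def train_step(main_net, target_net, optimizer, replay_buffer,
               batch_size=64, gamma=0.99):
    """One centralized training iteration of a satellite's value network:
    sample a batch, form targets with the quasi-static target network, and
    take a gradient step on the mean square error (illustrative sketch)."""
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states = map(torch.stack, zip(*batch))
    # Q-values of the taken actions from the main network
    q = main_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
    # bootstrapped targets from the (frozen) target network
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q, target)   # mean square error between the two networks' Q-values
    optimizer.zero_grad()
    loss.backward()                # gradients via back-propagation
    optimizer.step()
    return loss.item()

def sync_target(main_net, target_net):
    """Periodic hard update: copy the main network parameters into the target network."""
    target_net.load_state_dict(main_net.state_dict())
```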
Algorithm 1 presents the matching algorithm based on multi-agent deep reinforcement learning, in which an ε-greedy strategy is used to balance exploitation and exploration. The value network of each LEO satellite is centrally trained at the terrestrial traffic center based on the global states, the global actions, and the local reward of each LEO satellite. Then, the trained value network is sent to the corresponding LEO satellite. At the j-th LEO satellite, its policy network is trained in a decentralized manner based on the received value network and its local observations. Afterwards, each LEO satellite develops its own optimal strategy based on its trained policy network to maximize the long-term return. Finally, each LEO satellite broadcasts the matching strategy to all HAP drones within its visible range.
Algorithm 1 Matching approach based on multi-agent deep reinforcement learning
Input: Episode length, learning rate, greedy factor, discount factor, and iteration number; randomly initialize the parameters and states, and set the initial values of the auxiliary variables;
Output: Optimal strategy for each LEO satellite
1: for each step of the episode do
2:   for j = 1 to J do
3:     The j-th LEO satellite takes an action according to the ε-greedy strategy, where the optimal action is the one that maximizes the Q-value given by its value network;
4:     Interact with the environment to get the rewards and the reached states;
5:   end for
6:   Form the global state, the global action, and the global reached state;
7:   for j = 1 to J do
8:     Store the global experience into the replay buffer;
9:     Randomly sample an experience batch from the replay buffer;
10:    Update the main network parameters based on the sampled batch according to (27);
11:  end for
12:  Update the iteration counter;
13:  if the target network update condition is satisfied then
14:    Update the target network parameter of each LEO satellite;
15:    Reset the iteration counter;
16:  end if
17:  for j = 1 to J do
18:    Send the trained value network to the j-th LEO satellite;
19:    The j-th LEO satellite trains its own policy network based on the received value network and its local observations;
20:    Develop the optimal strategy of the j-th LEO satellite based on its trained policy network;
21:  end for
22: end for
6. Simulation Results
In order to verify the effectiveness of the proposed matching algorithm, preliminary simulations are conducted. The main simulation parameters are listed in Table 1. We compare the proposed approach with some state-of-the-art algorithms, including deep deterministic policy gradient (DDPG), deep Q-network (DQN), and two greedy methods.
For the first greedy method (abbreviated as Greedy 1), each LEO satellite will choose the L HAP drones with the highest channel gains within its visible range to establish connections.
For the second greedy method (abbreviated as Greedy 2), each LEO satellite will choose the L HAP drones with the longest remaining visible time within its visible range to establish connections.
For both Greedy 1 and Greedy 2, each HAP drone that has established a connection with an LEO satellite will choose the traffic type with the largest transmitted traffic value for transmission.
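The common selection step of the two greedy baselines can be sketched as follows in Python; the function and variable names are ours, and conflicts between satellites choosing the same drone are ignored for brevity.

```python
import numpy as np

def greedy_selection(metric, visible_mask, L):
    """Select, for one LEO satellite, the L visible HAP drones with the largest
    metric: pass channel gains for Greedy 1 or remaining visible time for
    Greedy 2 (illustrative sketch; ties and inter-satellite conflicts ignored)."""
    scores = np.where(visible_mask, metric, -np.inf)  # exclude invisible drones
    order = np.argsort(scores)[::-1]                  # sort descending by metric
    return order[:L]                                  # indices of the chosen drones
```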
Figure 3 illustrates the transmitted traffic values of the proposed matching approach in one time slot under episode lengths of 500, 1000, 1500, 2000, 2500, and 3000. When the episode length does not exceed 2000, the transmitted traffic value in one time slot increases significantly as the episode length grows. However, when the episode length exceeds 2000, the transmitted traffic values in one time slot are essentially the same for the various episode lengths. Thus, the episode length is set to 2000 in subsequent simulations, thereby saving computational resources while ensuring performance. Furthermore, it can be observed that for any episode length, the transmitted traffic value first increases and then remains essentially stable, which validates the convergence of the proposed matching algorithm.
Figure 4 illustrates the variation of the relative mean square error between the Q-values obtained by the target network and the main network under learning rates of 0.15, 0.1, 0.08, and 0.05. As the learning rate increases from 0.05 to 0.1, the rate of decrease of the relative mean square error accelerates. Nevertheless, as the learning rate continues to increase from 0.1 to 0.15, the rate of decrease remains almost unchanged while the fluctuations grow. Therefore, in order to balance convergence speed and stability, we set the learning rate to 0.1 in subsequent simulations.
Figure 5 illustrates the total transmitted traffic values of the different algorithms under varying HAP drone transmission powers. It can be seen that as the transmission power increases, the total transmitted traffic values of all algorithms increase. This is because, according to (2), increasing the transmission power of HAP drones improves the traffic transmission rates, thereby increasing the total transmitted traffic value of the space–air–ground integrated network. From Figure 5, we can see that the proposed multi-agent deep RL algorithm performs best. Since multi-agent deep RL utilizes centralized training and decentralized execution to reduce the interference of non-stationary environments among agents, the proposed algorithm achieves a higher transmitted traffic value than DDPG and DQN. Furthermore, all three RL-based algorithms perform better than the greedy methods for the following two reasons.
Greedy 1 aims to improve the transmission rate between LEO satellites and HAP drones by choosing HAP drones with higher channel gains, thereby increasing the total transmitted traffic value. Similarly, Greedy 2 focuses on reducing the handover latency by choosing HAP drones with long remaining visible time, thereby improving the available traffic transmission time of HAP drones, so as to increase the total transmitted traffic value. In contrast, the RL-based algorithms take a more comprehensive perspective by jointly considering multi-dimensional characteristics such as remaining visible time, channel condition, handover latency, and traffic storage capacity. Thus, the RL-based algorithms can improve the total transmitted traffic value of the network from a global perspective, surpassing the performance of Greedy 1 and Greedy 2.
Both Greedy 1 and Greedy 2 rely on static matching algorithms, which fail to account for the randomness of traffic generation at HAP drones. In contrast, the RL-based algorithms can learn the randomness of the traffic generation at HAP drones and make the matching strategy based on this learning.
Figure 6 illustrates the total transmitted traffic values of the different algorithms with respect to the LEO satellite beam number L. As the number of LEO satellite beams increases, the total transmitted traffic values of all algorithms also increase. This is because increasing the number of LEO satellite beams relaxes constraint (10c), thereby allowing more HAP drones to transmit traffic to LEO satellites simultaneously and increasing the total transmitted traffic value of the space–air–ground integrated network. From Figure 6, we can see that the proposed multi-agent deep reinforcement learning algorithm performs best since it can learn from the experience of the other LEO satellites. Furthermore, all three RL-based algorithms perform better than the greedy methods for the same reasons as in Figure 5.
7. Future Work
Although the proposed approach can effectively address the tripartite matching problem among LEO satellites, HAP drones, and traffic types, there are some limitations.
7.1. Matching among Various Network Nodes
In this paper, only the matching problem between HAP drones and LEO satellites is considered. However, in the space–air–ground integrated network, in addition to HAP drones and LEO satellites, there are also a variety of network nodes, such as ground users, gateway stations, and geostationary earth orbit satellites. In the future, it is necessary to investigate the matching relationships among different nodes to improve the topology of the space–air–ground integrated network. For example, the matching problem between ground users and HAP drones should be addressed by comprehensively considering multiple factors such as the location, movement speed, and service requirements of ground users and the payloads of HAP drones.
7.2. Computing Task Assignment and Resource Allocation
Our research only considers how to perform user access and traffic backhaul in remote areas where ground base stations are difficult to deploy. However, in addition to serving remote areas, HAP drones can also provide low-latency edge computing services for IoT devices in urban areas with ground base station coverage. In the future, the pressure that computing-intensive applications place on resource-constrained IoT devices with limited computing capability and energy storage can be alleviated by offloading latency-sensitive computing tasks to nearby edge nodes. A matching strategy for ground users, HAP drones, and ground base stations should be developed by jointly optimizing computing task assignment and resource allocation, thereby improving the performance of the space–air–ground integrated network, for example by minimizing the maximum task execution latency among IoT devices or maximizing the amount of transmitted traffic per unit time.
7.3. HAP Drone Localization
The positions of HAP drones are assumed to be stationary and known in our paper. However, in practice the positions of HAP drones constantly change due to jitter. Only with the exact locations of HAP drones can we accurately calculate the drone-satellite distance, the remaining visible time, and the channel capacity. Therefore, the exact location of each HAP drone is essential for devising the user access and traffic backhaul strategy of the space–air–ground integrated network. In the future, the HAP drone localization problem needs to be solved; additional positioning systems can be introduced to estimate the exact locations of HAP drones. For example, reinforcement learning-based algorithms can be used to regularly predict the exact location of an HAP drone from atmospheric data such as wind speed.