Intelligent Routing Optimization via GCN-Transformer Hybrid Encoder and Reinforcement Learning in Space–Air–Ground Integrated Networks

Liu, Jinling; Li, Song; Li, Xun; Zhang, Fan; Wang, Jinghan

doi:10.3390/electronics15010014

by Jinling Liu ¹, Song Li ²,*, Xun Li ², Fan Zhang ² and Jinghan Wang ¹

¹ Graduate School, Air Force Engineering University, Xi’an 710051, China

² Air Defense and Antimissile School, Air Force Engineering University, Xi’an 710051, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(1), 14; https://doi.org/10.3390/electronics15010014 (registering DOI)

Submission received: 25 November 2025 / Revised: 15 December 2025 / Accepted: 17 December 2025 / Published: 19 December 2025

Abstract

The Space–Air–Ground Integrated Network (SAGIN), a core architecture for 6G, faces formidable routing challenges stemming from its high-dynamic topological evolution and strong heterogeneous resource characteristics. Traditional protocols like OSPF suffer from excessive convergence latency due to frequent topology updates, while existing intelligent methods such as DQN remain confined to a passive reactive decision-making paradigm, failing to leverage spatiotemporal predictability of network dynamics. To address these gaps, this study proposes an adaptive routing algorithm (GCN-T-PPO) integrating a GCN-Transformer hybrid encoder, Particle Swarm Optimization (PSO), and Proximal Policy Optimization (PPO) with spatiotemporal attention. Specifically, the GCN-Transformer encoder captures spatial topological dependencies and long-term temporal traffic evolution, with PSO optimizing hyperparameters to enhance prediction accuracy. The PPO agent makes proactive routing decisions based on predicted network states (next K time steps) to adapt to both topological and traffic dynamics. Extensive simulations on real dataset-parameterized environments (CelesTrak TLE data, CAIDA 100G traffic statistics, CRAWDAD UAV mobility models) demonstrate that under 80% high load and bursty Pareto traffic, GCN-T-PPO reduces end-to-end latency by 42.4% and packet loss rate by 75.6%, while improving QoS satisfaction rate by 36.9% compared to DQN. It also outperforms SOTA baselines including OSPF, DDPG, D2-RMRL, and Graph-Mamba. Ablation studies validate the statistical significance (p < 0.05) of key components, confirming the synergistic gains from spatiotemporal joint modeling and proactive decision-making. This work advances SAGIN routing from passive response to active prediction, significantly enhancing network stability, resource utilization efficiency, and QoS guarantees, providing an innovative solution for 6G global seamless coverage and intelligent connectivity.

Keywords:

intelligent routing; deep reinforcement learning; graph convolutional network; transformer; spatiotemporal prediction; proximal policy optimization

1. Introduction

1.1. The Vision and Driving Forces of the Space–Air–Ground Integrated Network (SAGIN)

The Space–Air–Ground Integrated Network (SAGIN) is widely regarded as the core infrastructure for achieving the vision of “global seamless coverage” and “intelligent connectivity of all things” for sixth-generation mobile communication (6G) and future networks [1,2]. SAGIN aims to build a three-dimensional, wide-area, on-demand access communication system by integrating three layers of heterogeneous resources: space-based networks (such as low Earth orbit (LEO) and medium Earth orbit (MEO) satellite constellations), air-based networks (such as drone clusters and high-altitude platforms), and ground-based networks (such as terrestrial 5G/6G base stations and user terminals) [3,4].

The rise of SAGIN is driven by both technology and business. Technologically, the maturity of reusable rockets such as SpaceX’s Falcon 9 has greatly reduced satellite launch costs, notably for the Starlink constellation, making it commercially viable to deploy large constellations of thousands of satellites. On the business side, traditional terrestrial cellular networks cover only about 20% of the Earth’s land area, while vast oceans, deserts, mountains, and remote areas remain underserved. SAGIN, and LEO constellations in particular, benefit from the low propagation delay of low orbits (about 125 ms for MEO, and as low as 30 ms for LEO) and can therefore provide low-delay, high-bandwidth Internet access worldwide (including aviation, maritime, and remote areas), which is key to bridging the global digital divide [5,6].

1.2. Core Challenges of SAGIN Routing: High Dynamics and Strong Heterogeneity

Although SAGIN has broad prospects, its unique architecture also brings routing challenges far beyond those of the traditional Internet, which are mainly reflected in “high dynamics” and “strong heterogeneity” [7,8].

High dynamics: High dynamics represent the primary challenge for SAGIN routing. In the space layer, LEO satellites continuously orbit Earth at orbital velocities of approximately 7.66 km/s, causing periodic yet extremely frequent changes in the visibility of inter-satellite links (ISLs) and satellite–ground links (SGLs). Taking the Iridium constellation as an example, the average visibility window for a single satellite to a ground observer is only 9 min, after which a handover must be executed, leading to topology updates occurring at the minute or even second level [9]. Concurrently, while drone swarms in the air layer offer deployment flexibility, their adoption of uncertain movement models like random walks triggers non-periodic link additions and removals, further amplifying the unpredictability of the entire network topology [10,11]. The superposition of these two layers causes SAGIN’s link state to change at a pace far exceeding traditional terrestrial networks, imposing stringent demands for continuous real-time convergence on routing algorithms.

Strong heterogeneity: SAGIN is a “network of networks”. There are significant differences between the space-based, air-based, and ground-based layers in terms of node capabilities (satellites have limited computing resources), link characteristics (e.g., high-bandwidth laser ISL vs. weather-sensitive radio frequency SGL), and communication protocol stacks [12]. In addition, the services carried by the network are highly heterogeneous: for instance, real-time video streams require latency of less than 100 ms, while data backup services demand high bandwidth. Such differentiated Quality of Service (QoS) requirements place extremely high demands on the fine-grained scheduling capabilities of routing algorithms [8,13].

To more accurately characterize the heterogeneity and dynamics of SAGIN, this paper introduces multi-layer node attribute modeling (including satellites, UAVs, and ground stations) within the model. Specifically, this includes node processing capacity (CPU), link type (laser), link stability (QoS, packet loss rate), and node mobility (orbital parameters). Furthermore, experiments are driven by real-world TLE data to ensure the model accurately reflects high-dynamic scenarios.

1.3. Limitations of Traditional Routing Protocols in SAGIN

To address dynamic networks, researchers first attempted to apply traditional terrestrial routing protocols (e.g., OSPF, BGP) to SAGIN; both operational experience and analytical studies have demonstrated fundamental limitations [14]. Furthermore, with the coexistence of 5G and satellite networks, interference issues between terrestrial 5G base stations and satellite earth stations have become increasingly prominent. There is an urgent need to achieve coordinated operation of heterogeneous networks through guard band design and spectrum coordination mechanisms [15].

The design philosophy of these protocols is based on a core assumption: the network topology is basically static, and link changes (failures) are low-probability events. Therefore, they adopt a passive-convergence mechanism: flooding updates only after a topology change is detected and recalculating routes. However, in SAGIN, “change” is the norm rather than the exception.

(1): Failure of OSPF (Open Shortest Path First): In LEO networks, the high-speed movement of satellites transforms topology changes from “occasional events” into “the norm,” continuously triggering Link State Advertisement (LSA) floods from OSPF nodes. Existing research indicates that as node scale expands, both LSA volume and routing overhead increase exponentially. Maintaining topology synchronization alone may consume over 12% of onboard bandwidth [14]. More critically, while control messages are still propagating through the network, the next wave of link switches arrives: routing tables are marked “outdated” before they can converge, triggering periodic oscillations, loops, and severe performance degradation [9].
(2): Lag of BGP (Border Gateway Protocol): BGP’s distributed path discovery mechanism reveals critical shortcomings in highly dynamic inter-satellite topologies. Whenever a link switch occurs, BGP must undergo a lengthy cycle of “path discovery–revocation–rediscovery,” with convergence times often reaching tens of seconds or even minutes. During this period, the data plane is forced to use suboptimal or invalid routes, causing prolonged connection interruptions and high latency. For SAGIN real-time services with tight end-to-end latency budgets (e.g., below 100 ms for real-time video), this “disconnect-then-reconnect” behavior is clearly unacceptable [15].
(3): Overhead of centralized routing: Beyond high dynamics and strong heterogeneity, centralized routing schemes face non-negligible overhead. Satellite bandwidth is scarce (especially on satellite–ground links), and uplink and downlink transmission delays are significant. State reporting and routing-rule distribution between the Ground Centralized Control Center (G-CCC) and satellites, UAVs, and ground nodes generate communication overhead, while spatiotemporal prediction and reinforcement learning decision-making at the G-CCC generate computational overhead. If not properly controlled, these overheads may occupy a large share of satellite bandwidth or prolong uplink and downlink response times, making the centralized scheme infeasible. LSA flooding in traditional protocols such as OSPF may consume more than 12% of onboard bandwidth [14], while existing intelligent routing schemes (such as DQN and DDPG) have not been specifically optimized for the overhead of centralized architectures. This gap is also addressed in this paper.

1.4. Evolution of Intelligent Routing: From Q-Routing to DRL

The failure of traditional protocols proves that SAGIN routing must shift from “passive convergence” to “active adaptation and prediction” [4,16]. This has promoted the development of intelligent routing algorithms based on machine learning (ML) [17].

(1): Phase 1: Classical reinforcement learning (Q-Routing): Early Q-Routing introduced distributed reinforcement learning into the network layer, enabling each node to maintain a Q-value table so that packets “learn” optimal paths while being forwarded [18]. However, this tabular storage suffers from the curse of dimensionality in SAGIN, where node scale and state space expand dramatically: the Q-table grows with the number of state–action pairs, which explodes as the network scales. Experiments show that when satellite counts exceed dozens, algorithm convergence time extends from seconds to minutes, with near-complete loss of generalization capability for unseen topologies [19]. Consequently, Q-Routing remains limited to small-scale static scenarios and cannot adapt to highly dynamic, large-scale space–air–ground networks.
(2): Phase 2: Deep reinforcement learning (DRL): To overcome the curse of dimensionality, deep reinforcement learning replaces Q-tables with deep neural networks, achieving end-to-end abstraction of high-dimensional states. DQN pioneered the integration of convolutional networks with Q-learning, enabling direct action value outputs on continuous vectors like delay and queue length [12,20,21,22]. Policy gradient methods such as PPO and A2C further enhanced training stability and sample efficiency by constraining step size and using advantage estimation [23,24]. Nevertheless, existing DRL routing still follows the passive “perception–action” paradigm: agents make decisions based solely on instantaneous features like current queue length and link delay. They cannot predict a satellite handover 10 min in advance or reserve bandwidth for sudden traffic surges, thus failing to exploit potential gains from SAGIN’s orbital periodicity and traffic predictability [25,26].

1.5. Contributions of This Paper: Deep Reinforcement Learning Based on Spatiotemporal Prediction

To address the above gaps, this paper proposes an intelligent routing algorithm based on a GCN + Transformer hybrid encoder and PPO reinforcement learning. Its core innovation lies in upgrading routing decisions from “passive reactive” to “proactive decision-making” [27,28,29].

The specific technical path is as follows:

(1): Spatiotemporal state prediction: The network state of SAGIN (latency, bandwidth, load) is essentially high-dimensional spatiotemporal graph data. To achieve proactive perception of network status, this paper treats SAGIN’s latency, bandwidth, and load as high-dimensional spatiotemporal graph signals and proposes a “GCN-Transformer” hybrid encoder: first, a Graph Convolutional Network (GCN) aggregates multi-hop neighbor features on the topological snapshot at each time step, extracting the spatial coupling relationships between satellite–satellite and satellite–ground links [30,31,32,33,34]; subsequently, the node embedding sequence output by the GCN is fed into the Transformer encoder, which captures the dynamic evolution of traffic and link quality over extended time spans through multi-head self-attention [35,36]. The two components are connected end to end to predict the entire network state for the next K time steps in a single pass, providing a reliable “preview” input for subsequent routing decisions [27,34,37].
(2): Hyperparameter optimization (PSO): For such a complex hybrid encoder, the hyperparameters (e.g., number of GCN layers, number of Transformer heads) are difficult to tune manually. This paper introduces the Particle Swarm Optimization (PSO) algorithm to automatically search for the hyperparameter combination that minimizes the prediction Mean Squared Error (MSE), which helps to avoid local optima.
(3): Intelligent routing decision (PPO): This paper deploys a Proximal Policy Optimization (PPO) agent within the Ground Centralized Control Center (G-CCC) and introduces “predict-then-decide” as a core innovation: the input to the Actor–Critic network is no longer the current instantaneous state but rather the future K-step network profile generated by the GCN-Transformer. This enables policy updates based on impending topological and traffic changes. To further compress the high-dimensional state space, the Actor network incorporates a spatiotemporal attention module: in the spatial dimension, it automatically focuses on nodes predicted to be congested and on high-SNR links; in the temporal dimension, it prioritizes upcoming load peak windows. This allows forward-looking end-to-end routing policies to be produced in a single pass, improving QoS and network utilization simultaneously [38,39,40,41].
(4): Summary of Contributions: This paper proposes a GCN-Transformer hybrid encoder to achieve high-precision spatiotemporal prediction of SAGIN network states. By introducing Particle Swarm Optimization (PSO) for automatic search of encoder hyperparameters, prediction errors are significantly reduced. Building upon this, a PPO-based routing agent is designed, driven by “predicted information + spatiotemporal attention” to make forward-looking QoS decisions. Finally, leveraging real-world datasets from CelesTrak, CAIDA, and CRAWDAD, we establish a parameterized experimental platform. Through comparative, ablation, scalability, and robustness experiments, we validate that our proposed solution outperforms OSPF, Q-Routing, and standard DQN-Routing algorithms [21,23,24], offering a novel intelligent routing paradigm for space–air–ground integrated networks.

2. Problem Modeling

2.1. Network Model

As mentioned earlier, we model the Integrated Space–Air–Ground Network (SAGIN) as a dynamic spatiotemporal graph $G = (V, E, T)$, with the following notations:

Node set $V$: This includes space-based nodes (LEO/MEO satellites), air-based nodes (UAV swarms), and ground-based nodes (ground stations and user terminals). Node attributes consist of position coordinates, current load (CPU/memory utilization), movement speed and direction (for satellites and UAVs), and node type.

Edge set $E$: This represents communication links (e.g., inter-satellite laser links, satellite–ground radio frequency links, air–ground links). Edge attributes include latency, bandwidth, packet loss rate, Signal-to-Noise Ratio (SNR), and link stability.

Time dimension $T$: The network state evolves over time, forming spatiotemporal sequence data. Each time step records changes in the aforementioned attributes.

The adjacency matrix $A$ of the graph at time $t$ is defined as

$A_{ij}(t) = \begin{cases} 1, & \text{if there is a link between nodes } i \text{ and } j \text{ at time } t, \\ 0, & \text{otherwise.} \end{cases}$

(1)

The node feature matrix $X(t) \in \mathbb{R}^{|V| \times F}$ captures $F$ features for each node at time $t$.
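As a small illustration of Eq. (1), the sketch below builds the 0/1 adjacency matrix for two topology snapshots. The node indices, the link lists, and the assumption that links are bidirectional are hypothetical choices made for the example, not part of the paper's simulation setup.

```python
def adjacency_matrix(num_nodes, active_links):
    """Build the 0/1 adjacency matrix A(t) of Eq. (1) from the links active at time t."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in active_links:
        A[i][j] = 1
        A[j][i] = 1  # assumption: every link is bidirectional
    return A


# Hypothetical 4-node network: 0 and 1 are satellites, 2 is a UAV, 3 is a ground station.
links_t0 = [(0, 1), (1, 2), (2, 3)]   # snapshot at time t0
links_t1 = [(0, 1), (0, 3)]           # after a handover: links 1-2 and 2-3 vanish, 0-3 appears

A_t0 = adjacency_matrix(4, links_t0)
A_t1 = adjacency_matrix(4, links_t1)

for row in A_t0:
    print(row)
```

Storing one such matrix per time step gives the sequence of snapshots that the spatiotemporal graph is built from.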

2.2. Optimization Objective

The optimization objective is to find the optimal routing path to achieve the comprehensive optimization of Quality of Service (QoS) indicators. This paper introduces an adaptive weight adjustment mechanism based on reinforcement learning. Specifically, during training, the PPO agent dynamically adjusts the weight coefficients for delay, bandwidth, and packet loss rate within the objective function based on the current network state (e.g., link load, SNR, queue length). The weight adjustment strategy is implicitly learned through the reward function, eliminating the need for manual intervention. It aims to minimize end-to-end latency, maximize effective bandwidth (bottleneck bandwidth), and minimize packet loss rate, while satisfying QoS constraints of different services.

The multi-objective optimization is formulated as follows:

$\min \sum_{p \in P} \left( w_{1} \cdot d_{p} + w_{2} \cdot (1 - b_{p}) + w_{3} \cdot l_{p} \right),$

(2)

where $P$ denotes the set of all routing paths and $d_{p}$ is the end-to-end latency of path $p$ (sum of link latencies):

$d_{p} = \sum_{e \in p} d_{e},$

(3)

where $d_{e}$ is the latency of edge $e$ and $b_{p}$ is the normalized minimum bandwidth of path $p$ (bottleneck bandwidth):

$b_{p} = \min_{e \in p} \frac{b_{e}}{b_{\max}},$

(4)

where $b_{e}$ is the bandwidth of edge $e$ and $b_{\max}$ is the maximum possible bandwidth; $l_{p}$ is the cumulative packet loss rate of path $p$:

$l_{p} = 1 - \prod_{e \in p} (1 - l_{e}),$

(5)

where $l_{e}$ is the packet loss rate of edge $e$, and weights $w_{i}$ are dynamically adjusted according to service priorities (e.g., $w_{1} = 0.5$ for latency-sensitive services). Constraints include QoS thresholds ($d_{p} \le D_{\max}$), network load balancing (avoiding node load > 80%), and energy constraints (for UAV nodes).
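To make Eqs. (2)–(5) concrete, the following sketch evaluates the three path metrics and the weighted cost for a single two-hop path. The link values, the dictionary layout, and the weights other than $w_{1} = 0.5$ are illustrative assumptions; the delay is assumed to be pre-normalized to a unitless range so that the three terms are comparable.

```python
import math


def path_metrics(links, b_max):
    """Return (d_p, b_p, l_p) for one path, following Eqs. (3)-(5)."""
    d_p = sum(link["delay"] for link in links)                     # Eq. (3): sum of link delays
    b_p = min(link["bw"] for link in links) / b_max                # Eq. (4): normalized bottleneck
    l_p = 1.0 - math.prod(1.0 - link["loss"] for link in links)    # Eq. (5): cumulative loss
    return d_p, b_p, l_p


def path_cost(links, b_max, weights=(0.5, 0.3, 0.2)):
    """Per-path term of Eq. (2); weights other than w1 = 0.5 are hypothetical."""
    w1, w2, w3 = weights
    d_p, b_p, l_p = path_metrics(links, b_max)
    return w1 * d_p + w2 * (1.0 - b_p) + w3 * l_p


# Hypothetical two-hop path: normalized delay, bandwidth in Mbit/s, loss as a fraction.
path = [
    {"delay": 0.1, "bw": 50.0, "loss": 0.01},
    {"delay": 0.2, "bw": 100.0, "loss": 0.02},
]
d_p, b_p, l_p = path_metrics(path, b_max=100.0)
cost = path_cost(path, b_max=100.0)
print(d_p, b_p, l_p, cost)
```

Note that the loss terms compose multiplicatively, so two links with 1% and 2% loss give a path loss slightly below 3%, whereas delays simply add.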

2.3. Challenge Modeling

As mentioned earlier, the main challenges faced by this modeling are as follows:

Heterogeneity: Networks at different layers have distinct link characteristics and node capabilities, leading to the complexity of cross-domain link scheduling and resource allocation.
Dynamics: The high-speed movement of satellites and UAVs results in rapid time-varying topological structures, requiring routing algorithms to have high adaptability.
High-dimensional spatiotemporal data: The combination of spatial (node) and temporal (historical) dimensions leads to a huge data scale $(O (n^{2} t))$ , which places extremely high demands on the representation capability and computational efficiency of algorithms.
Sudden anomalies: In actual deployment, SAGIN may encounter unpredictable sudden anomalies, such as link interruptions caused by extreme weather or temporary satellite failures.

2.4. Symbol Meaning

The notations used in this paper are summarized in Table 1.

3. Adopted Methods

To address the aforementioned challenges, this paper proposes a methodological framework. The overall system architecture, illustrating the interaction between the Ground Centralized Control Center (G-CCC) and the SAGIN tiers, is shown in Figure 1.

The algorithmic framework of the proposed GCN-T-PPO model, detailing the internal dataflow of the state prediction and routing decision modules, is illustrated in Figure 2. This framework includes a network state prediction model, an intelligent routing algorithm, and a centralized control architecture.

3.1. Network State Prediction Model (GCN + Transformer + PSO)

(1): Hybrid encoder framework: We adopt a GCN + Transformer hybrid encoder to process high-dimensional spatiotemporal data and achieve high-precision prediction.

GCN (spatial dependence): The GCN adopted in this paper belongs to the Spatial Graph Convolutional Network (Spatial GCN). Its core convolution operator is “neighborhood feature weighted aggregation”, and the specific definition is as follows:

Type of convolution operator: Spatial convolution, which captures spatial dependencies by directly aggregating the multi-hop neighborhood features of nodes, rather than using the Fourier transform approach of spectral convolution.

Message passing/aggregation function: For the $l$-th layer of GCN, the aggregation function of node $i$ is the normalized weighted sum of the features of neighboring nodes, and the message passing process is defined as

$m_{i}^{(l)} = \sum_{j \in N(i) \cup \{i\}} \hat{A}_{ij} H_{j}^{(l)} W^{(l)},$

(6)

$H_{i}^{(l+1)} = \sigma(m_{i}^{(l)}),$

(7)

where $N(i)$ represents the set of first-order neighboring nodes of node $i$, $\hat{A}_{ij}$ is an element of the normalized adjacency matrix $\hat{A}$ (reflecting the contribution weight of node $j$ to node $i$), and $\sigma(\cdot)$ is the ReLU activation function. The core of message passing is “balancing the differences in node degrees through the normalized adjacency matrix to achieve fair aggregation of neighborhood features”.

$\hat{A}$ is the normalized adjacency matrix:

$\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}},$

(8)

with $D$ being the degree matrix; $H^{(0)}$ is the initial node feature (position, load, etc.), and $W^{(l)}$ is the trainable weight.

GCN topology graph construction logic: The input topology graph of the GCN directly reuses the SAGIN dynamic spatiotemporal graph $G = (V, E, T)$ defined in Section 2.1. The specific mapping relationships are as follows:

Nodes: Exactly the same as $V$, including the space layer (LEO/MEO satellites), the air layer (UAV clusters), and the ground layer (gateways/user terminals).

Edges: In one-to-one correspondence with $E$, only retaining edges where “a communication link exists at the current moment” (i.e., edges where the element in the adjacency matrix $A(t)$ is 1), including inter-satellite laser links, satellite–ground radio frequency links, and air–ground links.

Node features: The initial feature matrix input to GCN is $H^{(0)}$, including position coordinates (x, y, z), CPU utilization, memory utilization, moving speed, moving direction, and node type.

Edge features: Indirectly integrated through the adjacency matrix $\hat{A}$ (the normalization process has implicitly included the connectivity weights of edges). Meanwhile, key attributes of edges (delay, bandwidth) are concatenated as supplementary dimensions to the node features, ensuring that the GCN can perceive differences in link quality during aggregation.
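The aggregation in Eqs. (6)–(8) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the toy graph, features, and weights are made up, and self-loops are added before normalization (an assumption, chosen so that the sum over $N(i) \cup \{i\}$ in Eq. (6) includes each node's own features).

```python
import math


def normalize_adjacency(A, add_self_loops=True):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of Eq. (8).

    Assumption: self-loops are added first so that node i contributes to its
    own aggregate, matching the sum over N(i) and {i} in Eq. (6).
    """
    n = len(A)
    A_loop = [[A[i][j] + (1 if add_self_loops and i == j else 0) for j in range(n)]
              for i in range(n)]
    deg = [sum(row) for row in A_loop]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * A_loop[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def gcn_layer(A_hat, H, W):
    """One layer: H^{(l+1)} = ReLU(A_hat H^{(l)} W^{(l)}), Eqs. (6)-(7)."""
    M = matmul(matmul(A_hat, H), W)
    return [[max(0.0, v) for v in row] for row in M]


# Toy 3-node chain 0 - 1 - 2 with two input features per node (hypothetical values).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[1.0, -1.0], [0.5, 0.5]]

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, H0, W0)
print(H1)
```

Stacking several such layers is what lets a node's embedding depend on multi-hop neighbors, since each layer extends the receptive field by one hop.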

Transformer (temporal dependence): The output $H$ of GCN is regarded as a time-series input and fed into the Transformer encoder. The multi-head self-attention mechanism of the Transformer can capture the long-term temporal dependence of indicators such as traffic load and link quality:

$Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V .$

(9)
Hybridization and output: The embedding of the GCN is fused with the positional encoding of the Transformer, and finally, the predicted values of latency, bandwidth, and packet loss rate for the next k steps are output through a fully connected layer.
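For readers who want to see Eq. (9) operate on a sequence of embeddings, here is a single-head scaled dot-product attention sketch in plain Python. It omits the learned projections, multiple heads, and positional encoding of a full Transformer encoder, and the input values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention, Eq. (9).

    Q, K, V are lists of row vectors; returns (outputs, attention weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Hypothetical embeddings of one node at three consecutive time steps.
H_seq = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
Z, attn = attention(H_seq, H_seq, H_seq)   # self-attention: Q = K = V
for row in attn:
    print([round(a, 3) for a in row])
```

Each output row is a weighted average of all time steps, which is how distant steps in the history window can influence the prediction directly.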

The pseudo-code for the network state prediction algorithm is presented in Algorithm 1.

Algorithm 1 Network state prediction using GCN + Transformer

Require: Historical network states $\{G(t-T), \dots, G(t-1)\}$, where $G(t) = (V, E, X(t), A(t))$
Ensure: Predicted states $\{\hat{G}(t+1), \dots, \hat{G}(t+k)\}$
1: Initialize GCN layers and Transformer encoder
2: for each time step $\tau = t-T$ to $t-1$ do
3:     Compute spatial embeddings: $H^{(\tau)} = \mathrm{GCN}(X(\tau), A(\tau))$
4: end for
5: Form time-series input: $H = [H^{(t-T)}, \dots, H^{(t-1)}]$
6: Add positional encoding to $H$
7: Compute temporal dependencies: $Z = \mathrm{TransformerEncoder}(H)$
8: Predict future states: $\hat{Y} = \mathrm{FC}(Z)$ // latency, bandwidth, packet loss
9: Return predicted graphs $\{\hat{G}(t+1), \dots, \hat{G}(t+k)\}$ from $\hat{Y}$

(2): PSO configuration: The PSO configuration in this paper is designed for a high-dimensional mixed search space (including integer- and continuous-valued parameters), with details as follows:

Swarm size: 50 particles (each particle corresponds to one hyperparameter combination with 8 dimensions, such as the number of GCN layers and the number of Transformer heads).

Iteration budget: 100 iterations (pre-experiments indicate that the best fitness stops improving within 100 iterations, and further iterations do not improve performance).

Inertia weight ($w$): Linearly decreasing strategy, $w \in [0.4, 0.9]$ (the initial value of 0.9 favors global exploration, and the final value of 0.4 strengthens local exploitation).

Acceleration coefficients: $c_{1} = 1.5$ (individual cognitive weight), $c_{2} = 2.0$ (group social weight).

Search space boundaries: Number of GCN layers $\in \{2, 3, 4, 5\}$ (integer), number of Transformer heads $\in \{4, 6, 8, 10, 12, 16\}$ (integer), learning rate $\in \{5 \times 10^{-4}, 1 \times 10^{-3}, 5 \times 10^{-3}, 1 \times 10^{-2}, 5 \times 10^{-2}\}$, and hidden layer dimension $\in \{128, 256, 512\}$ (integer).

Computational cost management:

Parallel computing: We use GPU multi-thread parallelism to evaluate the fitness (prediction MSE) of each particle, reducing the time per iteration from 120 s (serial) to 30 s.

Early stopping mechanism: If the global best fitness (MSE) changes by less than $1 \times 10^{-5}$ over 10 consecutive iterations, the search terminates early. This reduces the average number of iterations from 100 to 75.

We introduce the Particle Swarm Optimization (PSO) algorithm to automatically search for the optimal hyperparameter combination. PSO initializes a swarm of particles (e.g., 50 particles), where each particle represents a set of hyperparameters. By iteratively updating the position and velocity of particles, the prediction Mean Squared Error (MSE) on the validation set is used as the fitness function to perform global search for the optimal solution, aiming to achieve a prediction accuracy of >95%. The update equations for the velocity and position of each particle are as follows:

$\begin{cases} v_{i}(t+1) = w v_{i}(t) + c_{1} r_{1} (p_{best,i} - x_{i}(t)) + c_{2} r_{2} (g_{best} - x_{i}(t)), \\ x_{i}(t+1) = x_{i}(t) + v_{i}(t+1), \end{cases}$

(10)

where $v_{i}(t)$ and $x_{i}(t)$ are the velocity and position of particle $i$ at iteration $t$, respectively; $w$ is the inertia weight; $c_{1}$ and $c_{2}$ are acceleration coefficients; $r_{1}$ and $r_{2}$ are random numbers in [0, 1]; $p_{best,i}$ is the best position of particle $i$ so far; and $g_{best}$ is the global best position in the swarm.

The fitness function is defined as

$f(x_{i}) = \frac{1}{M} \sum_{m=1}^{M} (y_{m} - \hat{y}_{m})^{2},$

(11)

where $y_{m}$ and $\hat{y}_{m}$ are the true and predicted values for sample $m$, and $M$ is the number of validation samples.
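The update rule of Eq. (10), together with the linearly decreasing inertia weight and the early-stopping criterion described above, can be sketched as follows. This is a minimal sketch under stated assumptions: the fitness here is a hypothetical toy quadratic (`toy_mse`) standing in for the validation MSE of Eq. (11), the search space is continuous, and positions are clamped to the bounds. Mapping continuous positions onto the discrete hyperparameter sets is not shown.

```python
import random


def pso(fitness, bounds, n_particles=50, max_iter=100, c1=1.5, c2=2.0,
        w_start=0.9, w_end=0.4, patience=10, tol=1e-5, seed=0):
    """Minimize `fitness` with the velocity/position update of Eq. (10)."""
    rng = random.Random(seed)
    dim = len(bounds)
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]
    p_val = [fitness(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_val[i])
    g_best, g_val = p_best[g][:], p_val[g]
    stall, iters = 0, 0
    for it in range(max_iter):
        iters = it + 1
        # Inertia weight decreases linearly from w_start to w_end.
        w = w_start - (w_start - w_end) * it / max(1, max_iter - 1)
        prev = g_val
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                lo, hi = bounds[d]
                x[i][d] = min(hi, max(lo, x[i][d] + v[i][d]))  # clamp to the search space
            f = fitness(x[i])
            if f < p_val[i]:
                p_best[i], p_val[i] = x[i][:], f
                if f < g_val:
                    g_best, g_val = x[i][:], f
        # Early stopping: global best improved by less than tol for `patience` iterations.
        stall = stall + 1 if prev - g_val < tol else 0
        if stall >= patience:
            break
    return g_best, g_val, iters


def toy_mse(x):
    """Hypothetical stand-in for the validation MSE of Eq. (11)."""
    target = [1.0, -2.0]
    return sum((a - b) ** 2 for a, b in zip(x, target)) / len(x)


bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best, val, iters = pso(toy_mse, bounds)
print(best, val, iters)
```

In the paper's setting each fitness evaluation means training and validating the encoder, so the number of evaluations (swarm size times iterations) dominates the cost, which is why parallel evaluation and early stopping matter.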

3.2. Intelligent Routing Algorithm (PPO + Spatiotemporal Attention)

We adopt the Proximal Policy Optimization (PPO) reinforcement learning algorithm as the agent to achieve adaptive routing decision-making.

Reinforcement learning model (PPO):

State space ( $S$ ): Includes the predicted network states for the next k steps from Section 3.1, current service QoS requirements, and global load distribution.
Action space ( $A$ ): Probability distribution over the next-hop node (discrete) or routing path (continuous).
Reward function ( $R$ ): Designed as a weighted sum of multi-objective QoS terms, incentivizing the agent to select paths with low latency, high bandwidth, and low packet loss while accounting for load balancing $u$:

R = - α \cdot d_{p} + β \cdot b_{p} - γ \cdot l_{p} + δ \cdot u .

(12)
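A direct transcription of Eq. (12) is shown below as a sketch. The coefficient values are hypothetical, and $u$ is assumed to be a load-balancing score in [0, 1] where higher is better, since the text does not define it further.

```python
def reward(d_p, b_p, l_p, u, alpha=0.4, beta=0.3, gamma=0.2, delta=0.1):
    """Reward of Eq. (12); alpha..delta are hypothetical coefficients.

    d_p: normalized path delay, b_p: normalized bottleneck bandwidth,
    l_p: path loss rate, u: load-balancing score (assumed higher is better).
    """
    return -alpha * d_p + beta * b_p - gamma * l_p + delta * u


# Same bandwidth, loss, and load balance; only the delay differs.
r_fast = reward(d_p=0.3, b_p=0.5, l_p=0.0298, u=0.8)
r_slow = reward(d_p=0.6, b_p=0.5, l_p=0.0298, u=0.8)
print(r_fast, r_slow)
```

The signs mirror the objective in Section 2.2: delay and loss are penalized, while bandwidth and balanced load are rewarded.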

PPO ensures the stability of policy updates through the clipped surrogate objective.

Spatiotemporal attention mechanism:

To enable the PPO agent to focus on key information in the high-dimensional state space, we embed the spatiotemporal attention mechanism into the Actor (policy) network of PPO:

Spatial attention: Enables the agent to focus on important nodes and links in the current topology (e.g., links with high SNR and high remaining bandwidth):

α_{i j} = \frac{\exp (e_{i j})}{\sum_{k} \exp (e_{i k})}, e_{i j} = \frac{h_{i}^{T} h_{j}}{d} .

(13)

Temporal attention: Enables the agent to focus on key time steps in the prediction sequence (e.g., upcoming peak load moments).
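The spatial attention weights of Eq. (13) can be computed as in the sketch below. The embeddings and neighbor lists are hypothetical, and the softmax is assumed to run over each node's supplied candidate set, since the text does not state which nodes the sum over $k$ covers. The score is divided by $d$ as written in Eq. (13).

```python
import math


def spatial_attention(h, candidates):
    """Attention weights alpha[i][j] of Eq. (13).

    h: list of node embeddings; candidates: dict mapping node i to the nodes j
    it attends over (assumption: the softmax normalizes over this set).
    """
    d = len(h[0])
    alpha = {}
    for i, nodes in candidates.items():
        scores = [sum(a * b for a, b in zip(h[i], h[j])) / d for j in nodes]
        m = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nodes, exps)}
    return alpha


# Hypothetical embeddings: node 1 is similar to node 0, node 2 is orthogonal to it.
h = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alpha = spatial_attention(h, {0: [1, 2]})
print(alpha[0])
```

Temporal attention follows the same pattern, with the softmax taken over the k predicted time steps instead of over nodes.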

The pseudo-code for the PPO-based routing algorithm is presented in Algorithm 2.

Algorithm 2 Intelligent routing using PPO with spatiotemporal attention

Require: Predicted states $S = \{\hat{G}(t+1), \dots, \hat{G}(t+k)\}$, QoS requirements
Ensure: Optimal routing policy $\pi$
1: Initialize Actor network $\pi_{\theta}$ and Critic network $V_{\phi}$
2: while not converged do
3:     Collect trajectories using current policy $\pi_{\theta}$
4:     Compute advantages $\hat{A}$ using Critic $V_{\phi}$
5:     Embed spatiotemporal attention in Actor:
6:     for each state in trajectory do
7:         Compute spatial attention weights $\alpha_{ij}$
8:         Compute temporal attention over predictions
9:         Update action probabilities
10:    end for
11:    Update policy with clipped surrogate objective: $L(\theta) = \min\left(r(\theta)\hat{A}, \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A}\right)$
12:    Update Critic with MSE loss
13: end while
14: Return optimized policy $\pi$
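The clipped surrogate objective in step 11 can be illustrated numerically. The sketch below averages the per-sample objective over a batch; the clipping range $\epsilon = 0.2$ and the sample ratios and advantages are hypothetical, and the gradient step itself is not shown.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios: probability ratios pi_theta(a|s) / pi_theta_old(a|s);
    advantages: advantage estimates; eps = 0.2 is a hypothetical choice.
    """
    terms = []
    for r, adv in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * adv, r_clipped * adv))
    return sum(terms) / len(terms)


# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(clipped_surrogate([1.5], [1.0]))
# Negative advantage: the more pessimistic (clipped) value is kept.
print(clipped_surrogate([0.5], [-1.0]))
```

Taking the minimum removes the incentive to move the ratio outside the interval [1 − ε, 1 + ε], which is what keeps each policy update small.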

3.3. Centralized Control and Execution Architecture

This scheme adopts a “centralized training, centralized execution” architecture, with the core being the Ground Centralized Control Center (G-CCC). The centralized architecture of this scheme has taken into account the bandwidth constraints and uplink–downlink time characteristics of satellite networks during its design.

Data collection: G-CCC periodically collects state telemetry data from all network nodes (satellites, UAVs, ground stations) through standard network management protocols (such as gRPC telemetry) to form a spatiotemporal graph database.
State prediction: G-CCC uses its powerful ground computing resources to run the GCN-Transformer + PSO model in Section 3.1 to generate the global state prediction graph for the next K steps.
Routing decision (training): The PPO agent on G-CCC performs reinforcement learning training in an offline or quasi-online manner, using predicted data and reward functions to continuously optimize its Actor–Critic network.
Policy Execution: After training convergence, G-CCC calculates the optimal path (or routing policy) for carrying QoS services and issues explicit routing rules (e.g., source routing paths or updated forwarding table entries) to key nodes along the path (such as the entry ground station and core satellite nodes) for execution, realizing real-time adjustment of global routing and QoS guarantee.

3.4. Improvement Points

Compared to existing research, the main improvements of this scheme are as follows:

Spatiotemporal hybrid encoding: Compared with traditional Transformers (sequence modeling only) or GCNs (spatial modeling only), this scheme introduces a GCN-Transformer hybrid encoder that captures both graph-structure and time-series dependencies, significantly improving the representation and prediction of SAGIN’s heterogeneous, time-varying topology.
PSO automatic optimization: The PSO algorithm is introduced to optimize the hyperparameters of the complex hybrid encoder, replacing tedious and inefficient manual tuning. This helps prevent the model from falling into local optima and improves the generalization ability of the prediction model (expected test-set error reduction of 15–20%).
PPO with spatiotemporal attention: The standard PPO algorithm is improved with a spatiotemporal attention mechanism that enables the agent to filter out irrelevant features in high-dimensional spatiotemporal data and focus on key future nodes and time windows, improving the accuracy and efficiency of decision-making (expected computational complexity reduction of 20–30%).
Global forward-looking optimization: Unlike the local, passive optimization of protocols such as OSPF and the reactive optimization of standard DRL, this scheme exploits the global view and predictive information of G-CCC to achieve joint, forward-looking optimization across the space, air, and ground layers, improving overall resource utilization (expected increase of 25%).
Frontier of the scheme: This scheme draws on recent research results from NeurIPS 2023 on a Dynamic Network Graph Transformer (GNN + Transformer) and from IEEE JSAC 2024 on an Evolutionary Algorithm Optimization Transformer, keeping the design aligned with the current state of the art.

4. Experimental Evaluation and Result Analysis

To comprehensively verify the effectiveness of the proposed GCN-T-PPO algorithm, we designed and conducted a series of experiments based on parameterized real-world datasets.

4.1. Experimental Environment and Dataset Setup

(1)

Experimental platform and framework: The experimental environment is built based on Python 3.10.

Core framework: PyTorch (running on Python 3.10) is used as the deep learning framework to implement the GCN, Transformer, and PPO algorithms.
Topology generation: The satgenpy and PyEphem libraries are utilized. satgenpy can parse TLE (Two-Line Element) data, calculate satellite orbital positions, and generate time-varying (per-timestamp) adjacency matrices of satellite networks.
Graph analysis: NetworkX is used for network graph modeling, path calculation of benchmark algorithms (e.g., OSPF’s Dijkstra), and analysis of network metrics.
Design reference: The design of the experimental platform refers to the ideas of public satellite network research tools such as StarPerf and Hypatia.
Manual parameter tuning process: The number of GCN layers was varied from 2 to 5, the number of Transformer heads from 4 to 16, and the learning rate over $\{5 \times 10^{-4}, 1 \times 10^{-3}, 5 \times 10^{-3}, 1 \times 10^{-2}, 5 \times 10^{-2}\}$.

The combination yielding the minimum validation-set MSE was ultimately selected (GCN = 3 layers, Transformer = 8 heads, lr = 1 × 10⁻³).
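The selection rule described above (pick the combination with the minimum validation-set MSE) amounts to an exhaustive grid search. A minimal sketch follows; `validation_mse` stands in for training and validating the GCN-Transformer predictor, and both the toy scoring function and the head counts {4, 8, 16} are hypothetical placeholders, not values or code from the paper.

```python
import math
from itertools import product

def grid_search(gcn_layers, heads, learning_rates, validation_mse):
    """Return the (layers, heads, lr) tuple with the lowest validation MSE."""
    best_cfg, best_mse = None, float("inf")
    for cfg in product(gcn_layers, heads, learning_rates):
        mse = validation_mse(*cfg)  # in practice: train the model, then validate
        if mse < best_mse:
            best_cfg, best_mse = cfg, mse
    return best_cfg, best_mse

def toy_validation_mse(layers, n_heads, lr):
    """Hypothetical stand-in whose minimum sits at (3, 8, 1e-3)."""
    return abs(layers - 3) + abs(n_heads - 8) / 8 + abs(math.log10(lr) + 3)

best, _ = grid_search([2, 3, 4, 5], [4, 8, 16],
                      [5e-4, 1e-3, 5e-3, 1e-2, 5e-2], toy_validation_mse)
```

The cost of this loop grows with the product of the candidate set sizes, which is the practical motivation for replacing it with a population-based search such as PSO.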

(2)

Parameterization of SAGIN network topology: We constructed the SAGIN topology based on real constellation and mobility model parameters:

1.

Space tier:

Constellation model: Walker–Delta constellation based on the Iridium constellation.
Parameters: 66 satellites, distributed in 6 orbital planes with 11 satellites per plane.
Orbit: Orbital altitude of 780 km and an inclination of 86.4° (polar orbit).
Data source: Orbital parameters are initialized using Iridium TLE data provided by CelesTrak.
Links: Inter-Satellite Laser (ISL) links are set with a bandwidth of 10 Gbps.

2.

Air tier:

Model: A swarm of 20 unmanned aerial vehicles (UAVs) covering a 10 km × 10 km hot-spot area.
Mobility: Adopting the 2D random walk (RW) mobility model commonly used in the CRAWDAD dataset.
Parameters: UAV speeds randomly vary between 5 m/s (low speed) and 20 m/s (high speed).

3.

Ground tier: 50 ground gateways. To realistically simulate large-scale backhaul traffic between the space-based network and the terrestrial Internet backbone, we deploy 50 ground gateway nodes. According to the definition in IETF RFC 9717, these gateways are core hubs connecting satellite networks and ground wired networks. The nodes are not randomly placed; they correspond to the locations of 50 major global Internet Exchange Points (IXPs) and core data center clusters (e.g., Frankfurt, Ashburn, Singapore, and Tokyo). This ensures that the simulated traffic model (see Section 4.1, item (3)) reflects real-world global “satellite–ground gateway” connectivity and stress-tests the algorithm under “backhaul bottlenecks” (a key challenge identified earlier). The G-CCC (Ground Centralized Control Center) is deployed at one of the major gateway nodes.
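Two of the topology parameters above can be sanity-checked with a short sketch: the orbital period implied by a 780 km circular orbit, and one step of a 2D random-walk UAV confined to the 10 km × 10 km area. The Earth constants are standard values; clamping positions at the area boundary and drawing a fresh speed and heading at every step are assumptions, since the paper does not describe its boundary handling.

```python
import math
import random

MU_EARTH = 398600.4418  # Earth's gravitational parameter, km^3/s^2
R_EARTH = 6371.0        # mean Earth radius, km

def orbital_period_s(altitude_km):
    """Period of a circular orbit at the given altitude (Kepler's third law)."""
    a = R_EARTH + altitude_km
    return 2.0 * math.pi * math.sqrt(a ** 3 / MU_EARTH)

def random_walk_step(x, y, dt, rng, area_m=10_000.0, v_min=5.0, v_max=20.0):
    """Advance one UAV by dt seconds with a random speed and heading."""
    speed = rng.uniform(v_min, v_max)          # m/s, within the stated 5-20 range
    heading = rng.uniform(0.0, 2.0 * math.pi)  # radians
    # Assumed boundary rule: clamp to the hot-spot area.
    x = min(max(x + speed * math.cos(heading) * dt, 0.0), area_m)
    y = min(max(y + speed * math.sin(heading) * dt, 0.0), area_m)
    return x, y
```

For an altitude of 780 km, `orbital_period_s` gives a period of roughly 100 minutes, which sets the time scale on which the inter-satellite adjacency matrices change.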

(3)

Traffic load and QoS model: Traditional network evaluation often models service arrivals as a Poisson process. However, numerous studies have confirmed that real-world Internet traffic (e.g., WAN and LAN traffic) is “bursty” and “self-similar”, with “heavy-tailed” distributions. The Poisson model fails to capture such burstiness, leading to over-optimistic evaluations of algorithm performance.

Therefore, we adopt two traffic models, parameterized from measured data, together with two service classes:

Background traffic (Poisson): Flow arrivals follow a Poisson process (i.e., exponentially distributed inter-arrival times); this model is used for standard load testing.
Bursty traffic (Pareto): The size and duration of traffic flows follow a Pareto distribution.
Data source: Statistical parameters of the traffic model, such as average flow rate, flow duration, and packet size distribution, are extracted from the public statistics of the 2024–2025 100G link passive dataset released by CAIDA.
QoS requirements: Two types of services are set: (1) real-time video (latency < 100 ms, bandwidth > 10 Mbps); (2) best-effort data transmission.
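The two traffic models can be sketched with the Python standard library. This is an illustrative generator only: the arrival rate, Pareto shape, and minimum flow size are hypothetical parameters (the paper derives its values from CAIDA statistics), and the mapping from the paper’s “burst intensity” to a Pareto shape parameter is not specified in the text.

```python
import random

def poisson_arrivals(rate_per_s, horizon_s, rng):
    """Arrival times of a Poisson process: exponential inter-arrival gaps."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= horizon_s:
            return times
        times.append(t)

def pareto_flow_sizes(n, shape, min_size_bytes, rng):
    """Heavy-tailed flow sizes; paretovariate() returns values >= 1."""
    return [min_size_bytes * rng.paretovariate(shape) for _ in range(n)]

rng = random.Random(42)
arrivals = poisson_arrivals(rate_per_s=50.0, horizon_s=10.0, rng=rng)
sizes = pareto_flow_sizes(len(arrivals), shape=1.5, min_size_bytes=1500.0, rng=rng)
```

A smaller shape parameter produces a heavier tail, so a few very large flows dominate the offered load, which is the burstiness that a Poisson-only model would miss.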

Table 2 summarizes the key experimental parameters.

The key performance metrics used in the evaluation are defined as follows:

Average end-to-end delay: For a set of N packets, the average delay is computed as

$\bar{D} = \frac{1}{N} \sum_{i = 1}^{N} \left(T_{received, i} - T_{sent, i}\right),$ (14)

where $T_{sent, i}$ and $T_{received, i}$ are the sending and receiving times of packet $i$, respectively.

Packet loss rate: The ratio of lost packets to total sent packets:

$PLR = \frac{N_{lost}}{N_{sent}}.$ (15)

QoS satisfaction rate: For real-time video services, the proportion of packets that meet the delay requirement (delay < 100 ms) is defined as

$QoS_{satisfaction} = \frac{N_{\text{on-time}}}{N_{total}},$ (16)

where $N_{\text{on-time}}$ is the number of packets with delay less than 100 ms and $N_{total}$ is the total number of real-time video packets.
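Equations (14)–(16) translate directly into code. The sketch below assumes per-packet timestamps and delays are available as plain lists; the function names are illustrative, not taken from the authors’ implementation.

```python
def average_delay(sent_times, received_times):
    """Eq. (14): mean of (T_received - T_sent) over delivered packets."""
    delays = [r - s for s, r in zip(sent_times, received_times)]
    return sum(delays) / len(delays)

def packet_loss_rate(n_lost, n_sent):
    """Eq. (15): fraction of sent packets that were lost."""
    return n_lost / n_sent

def qos_satisfaction(delays_ms, threshold_ms=100.0):
    """Eq. (16): share of packets with delay strictly below the threshold."""
    on_time = sum(1 for d in delays_ms if d < threshold_ms)
    return on_time / len(delays_ms)
```

Note that the strict inequality means a packet arriving at exactly 100 ms counts as late, matching the “latency < 100 ms” requirement for real-time video.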

4.2. Baseline Algorithms

To comprehensively evaluate the GCN-T-PPO algorithm, we select four baseline algorithms representing different technical schools, ranging from a traditional protocol to 2024 state-of-the-art (SOTA) graph neural network (GNN) architectures:

OSPF (Open Shortest Path First): Serves as the baseline for traditional Interior Gateway Protocols (IGPs). In the experiments, it represents a greedy algorithm based on the instantaneous shortest-delay path. In the scenario with 50 ground gateways and a highly dynamic topology, its performance is expected to collapse due to Link State Advertisement (LSA) flooding overhead.
DDPG-Routing (Advanced DRL Baseline): A deep reinforcement learning algorithm based on the Actor–Critic (AC) architecture. Unlike DQN, which estimates Q-values over discrete actions, DDPG (Deep Deterministic Policy Gradient) uses a policy network (Actor) to directly output deterministic continuous actions (or high-dimensional discrete actions), making it more expressive than DQN in the high-dimensional, continuous state space of SAGIN (such as precise delay and bandwidth values). Like our method, it runs on G-CCC, but it is essentially reactive, i.e., it makes decisions based on the current state.
D2-RMRL (SOTA Meta-RL Baseline): A state-of-the-art meta-reinforcement learning (Meta-RL) routing algorithm specifically designed for satellite networks. The core idea of D2-RMRL (Distributed and Distribution-Robust Meta-Reinforcement Learning) is “learning to learn”: it is trained through meta-learning under various network topologies and traffic patterns, enabling it to adapt quickly to unseen topology changes or sudden traffic encountered in a real SAGIN. This makes it a demanding test of the robustness and adaptability of our “predictive” model.
Graph-Mamba-Routing (SOTA GNN Baseline): To fairly assess the effectiveness of our GCN-Transformer encoder, we introduce a baseline built on a 2024 SOTA GNN architecture. This algorithm uses a Graph-Mamba encoder instead of our GCN-Transformer. Mamba (a state space model, SSM) is a major competitor of the Transformer in long-sequence modeling, theoretically offering equivalent sequence modeling capability with higher computational efficiency. The decision-making stage of this baseline also uses PPO to ensure a fair comparison, thereby isolating the performance difference between the encoders (GCN-T vs. Graph-Mamba).
The proposed method (GCN-T-PPO): The complete proposed scheme. It runs on G-CCC and combines the spatiotemporal prediction of the GCN-Transformer with the proactive decision-making of PPO.
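The OSPF baseline is described as a greedy choice of the instantaneous shortest-delay path, which is Dijkstra’s algorithm on a delay-weighted graph. The paper uses NetworkX for this; the sketch below uses only the standard library so it is self-contained, and the node names and delay values in the usage note are hypothetical.

```python
import heapq

def shortest_delay_path(graph, src, dst):
    """Dijkstra on a snapshot graph {node: {neighbor: delay_ms}}.

    Returns (path, total_delay_ms), or (None, inf) if dst is unreachable.
    """
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry
        visited.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None, float("inf")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]
```

For example, with `{"gw": {"sat1": 5, "sat2": 3}, "sat1": {"uav": 2}, "sat2": {"uav": 6}, "uav": {}}`, the route from `"gw"` to `"uav"` goes through `"sat1"`. Because the computation uses only the current snapshot, the result becomes stale as soon as link delays or the topology change, which is the reactive behavior the proposed method aims to avoid.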

4.3. Performance Comparison and Analysis

This section demonstrates the superiority of the proposed scheme in key performance metrics through comparative experiments.

(1)

Scenario 1: Performance under different network loads (Poisson traffic):

Experimental setup: In the environment described above, Poisson traffic is injected into the network, with the total load increasing from 10% (low load) to 90% (high congestion).
Evaluation metrics: (1) Average end-to-end delay (ms); (2) packet loss rate (%).

The experimental results (Table 3 and Figure 3) show that, with 50 ground gateways, OSPF cannot handle the large-scale dynamic topology and backhaul bottlenecks: its performance degrades sharply, with delay increasing exponentially once the load exceeds 50%.

DDPG-Routing (advanced reactive DRL) significantly outperforms the DQN baseline, but its performance deteriorates under high load (≥70%). The SOTA baselines D2-RMRL (Meta-RL) and Graph-Mamba (SOTA GNN) demonstrate strong performance. Graph-Mamba, with its powerful spatiotemporal encoding capability, approaches our method in delay control.

The proposed method (GCN-T-PPO) maintains the lowest delay across all loads. For example, under 90% high congestion load, the delay of our scheme (80.4 ms) is reduced by 18.6% compared to the SOTA Graph-Mamba (98.8 ms) and by 35.7% compared to D2-RMRL (125.1 ms). This strongly demonstrates the superiority of our GCN-Transformer predictor combined with spatiotemporal attention PPO in “proactive congestion avoidance”.
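The percentage reductions quoted above follow from the usual relative-reduction formula, (baseline − proposed) / baseline. A short check using the delay values reported in the text:

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline * 100.0

vs_graph_mamba = relative_reduction(98.8, 80.4)   # delay in ms at 90% load
vs_d2_rmrl = relative_reduction(125.1, 80.4)
```

Rounded to one decimal place, these evaluate to 18.6% and 35.7%, consistent with the figures stated in the paragraph.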

The variation trend of the packet loss rate (Figure 4) is consistent with that of delay. OSPF suffers a packet loss rate as high as 25.4% under high load due to routing oscillation and slow convergence. The packet loss rates of DDPG-Routing, D2-RMRL, and Graph-Mamba decrease in that order, demonstrating the capability of SOTA algorithms in handling congestion.

The proposed method (GCN-T-PPO) maintains a packet loss rate of 1.2% even under 90% load, significantly lower than that of Graph-Mamba (5.1%). This is because the packet loss rate is a key penalty term ($-γ \cdot lp$) in the reward function of our scheme. By combining GCN-T’s prediction of future packet loss rates with the spatial attention mechanism in PPO, the agent learns to proactively select stable, high-QoS links (e.g., links with high SNR), avoiding packet loss at the source.

(2)

Scenario 2: QoS satisfaction under bursty traffic (Pareto):

Experimental setup: Fix the total network load at 70%, and generate bursty traffic using the aforementioned Pareto traffic model. The X-axis represents the burst intensity (the larger the value, the stronger the burst), and the Y-axis represents the QoS satisfaction rate of real-time video services (i.e., the proportion of packets with delay < 100 ms).

Figure 5 tests the algorithms’ capability to handle Pareto bursty traffic, which is the most demanding test of their temporal prediction ability. As burstiness increases, the QoS satisfaction rates of the “reactive” algorithms (OSPF, DDPG-Routing) and the “adaptive” algorithm (D2-RMRL) decrease sharply. Although D2-RMRL can adapt to topology changes, it is underprepared for unexpected traffic surges. Graph-Mamba, with its powerful Mamba sequence modeling capability, achieves a significantly higher QoS satisfaction rate (78.5% at intensity 2.0) than the other baselines, demonstrating the effectiveness of SOTA spatiotemporal GNNs in traffic prediction. However, the proposed method (GCN-T-PPO) still clearly outperforms Graph-Mamba, maintaining a QoS satisfaction rate of 90.8% at intensity 2.0. This supports the prediction capability of our GCN-Transformer encoder and the key role of the “temporal attention” mechanism in the PPO agent, which lets the agent prioritize predicted future congestion time slots and thereby reserve bandwidth for bursty traffic.

4.4. Ablation Study

This section quantifies the contribution of each module by removing key components through ablation experiments.

(1)

Experimental setup: Use scenarios with high load and high burstiness. Compare the proposed method with five “incomplete” variants:

W/o PSO (Remove PSO): Use manually tuned GCN-T hyperparameters. The manual parameter tuning process: The number of GCN layers was varied across {2, 3, 4}, the number of Transformer heads was varied across {4, 6, 8}, and the learning rate was varied across {1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻²}. The combination yielding the minimum validation set MSE was ultimately selected (GCN = 3 layers, Transformer = 8 heads, lr = 5 × 10⁻⁴).
W/o GCN (Transformer Only): Only use Transformer to process sequence data, ignoring topology.
W/o Transformer (GCN Only): Only use GCN to process instantaneous snapshots, ignoring temporal dependencies.
W/o Spatiotemporal Attention (Standard PPO): The Actor network of PPO uses standard fully connected layers.
W/o Predictor (RL Only): i.e., the DQN-Routing baseline, without prediction capability.

As shown in the experimental results of Table 4, the proposed method (GCN-T + PSO) achieves the lowest prediction MSE (0.04).

The MSE of W/o PSO is relatively high (0.11), demonstrating the effectiveness of PSO automatic hyperparameter optimization.
The MSE of W/o GCN is very high (0.25), demonstrating that spatial topology is key information for predicting delay, and time series alone cannot capture the influence of neighboring nodes.
The MSE of W/o Transformer is the highest (0.32), demonstrating that historical temporal dependencies are also critical, and the current GCN snapshot alone cannot predict congestion trends.

Conclusion: Both the GCN and Transformer are indispensable, and their spatiotemporal modeling is the foundation for achieving high-precision prediction.

As shown in the experimental results of Table 5, the proposed method (PPO + Attn) converges the fastest and ultimately achieves the highest average reward (0.95):

The reward of W/o Spatiotemporal Attention (Standard PPO) is relatively low (0.80), which demonstrates the value of spatiotemporal attention. The Actor network of standard PPO is overwhelmed by high-dimensional states and cannot distinguish key information. In contrast, the attention mechanism helps PPO focus on the “future congestion points” predicted by GCN-T, leading to more accurate decisions.
The reward of W/o Predictor (i.e., DQN-Routing) is the lowest (0.65), which again demonstrates that “predictive” decision-making is superior to “reactive” decision-making.

4.5. Scalability Test

Scalability is a key factor determining whether an AI algorithm can be practical in large-scale networks.

(1): Experimental setup: Vary the network scale with the total number of nodes N = {50, 100, 200, 500}. (Small scale corresponds to Iridium, and large scale corresponds to Starlink.)
(2): Evaluation metrics: (1) Computational overhead of G-CCC (CPU load %); (2) algorithm convergence time (s).

As shown in Table 6 and Figure 6, computational overhead results indicate that GNN-based algorithms (the proposed method and Graph-Mamba) benefit from the parameter sharing of the GCN, thus achieving the best scalability in computational overhead, with CPU load growing nearly linearly. The CPU load of DDPG (using standard DNN) increases linearly with N, as its input layer dimension is strongly correlated with N. D2-RMRL has the largest computational overhead due to the complexity of meta-learning, and encounters out-of-memory (OOM) when N > 200.

As shown in Table 7 and Table 8, the convergence time results reveal a key trade-off:

D2-RMRL has the longest initial training time (850.5 s when N = 50) because it needs to learn “how to learn”. However, once trained, it demonstrates the hallmark advantage of Meta-RL when facing new topologies (N = 100, 200): extremely fast adaptation (≤30 s).
For DDPG and Graph-Mamba, as standard DRL/GNN methods, training convergence time increases significantly with N.
The proposed method (GCN-T-PPO) achieves the shortest training convergence time, with the most gradual growth, across all scales (only 510.9 s when N = 500). This benefits from the inductive learning capability of the GNN and the high sample efficiency of PPO (compared to DDPG).
Computational overhead: When the number of network nodes scales from 50 to 500, the CPU load of this scheme increases nearly linearly (slower than the exponential growth of other algorithms). This indicates that the parameter sharing of the GCN and the sample efficiency of PPO effectively control the computational overhead, making the scheme suitable for the limited computing resources of satellite networks.

Conclusion: The proposed scheme is more suitable for fast offline retraining and policy updating in the SAGIN environment, while D2-RMRL is more suitable for scenarios requiring online real-time adaptation to unknown topologies.

Training performance under different learning parameters: The GCN-T-PPO algorithm was evaluated with several learning rates, as shown in Figure 7. Different learning rates produced distinct convergence behavior and final outcomes. At the onset of training, rewards were low for all settings. As the number of episodes increased, rewards rose significantly and convergence accelerated. With a learning rate of 0.001, the reward surpasses that of all other learning rates; rates either higher or lower than 0.001 perform worse to varying degrees. The worst results occur at a learning rate of 0.05, where training fails to converge to a stable outcome. This indicates that an overly large learning rate produces update steps that are too coarse for the policy to settle on a good solution. Conversely, an excessively small learning rate may cause PPO to become trapped in local optima: with such a small step size, PPO explores only a narrow range around the local optimum and fails to conduct a broader search. Considering the practical implementation of the algorithm, a learning rate of 0.001 is selected.

As observed in Figure 8, the convergence rates of the four algorithms are similar. However, the final rewards reached by GCN-T-PPO and Graph-Mamba are significantly higher than those of the other two algorithms. Although these two final rewards are very close, GCN-T-PPO consistently achieved higher rewards than Graph-Mamba during training. Additionally, all algorithms stabilize after 200 training iterations, with no significant oscillations observed. The proposed algorithm therefore retains an advantage in training behavior as well.

4.6. Robustness Test

This section tests the recovery capability of algorithms when facing sudden network failures.

(1): Experimental setup: In experiments running stably under high load (80%), at T = 100 s, 10% of Inter-Satellite Links (ISL) in the network fail simultaneously at random.
(2): Evaluation metric: Packet Loss Rate (PLR) over time.
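The failure injection in this setup (10% of ISLs failing simultaneously at random) can be sketched as follows. The link representation as `(node_a, node_b)` tuples and the fixed seed are assumptions made for reproducibility of the illustration; the paper does not describe its injection code.

```python
import random

def fail_random_links(links, fraction, rng):
    """Remove a random fraction of links; return (surviving, failed)."""
    n_fail = round(len(links) * fraction)
    failed = set(rng.sample(links, n_fail))  # sample without replacement
    surviving = [link for link in links if link not in failed]
    return surviving, failed

# Hypothetical ISL list: 100 links between consecutively numbered satellites.
isls = [(i, i + 1) for i in range(100)]
surviving, failed = fail_random_links(isls, 0.10, random.Random(7))
```

Applying the returned `surviving` list to the topology at T = 100 s and logging PLR per second afterwards reproduces the shape of the experiment described here.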

As shown in Figure 9, the robustness test clearly demonstrates the failure recovery mechanisms of different algorithms:

T = 100 s (failure): With 10% ISL links failing, the PLR of all algorithms surges instantaneously.
T > 100 s (recovery period):

OSPF: PLR remains at a high level for at least 30 s. OSPF must wait for Link State Advertisement (LSA) flooding and global re-convergence, which is extremely slow in Low Earth Orbit (LEO) networks, and packets are continuously lost in the meantime.

DDPG/Graph-Mamba: As standard centralized “reactive” DRL methods, they rely on G-CCC detecting the link failures (state changes) at T = 101 s, after which the algorithms recalculate and issue new paths. The recovery time is about 5–8 s.

D2-RMRL: Demonstrates the excellent adaptability of meta-learning. It restores PLR to a stable level (2.8%) by T = 102 s (only 2 s after the failure). It does not wait for rules issued by G-CCC but autonomously “adapts” to network changes.

Proposed method (GCN-T-PPO): Shows the best robustness. PLR has recovered to 1.8% at T = 102 s (±2 s). This shows that the “prospective” nature of our scheme is reflected not only in predicting congestion but also in the fact that the PPO policy network has learned optimal avoidance strategies under various failure modes during the training phase. Different from the “online adaptation” of D2-RMRL, our model is “predicted offline”, so it has the fastest recovery speed and the strongest network resilience.

The proposed method recovers fastest because the PPO policy network has learned coping strategies for various failure scenarios during the training phase. When a failure occurs (T = 100 s), G-CCC collects telemetry at T = 101 s, the GCN-T predictor updates its prediction immediately (millisecond level), and the PPO agent does not recalculate but immediately matches and outputs the optimal failure-avoidance path from its pre-learned policy network. This near-real-time failure recovery capability is far superior to the slow distributed convergence of OSPF and the centralized recalculation of DQN.

To verify the generalization ability of the single hyperparameter set optimized by PSO (GCN = 3 layers, Transformer = 8 heads, lr = 1 × 10⁻³, etc.), validation was conducted under different network scales (N = 50, 100, 200, 500), and the experimental scenario is consistent with the scalability test in Section 4.5. The results are as follows:

Robustness: The hyperparameter set optimized by PSO has a stable prediction MSE ranging from 0.038 to 0.048 across all network scales, a fluctuation of about 26%, which is much lower than that of manual tuning (35% fluctuation) and random hyperparameters (18% fluctuation).

Generalization: At the maximum network scale N = 500, the MSE of PSO hyperparameters is only 26% higher than that at the minimum scale N = 50, while the MSE of manually tuned hyperparameters increases by 35%, proving that this hyperparameter set can adapt to SAGIN topologies of different scales.

4.7. Complexity–Performance Trade-Off Optimization

To quantify the balance between the performance gains of the proposed GCN-T-PPO architecture and its computational/communication costs, we refer to the design idea of the delay-weighted decoding metric in reference [42] and combine it with the characteristics of the SAGIN routing scenario to define the routing weighted complexity metric $W_{tradeoff}$.

(1): Definition of core parameters:

Comprehensive QoS performance score $P_{perf}$: Integrates the three core indicators of delay, packet loss rate, and bandwidth utilization, with weights consistent with the optimization objectives ($ω_{d} = 0.4$, $ω_{l} = 0.3$, $ω_{b} = 0.3$):

$P_{perf} = ω_{d} \cdot \left(1 - \frac{d}{d_{\max}}\right) + ω_{l} \cdot (1 - l) + ω_{b} \cdot b,$ (17)

where $d$ is the average end-to-end delay (ms); $d_{\max} = 100$ ms is the delay threshold for real-time video services; $l$ is the packet loss rate (reported in %, expressed as a fraction in Equation (17)); and $b = \frac{b_{p}}{b_{\max}}$ is the normalized bottleneck bandwidth, where $b_{p}$ is the minimum bandwidth along the path and $b_{\max} = 10$ Gbps.
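Equation (17) can be evaluated with a few lines of code. The sketch below assumes the packet loss rate is passed as a fraction (so that $1 - l$ stays in [0, 1]), which is an interpretation rather than something the text states explicitly; the 5 Gbps bottleneck in the example is a hypothetical input, while the delay and loss values are those reported for the proposed method in the conclusions.

```python
def qos_score(delay_ms, loss_fraction, bottleneck_gbps,
              d_max_ms=100.0, b_max_gbps=10.0,
              w_d=0.4, w_l=0.3, w_b=0.3):
    """Eq. (17): weighted QoS score from delay, loss, and bottleneck bandwidth."""
    b = bottleneck_gbps / b_max_gbps  # normalized bottleneck bandwidth
    return (w_d * (1.0 - delay_ms / d_max_ms)
            + w_l * (1.0 - loss_fraction)
            + w_b * b)

# 68.9 ms delay, 1.0% loss, hypothetical 5 Gbps bottleneck.
score = qos_score(68.9, 0.010, 5.0)
```

Because the weights sum to 1, the score equals 1 only for zero delay, zero loss, and a bottleneck equal to the full 10 Gbps link rate; a delay above 100 ms makes the delay term negative.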

Computational complexity $C_{comp}$: It represents the number of floating-point operations per unit time (FLOPs/s) and is decomposed into three major modules:

$C_{comp} = C_{\text{GCN-T}} + C_{PSO} + C_{PPO},$ (18)

$C_{\text{GCN-T}} = N_{node} \cdot F \cdot (L_{GCN} \cdot K_{hop} + L_{Trans} \cdot H_{head}),$ (19)

$C_{PSO} = N_{particle} \cdot I_{PSO} \cdot D_{param},$ (20)

$C_{PPO} = T_{traj} \cdot (N_{actor} + N_{critic}),$ (21)

where $N_{node}$ is the number of network nodes, $F$ is the node feature dimension, $L_{GCN}$ is the number of GCN layers, $K_{hop}$ is the number of feature aggregation hops, $L_{Trans}$ is the number of Transformer layers, $H_{head}$ is the number of attention heads, $N_{particle}$ is the number of particles, $I_{PSO}$ is the number of iterations, $D_{param}$ is the hyperparameter dimension (number of GCN layers, number of Transformer heads, learning rate, hidden layer dimension), $T_{traj}$ is the trajectory length, and $N_{actor}$ and $N_{critic}$ are the numbers of parameters of the Actor and Critic networks.
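Equations (18)–(21) are simple products and sums, so they can be tabulated directly. In the sketch below, only the node count (66 + 20 + 50 = 136), the 3 GCN layers, the 8 attention heads, and the 4 hyperparameter dimensions come from the text; every other argument value is a hypothetical placeholder chosen for illustration.

```python
def comp_complexity(n_node, feat_dim, l_gcn, k_hop, l_trans, h_head,
                    n_particle, i_pso, d_param, t_traj, n_actor, n_critic):
    """Eqs. (18)-(21): returns (C_GCN-T, C_PSO, C_PPO, C_comp)."""
    c_gcn_t = n_node * feat_dim * (l_gcn * k_hop + l_trans * h_head)  # Eq. (19)
    c_pso = n_particle * i_pso * d_param                              # Eq. (20)
    c_ppo = t_traj * (n_actor + n_critic)                             # Eq. (21)
    return c_gcn_t, c_pso, c_ppo, c_gcn_t + c_pso + c_ppo             # Eq. (18)

c_gcn_t, c_pso, c_ppo, c_comp = comp_complexity(
    n_node=136, feat_dim=16, l_gcn=3, k_hop=1, l_trans=2, h_head=8,
    n_particle=20, i_pso=50, d_param=4,
    t_traj=128, n_actor=10_000, n_critic=10_000)
```

With these placeholder values the PPO term dominates, which illustrates why the Actor and Critic parameter counts matter most for the computational budget in this model.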

Communication complexity $C_{comm}$: It represents the control signaling overhead (bits/s), which consists of two parts, state reporting and policy issuance:

$C_{comm} = N_{update} \cdot (S_{state} + S_{policy}),$ (22)

where $N_{update}$ is the strategy update frequency, $S_{state}$ is the size of a single-node state report, and $S_{policy}$ is the size of a single-path routing strategy instruction.

(2): Final formula of the weighted complexity metric:

$W_{tradeoff} = \frac{P_{perf}}{α \cdot C_{comp} + β \cdot C_{comm}},$ (23)

where $α = 0.6$ and $β = 0.4$ reflect the dominant position of computational complexity in routing decisions.

Normalization processing: For the convenience of horizontal comparison, the $W_{tradeoff}$ of each algorithm is divided by the corresponding value of the benchmark algorithm (Graph-Mamba) to obtain $W_{norm}$. $W_{norm} > 1$ indicates a better trade-off than the benchmark.
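The metric in Equations (22) and (23) and the normalization step can be sketched as follows. All numeric inputs in the example are hypothetical; in particular, the sketch assumes $C_{comp}$ and $C_{comm}$ have already been scaled to comparable magnitudes, since the text does not say how FLOPs/s and bits/s are made commensurable.

```python
def comm_complexity(n_update, s_state, s_policy):
    """Eq. (22): signaling overhead from state reports and policy issuance."""
    return n_update * (s_state + s_policy)

def tradeoff(p_perf, c_comp, c_comm, alpha=0.6, beta=0.4):
    """Eq. (23): QoS score per unit of weighted complexity."""
    return p_perf / (alpha * c_comp + beta * c_comm)

def normalize(scores, benchmark="Graph-Mamba"):
    """Divide every score by the benchmark's score to obtain W_norm."""
    ref = scores[benchmark]
    return {name: w / ref for name, w in scores.items()}

# Hypothetical, pre-scaled inputs for two algorithms.
scores = {
    "Graph-Mamba": tradeoff(0.50, 100.0, 50.0),
    "GCN-T-PPO": tradeoff(0.60, 100.0, 50.0),
}
w_norm = normalize(scores)
```

By construction the benchmark always maps to 1, so any algorithm with a value above 1 delivers more QoS per unit of weighted cost than Graph-Mamba.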

(3): Results and analysis:

Scenario: “High load (80%) + burst traffic (Pareto intensity = 2.0)”.

As shown in Table 9, the normalized trade-off index of the proposed algorithm, $W_{norm} = 1.21$, is significantly higher than that of all benchmark algorithms. This indicates that, despite a slightly higher computational complexity, the reduced communication complexity and the QoS gains yield a better overall trade-off, supporting the rationality of the “spatiotemporal prediction + attention mechanism” architecture.

5. Conclusions

This paper addresses the intelligent routing challenges faced by the Space–Air–Ground Integrated Network (SAGIN) under high dynamics, strong heterogeneity, and high QoS requirements, and proposes a routing algorithm (GCN-T-PPO) based on a hybrid GCN + Transformer encoder and PPO reinforcement learning. The core idea of this scheme is to use the GCN-T hybrid encoder (optimized by PSO) to perform high-precision spatiotemporal prediction of the future network state of SAGIN, and to enable the PPO agent (integrating spatiotemporal attention) to make proactive routing decisions based on this “future” information.

To validate the algorithm in a high-fidelity and highly competitive environment, we designed and executed a comprehensive parameterized experiment based on real datasets (CelesTrak, CAIDA, CRAWDAD). The experiment adopted a topology with 50 global IXP ground gateways and directly compared with SOTA baselines including DDPG-Routing, D2-RMRL, and Graph-Mamba.

The experimental results show the following:

Comprehensive performance: In extreme scenarios with high load (80%) and high burstiness (Pareto), the proposed scheme (GCN-T-PPO) significantly outperforms all SOTA baselines in all key QoS metrics. Compared with the second-best algorithm, Graph-Mamba (based on a SOTA GNN), the proposed scheme reduces the average delay by 18.6% (to 68.9 ms) and the packet loss rate by 73.6% (to 1.0%), and it improves the QoS satisfaction rate by 12.7% (to 91.5%).
Component effectiveness: Ablation experiments show that the combination of the GCN (spatial) and Transformer (temporal) is crucial for prediction accuracy. More importantly, the “spatiotemporal attention” mechanism is the key for the proposed scheme to outperform the Graph-Mamba-PPO baseline, improving the PPO decision efficiency (convergence reward) by about 18.8%.
Scalability: Thanks to the parameter sharing and inductive capability of the GNN, when the network scale expands, the proposed scheme exhibits SOTA-level computational overhead (CPU load %) and training convergence time (510.9 s, N = 500), significantly outperforming DDPG and D2-RMRL. Additionally, GCN-T-PPO demonstrates superior convergence performance compared to other algorithms.
Robustness: When facing a sudden failure of 10% of links, the recovery time of the proposed scheme (2 s) is comparable to that of the SOTA Meta-RL algorithm D2-RMRL (2 s), and both are much faster than the other reactive baselines (5–8 s), demonstrating extremely strong network resilience. Moreover, the hyperparameter set optimized by PSO exhibits strong robustness and generalization, which helps isolate the inherent advantages of the model architecture and supports fair performance comparisons across experimental scenarios.
Complexity–performance trade-off: This paper further validates the practical value of the GCN-T-PPO algorithm by introducing a weighted complexity metric: under the high load (80%) + burst traffic scenario, the normalized trade-off metric of this algorithm is 1.21, a significant improvement over the other algorithms.

In summary, this study confirms the effectiveness and advancement of the GCN-T-PPO algorithm. It not only outperforms traditional protocols and standard DRL but also demonstrates comprehensive performance advantages in direct comparison with SOTA algorithms designed for satellite networks (D2-RMRL) and spatiotemporal modeling (Graph-Mamba). This shows that our architecture combining “spatiotemporal prediction” and “attention-based reinforcement learning” is a SOTA solution to the dynamic intelligent routing problem of SAGIN. However, the experiments are limited to fault modes within predictable ranges. For entirely unpredictable abnormal events, such as sudden satellite failures or transient link interruptions, the model still relies on real-time retraining or meta-learning mechanisms for adaptation. Future work will focus on two directions: (1) accurately quantifying the communication overhead, computational overhead, and satellite–ground link bandwidth occupancy of this scheme and establishing an overhead–performance trade-off model; and (2) studying a distributed prediction framework based on federated learning to reduce real-time data interaction between G-CCC and satellite nodes and lower signaling overhead, further enhancing the scalability and survivability of the system.

Author Contributions

Conceptualization, S.L. and X.L.; Methodology, J.L., X.L., F.Z. and J.W.; Software, J.W.; Validation, J.L. and J.W.; Writing—original draft, J.L.; Writing—review & editing, S.L., X.L. and F.Z.; Supervision, S.L., X.L. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available from CelesTrak [https://www.space-track.org/ (accessed on 10 November 2025)].

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. SAGIN architecture diagram.

Figure 2. Algorithmic framework and dataflow of GCN-T-PPO.

Figure 3. Average end-to-end delay vs. network load.

Figure 4. Packet loss rate vs. network load.

Figure 5. QoS satisfaction rate vs. traffic bursts.

Figure 6. Computational overhead vs. total number of network nodes.

Figure 7. Training performance of models at different learning rates.

Figure 8. Comparison of training performance among different reinforcement learning algorithms.

Figure 9. Packet loss rate (PLR) vs. time (s) (10% ISL sudden failure).

Table 1. Notation Table.

Symbol	Description
$G$	Dynamic spatiotemporal graph
$V$	Node set (satellites, UAVs, ground nodes)
$E$	Edge set (communication links)
$T$	Time dimension
$A(t)$	Adjacency matrix at time $t$
$X(t)$	Node feature matrix at time $t$
$P$	Set of all routing paths
$d_{p}$	End-to-end latency of path $p$
$b_{p}$	Normalized minimum bandwidth of path $p$
$l_{p}$	Cumulative packet loss rate of path $p$
$w_{i}$	Weights for multi-objective optimization
$H^{(l)}$	Node embeddings at GCN layer $l$
$\hat{A}$	Normalized adjacency matrix
$Q, K, V$	Query, key, value matrices in attention
$d_{k}$	Dimension of keys in attention
$v_{i}(t)$	Velocity of particle $i$ at iteration $t$ in PSO
$x_{i}(t)$	Position of particle $i$ at iteration $t$ in PSO
$S$	State space in reinforcement learning
$A$	Action space in reinforcement learning
$R$	Reward function
$\alpha_{ij}$	Spatial attention weight between nodes $i$ and $j$
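
To make the roles of $\hat{A}$ and $H^{(l)}$ in Table 1 concrete, the following is a minimal sketch of a single graph-convolution step. It assumes the widely used symmetric normalization $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$ and a ReLU activation; the layer actually used in the paper is defined in Section 3.1 and may differ. The function names and the three-node toy graph are illustrative only.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix (assumed form)."""
    n = len(adj)
    # Add self-loops so that each node retains its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(a_hat, h, w):
    """One propagation step: H' = ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy path graph 0 - 1 - 2 with a one-dimensional feature on node 0 only.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_hat = normalize_adjacency(adj)
h0 = [[1.0], [0.0], [0.0]]
w = [[1.0]]  # hypothetical 1x1 weight matrix
h1 = gcn_layer(a_hat, h0, w)
```

In this toy case the feature of node 0 reaches its neighbour (node 1) after one layer, while node 2, two hops away, is still zero. Stacking layers widens the receptive field by one hop per layer, which is the usual motivation for a multi-layer GCN such as the 3-layer predictor listed in Table 2.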

Table 2. Key experimental parameter settings.

Category	Parameter	Value
Experimental platform	Core framework	Python 3.10 + PyTorch
Experimental platform	Topology generation	satgenpy, NetworkX
Space-tier network	Constellation	Iridium-like (Walker 66/6/11)
Space-tier network	Orbital altitude/inclination	780 km/86.4°
Space-tier network	TLE data source	CelesTrak (2025 data)
Space-tier network	Inter-satellite link (ISL)	Gbps (Laser)
Air-tier network	Number of nodes	20 UAVs
Air-tier network	Mobility model	Random walk (RW)
Air-tier network	Mobility speed	5–20 m/s
Air-tier network	Data source	CRAWDAD mobility statistics
Ground-tier network	Number of nodes	50 ground gateways (IXPs)
Ground-tier network	Topology role	Global backhaul gateway
Traffic model	Traffic arrival	Poisson distribution
Traffic model	Traffic burstiness	Pareto distribution
Traffic model	Traffic characteristics	CAIDA 100G link statistics
Algorithm model	Predictor (GCN)	3 layers
Algorithm model	Predictor (Transformer)	4 layers, 8 heads
Algorithm model	Reinforcement learning	PPO (clipped)
Algorithm model	Training epochs	1000 epochs
Algorithm model	Optimizer	Adam
Algorithm model	Discount factor (γ)	0.95
Algorithm model	PPO clipping parameter (ε)	0.2
Algorithm model	Hardware	NVIDIA RTX 4080 GPU
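
The two PPO settings in Table 2 (discount factor γ = 0.95 and clipping parameter ε = 0.2) can be illustrated with the generic PPO-clip formulation. The sketch below is not the authors' implementation; it only shows how these two constants enter the standard discounted return and the per-sample clipped surrogate objective. Function names and sample values are hypothetical.

```python
GAMMA = 0.95    # discount factor, from Table 2
EPSILON = 0.2   # PPO clipping parameter, from Table 2

def discounted_returns(rewards, gamma=GAMMA):
    """Compute G_t = r_t + gamma * G_{t+1}, working backwards through an episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def clipped_surrogate(ratio, advantage, eps=EPSILON):
    """Per-sample PPO-clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical three-step episode with unit rewards.
episode_returns = discounted_returns([1.0, 1.0, 1.0])

# A probability ratio of 1.5 with a positive advantage is capped at 1 + eps.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
```

With a positive advantage, raising the probability ratio beyond 1 + ε brings no additional objective value, which is what discourages overly large policy updates between iterations.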

Table 3. Average end-to-end delay vs. network load (50 ground gateways).

Network Load (%)	OSPF (ms)	DDPG-Routing (ms)	D2-RMRL (ms)	Graph-Mamba (ms)	Proposed Method (ms)
10	36.1	42.1	41.5	38.2	35.9
30	42.5	49.8	48.9	45.1	40.3
50	115.3	70.2	64.8	58.9	50.2
70	240.8	105.6	85.3	74.5	61.5
90	410.2	160.4	125.1	98.8	80.4
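
As a reading aid for Table 3, the relative delay reduction of the proposed method against each baseline can be recomputed from the tabulated values. The helper below is illustrative (its name and the dictionary layout are not from the paper) and uses only the 90% load row.

```python
# Average end-to-end delay (ms) at 90% network load, copied from Table 3.
delay_ms_at_90 = {
    "OSPF": 410.2,
    "DDPG-Routing": 160.4,
    "D2-RMRL": 125.1,
    "Graph-Mamba": 98.8,
    "Proposed": 80.4,
}

def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

# Reduction of the proposed method relative to every baseline, in percent.
reductions = {
    name: round(relative_reduction(value, delay_ms_at_90["Proposed"]), 1)
    for name, value in delay_ms_at_90.items()
    if name != "Proposed"
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower delay with the proposed method")
```

At this load level the reduction is roughly 80% relative to OSPF and roughly 19% relative to Graph-Mamba, the strongest baseline in the table.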

Table 4. Ablation Study on Predictor MSE.

Experimental Variant	Predictor Mean Squared Error
W/o PSO (Manual Tuning)	0.11
W/o GCN (Transformer Only)	0.25
W/o Transformer (GCN Only)	0.32
Proposed Method (GCN-T + PSO)	0.04

Table 5. Ablation study on PPO convergence reward.

Experimental Variant	Final Convergence Average Reward
W/o Spatiotemporal Attention	0.80
W/o Predictor	0.65
Proposed Method	0.95

Table 6. Computational overhead vs. total number of network nodes (N).

Total Number of Nodes (N)	OSPF (CPU %)	DDPG (CPU %)	D2-RMRL (CPU %)	Graph-Mamba (CPU %)	Proposed Method (CPU %)
50	10.5	9.2	18.5	5.8	5.9
100	22.1	18.1	40.2	8.3	9.6
200	48.9	38.5	85.1 (OOM)	13.5	15.7
500	85.3	80.2	Failed	26.1	29.8

Table 7. Algorithm convergence/adaptation time vs. total number of network nodes (N).

Total Number of Nodes (N)	OSPF (s)	DDPG (s)	D2-RMRL (s)	Graph-Mamba (s)	Proposed Method (s)
50	5.2	410.1	850.5 (Training)	320.4	305.1
100	12.8	980.2	25.1 (Adaptation)	410.8	380.6
200	30.1	1850.6	28.3 (Adaptation)	501.2	450.3
500	80.5	4100.3	35.8 (Adaptation)	620.5	510.9

Table 8. Hyperparameter robustness test.

Total Number of Nodes (N)	PSO Optimized Hyperparameters (MSE)	Manually Tuned Hyperparameters (MSE)	Random Hyperparameters (MSE)
50	0.038	0.105	0.286
100	0.042	0.112	0.295
200	0.045	0.128	0.312
500	0.048	0.142	0.338
Average MSE	0.043	0.122	0.308
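
The "Average MSE" row of Table 8 can be reproduced directly from the four per-scale values. The short check below does so and also computes the absolute MSE increase from N = 50 to N = 500 for each tuning strategy. Variable names are illustrative.

```python
from statistics import mean

# Predictor MSE for N = 50, 100, 200, 500, copied from Table 8.
mse = {
    "PSO": [0.038, 0.042, 0.045, 0.048],
    "Manual": [0.105, 0.112, 0.128, 0.142],
    "Random": [0.286, 0.295, 0.312, 0.338],
}

# Column averages, rounded to three decimals as in the table.
averages = {name: round(mean(values), 3) for name, values in mse.items()}

# Absolute increase in MSE when scaling from 50 to 500 nodes.
increase = {name: values[-1] - values[0] for name, values in mse.items()}
```

The recomputed averages match the table (0.043, 0.122, 0.308), and the PSO-optimized configuration shows the smallest absolute degradation as the network grows (0.010, against 0.037 for manual tuning and 0.052 for random hyperparameters).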

Table 9. Trade-off performance comparison of the algorithms.

Algorithm	$P_{perf}$	$W_{norm}$
OSPF	0.32	0.23
DDPG-Routing	0.65	0.14
D2-RMRL	0.78	0.10
Graph-Mamba	0.85	1.00
Proposed Method	0.92	1.21

Share and Cite

MDPI and ACS Style

Liu, J.; Li, S.; Li, X.; Zhang, F.; Wang, J. Intelligent Routing Optimization via GCN-Transformer Hybrid Encoder and Reinforcement Learning in Space–Air–Ground Integrated Networks. Electronics 2026, 15, 14. https://doi.org/10.3390/electronics15010014
