Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms

Chen, Yitian; Tong, Yinghua

doi:10.3390/fi17120542

Open AccessArticle

Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms

by

Yitian Chen

¹ and

Yinghua Tong

^1,2,*

¹

School of Computer, Qinghai Normal University, Xining 810008, China

²

The State Key Laboratory of Tibetan Intelligence, Qinghai Normal University, Xining 810008, China

^*

Author to whom correspondence should be addressed.

Future Internet 2025, 17(12), 542; https://doi.org/10.3390/fi17120542

Submission received: 28 October 2025 / Revised: 16 November 2025 / Accepted: 25 November 2025 / Published: 27 November 2025

(This article belongs to the Special Issue Task Offloading and Resource Allocation for IoT in Next-Generation Networking)

Download

Browse Figures

Versions Notes

Abstract

The sixth-generation (6G) system has been attracting increasing attention from both industry and academia, with the space–air–ground integrated network (SAGIN) identified as one of its key applications. This study investigates a SAGIN framework tailored for deployment in remote areas. To address the differing needs of users with emergency and routine tasks, an offloading strategy is proposed that enables direct offloading for emergency tasks and optimized UAV-assisted offloading for routine tasks. Additionally, considering the limited satellite coverage duration, a reliability mechanism for task offloading is designed. The study formulates a task offloading optimization problem aimed at maximizing the completion rate of routine tasks—while reducing their energy consumption and latency—under the premise of guaranteeing the completion of emergency task offloading. The problem is modeled as a Markov Decision Process (MDP). To solve it, a D-MAPPO reinforcement learning algorithm is proposed, which integrates the Dirichlet distribution with the Multi-Agent Proximal Policy Optimization (MAPPO) framework. Simulation results show that, compared with the MAPPO and PPO algorithms, the delay is reduced by 38% and 31%, respectively, while the energy consumption is reduced by 7% and 48%, respectively.

Keywords:

6G; space–air–ground integrated network; task offloading; reliability; reinforcement learning

Graphical Abstract

1. Introduction

With the ongoing development of sixth-generation (6G) wireless communication systems, aerial access networks and space–air–ground integrated networks (SAGINs) have become major focal points for both industry and academia [1,2]. In recent years, significant research efforts have been dedicated to exploring the application potential of SAGIN in domains such as the Internet of Things (IoT) [3], cognitive communication [4], and edge computing [5], particularly in expanding its use cases in remote areas. By integrating satellites in space, unmanned aerial vehicles (UAVs) in the air, and terrestrial base stations and data centers, SAGIN constructs a cross-domain collaborative network architecture, offering critical support for achieving seamless global connectivity [6]. In particular, in remote areas such as deserts, oceans, and sparsely populated regions—where terrestrial base stations are scarce and costly to deploy—traditional communication networks often fail to provide adequate coverage. Meanwhile, data collected from environmental monitoring in these regions hold significant value for global information systems. Owing to its extensive coverage, high flexibility, and reliability, SAGIN is considered a promising solution to enable real-time data collection and transmission for the Internet of Remote Things (IoRT), and to address the connectivity challenges in remote environments [7].

Companies such as SpaceX and OneWeb are accelerating the deployment of large-scale low Earth orbit (LEO) satellite constellations, redefining network architecture to enable low latency, high capacity, and global services [8,9,10]. This trend offers new opportunities for building global connectivity. Compared to medium Earth orbit (MEO) and geostationary Earth orbit (GEO) satellites, LEO satellites are closer to the Earth and thus better suited for latency-sensitive communications. LEO satellites are rapidly advancing in terms of capacity and coverage [11,12,13,14]. Their low-latency characteristics make them ideal for IoT applications. The rise of large-scale LEO constellations is fundamentally reshaping network structures to achieve the goals of low latency, high capacity, and global coverage. This lays a solid foundation for realizing a truly interconnected world, provides reliable communication for remote areas, and helps bridge the digital divide. Moreover, Mobile Edge Computing (MEC), a revolutionary concept aimed at enhancing the low-latency and high-bandwidth capabilities of communication systems, has also been integrated into the SAGIN architecture to enable proximal deployment of computing resources and satisfy real-time processing and high-speed transmission requirements [15].

Despite SAGIN’s vast development potential, its practical implementation faces multiple challenges. Due to the highly dynamic nature of satellite communications and the inherent resource limitations, the network is prone to interruptions and data loss. Therefore, there is an urgent need to build highly reliable and stable systems to ensure communication quality [16]. At the same time, latency and energy consumption remain key optimization targets. In remote scenarios, user requirements differ significantly across applications. For instance, emergency rescue and telemedicine demand extremely low latency, while environmental monitoring and resource exploration focus more on energy efficiency and coverage. Balancing these heterogeneous demands while improving overall latency and energy performance has become a central research issue.

In edge computing environments, end devices often have limited computing capabilities, and edge servers themselves face constraints in both number and resources. When the volume of offloading requests exceeds the processing threshold of edge nodes, service quality inevitably degrades—leading to increased task failure rates and service latency [17]. To address this issue, researchers have proposed various resource allocation and task scheduling optimization strategies, aiming to alleviate edge overload through the design of effective offloading decision models [18]. Furthermore, UAV-based auxiliary computing platforms have been introduced to expand the service scope and capacity of edge computing, forming multi-level air–ground collaborative offloading systems. However, due to limitations in spectrum resources and node capacity—particularly in extreme environments such as disaster zones, oceans, and deserts—edge computing still faces significant challenges in providing adequate coverage and computational support for dense IoT terminals in remote areas [19,20].

In conclusion, SAGIN research aimed at remote regions must not only address fundamental connectivity issues but also optimize latency and energy performance across diverse application scenarios. It is equally critical to take into account the limited computing resources of edge nodes when designing efficient and intelligent task offloading and resource management mechanisms. In the future, by enabling intelligent coordination and scheduling of space–air–ground resources, SAGIN’s integrated performance in remote applications is expected to improve significantly, promoting a more balanced development of global communication networks.

In this study, we investigate a SAGIN that combines hybrid cloud services with mobile edge computing (MEC). In our design, both cloud servers and satellites are regarded as components of the MEC infrastructure, while UAVs are designated as the primary decision-making agents. The objective is to enhance the overall quality of service (QoS) by enabling UAVs to select optimal task offloading strategies. Beyond the conventional network parameters such as bandwidth and computational capacity, we further consider the limitations on the number of tasks that can be accepted by cloud servers and satellites. This allows for a more comprehensive analysis of how these constraints affect offloading decisions and overall network performance. To address the diverse requirements of different task types, we classify tasks into two categories: urgent tasks and normal tasks. Correspondingly, we design two distinct offloading strategies to ensure multi-tier QoS guarantees. Moreover, while conventional deep reinforcement learning (DRL) algorithms have demonstrated adaptability in dynamic environments, most rely on discrete action spaces, thereby limiting optimization efficiency. To overcome this, we propose a novel multi-agent reinforcement learning algorithm—D-MAPPO—by integrating the Dirichlet distribution with Multi-Agent Proximal Policy Optimization (MAPPO). This approach leverages continuous action spaces, enabling more efficient optimization in dynamic environments. This approach leverages continuous action spaces, enabling more efficient optimization in dynamic environments. The main contributions of this work are summarized as follows:

A SAGIN suitable for remote areas is proposed. Considering the unique application requirements of remote regions, computational tasks are classified into normal and urgent tasks to address diverse demands under different scenarios.
To address the limited satellite coverage time and improve communication quality, a reliability mechanism for task offloading from ground sensors and UAVs to satellites is proposed. In addition, the communication process, satellite coverage duration, and overall network cost are modeled within the proposed framework.
Based on the distinct characteristics of urgent and normal tasks, a computation offloading problem is formulated to jointly optimize network energy consumption and latency. The optimization problem is modeled as a Markov Decision Process (MDP). Considering the multidimensional offloading decision space and specific constraints, a Dirichlet-based Multi-Agent Proximal Policy Optimization (D-MAPPO) algorithm is proposed to enable the learning of optimal task offloading strategies.
Extensive simulation results demonstrate that the D-MAPPO algorithm achieves faster and more stable convergence. Moreover, it consistently outperforms benchmark methods—including Beta-MAPPO, PPO, Local, Offloading, and Random—in terms of latency reduction, energy efficiency, and offloading success rate.

The remainder of this paper is organized as follows: Section 2 reviews related work in the field of SAGIN. Section 3 presents the proposed system model. Section 4 formulates and analyzes the offloading optimization problem. Section 5 introduces and implements the MAPPO algorithm. In Section 6, we analyze the experimental results. Finally, Section 7 concludes the paper.

2. Related Work

As a core architecture to support future 6G communications and edge computing, SAGINs demonstrate great potential in applications such as emergency communications, large-scale IoT, and vehicular networks. Existing studies have extensively explored the design of network architectures, task offloading mechanisms, and algorithmic solutions to balance critical performance metrics such as energy consumption, latency, and robustness. This chapter systematically reviews relevant progress from two perspectives: computation offloading and algorithm design.

2.1. Task Offloading in SAGIN

Reference [21] proposes a task offloading model based on deep reinforcement learning and user experience. After a user generates a blockchain task, a Proof-of-Work (PoW) consensus mechanism is introduced to package transaction information into blocks, ensuring the authenticity and reliability of interactions within the system. By incorporating total system delay and user rewards, the concept of user experience is defined, and an optimal user experience objective is formulated.

A previous study [22] investigates data collection for IoT sensors using UAVs assisted by low Earth orbit (LEO) satellites. Specifically, UAVs collect data from IoT sensors and transmit it back to Earth via two modes: (1) a delay-tolerant mode using UAV-carried storage and (2) a delay-sensitive mode through the UAV–satellite communication network.

Another study [23] addresses communication challenges in remote areas lacking cellular infrastructure. It explores the role of UAVs in SAGIN, where UAVs act as control-layer nodes due to their flexible deployment and relay capabilities. Equipped with millimeter-wave radars and visual sensors, UAVs gather multisource data to reduce uncertainty and improve decision accuracy. Meanwhile, UAVs collect computation tasks from mobile devices within their coverage areas and offload them to other processing units to enhance computational efficiency.

Considering the unreliable aerial communication environment in remote regions, the offloading process for power-constrained IoT devices remains challenging. A previous study [24] proposes an energy-efficient EC-SAGIN architecture, where IoT devices select the most suitable LEO satellite or UAV for task offloading based on their energy levels, communication conditions, and computing capabilities. Similarly, Ref. [25] highlights the critical role of computation offloading and resource allocation under limited satellite communication and computing resources. By jointly optimizing offloading decisions and allocating wireless and computational resources, the approach minimizes system energy consumption while meeting latency constraints.

The authors of [26] study computation offloading in large-scale MIMO-enabled MEC systems, where mobile devices with limited resources offload tasks to high-performance edge servers. The objective is to minimize both the power consumption and offloading delay of mobile devices under stochastic network conditions.

Overall, the above studies are either conducted under general remote-area scenarios without considering emergency offloading situations or lack reliability mechanisms to ensure stable transmission of control information. Consequently, there remains a need for a network architecture and computation offloading mechanism that can operate efficiently under normal conditions while providing rapid responsiveness during emergencies in remote regions.

2.2. Optimization Algorithms

A previous study [27] proposes a hybrid algorithm combining Particle Swarm Optimization (PSO) and a Greedy Strategy (GS), referred to as PSO&GS, to minimize the system’s average response delay and obtain a near-optimal solution. Extensive simulations verify the convergence of the proposed algorithm, and numerical results show that it achieves excellent convergence performance, reducing the average response delay to about 0.65–0.85 times that of the baseline algorithms.

Another study [28] aims to minimize the overall energy consumption and delay during data aggregation and computation processes. A multi-agent deep reinforcement learning algorithm is employed, integrating value decomposition and double deep Q-network (DQN) techniques to optimize data aggregation and enable a cost-effective cooperative offloading process. Experimental results demonstrate that, compared with traditional approaches, the proposed method reduces training time, data processing volume, energy consumption, and task duration by 20%, 11.4%, 5.6%, and 11.2%, respectively, while serving up to 98% of IoT devices.

The authors of [29] develop a novel deep risk-sensitive reinforcement learning algorithm. Specifically, it evaluates the risk associated with each state and jointly learns optimal parameters to balance delay minimization and risk control. Simulation results show that, compared with probabilistic configuration methods, the proposed approach reduces task processing delay by 30% while satisfying UAV energy constraints.

In [30], the authors model the ground power facility (GPF) association and power control subproblems as a multi-agent, time-varying K-armed bandit problem. A joint optimization algorithm based on multi-agent temporal difference (TD) reinforcement learning is proposed, alternately optimizing the two subproblems. Results demonstrate that, under various conditions (including different noise power levels, GPF bandwidths, and GPF quantities), the proposed method improves overall system energy efficiency by 16.23%, 86.29%, and 5.11% compared with three baseline algorithms (random path, average transmission power, and random device association).

Another study [31] introduces an asynchronous federated learning-based adaptive collaborative aggregation algorithm (AFLS), enabling participating nodes to update model parameters dynamically during aggregation. Experiments show that with effective collaboration among nodes, the proposed framework achieves outstanding performance in image classification tasks, reaching 95% accuracy with fewer communication rounds.

Ref. [32] presents a reinforcement learning-based control framework for cascade formation of perturbed surface vehicles. The framework aims to achieve precise trajectory tracking while minimizing a predefined cost function to enhance system performance. Theoretical analysis ensures that both individual agents and the overall closed-loop control system satisfy the uniformly ultimately bounded (UUB) condition for tracking and learning errors. Results confirm that the proposed approach exhibits excellent robustness and feasibility in formation control under disturbed environments.

However, traditional intelligent algorithms and single-agent reinforcement learning methods can no longer meet the optimization requirements of large-scale network environments. In existing multi-agent reinforcement learning research, various constraints in offloading decisions are typically learned implicitly through network training. Nevertheless, there is still a lack of algorithms explicitly designed to adapt to action space constraints and structurally integrate these limitations within the learning process.

3. System Model

This section describes the overall network architecture and its mathematical models. Table 1 provides explanations for all parameters in this section.

3.1. Network Architecture

As shown in Figure 1, for urgent tasks, the sensor directly offloads them to the satellite for processing, and the satellite returns the computation results directly to the sensor. For normal tasks, they are first collected by UAVs. The sensors upload the tasks to the UAVs, which, based on the task offloading strategy, may then offload the tasks to either satellites or cloud servers. Each UAV supports bidirectional communication with both sensors and cloud servers. Note that since a return link between the satellite and the sensor has already been established, the satellite does not need to send the computation results back to the UAV. Therefore, the communication between the UAV and the satellite is unidirectional.

Let I denote the set of sensors, where

i \in I

; U denote the set of UAVs, where

u \in U

; S denote the set of satellites, where

s \in S

; and C denote the cloud server, which is assumed to be a single entity. Let M represent the set of tasks, where

m \in M

, with

M^{n}

denoting normal tasks and

M^{u r}

representing urgent tasks.

3.2. Task Model

In practical environmental monitoring IoT systems, the urgency levels of tasks vary across different scenarios. Based on their urgency, tasks are categorized into two types:

(1): Urgent Tasks: Tasks triggered by emergency events such as wildfires, earthquakes, or other sudden incidents are defined as urgent tasks. Among all tasks, urgent tasks represent a small proportion but must be processed with the highest priority. When there are too many tasks to unload, priority should be given to unloading the completion rate of urgent tasks.
(2): Normal Tasks: Most computational tasks generated by ground sensors are classified as normal tasks. Hence, optimizing normal tasks is the central focus of computation offloading in the system. Given that normal tasks are generally delay-tolerant, a joint optimization of energy consumption and latency is desirable.

3.3. Communication Model

(1): Communication Between UAVs and Ground Sensors: According to [33], the air-to-ground communication channel depends on the UAV’s altitude, the elevation angle, and the propagation environment [34]. As described in [35], the average path loss for the air-to-ground channel can be expressed as follows:

$\begin{matrix} P^{loss} (r, h) = 20 log (\frac{4 π f_{c} {(h^{2} + r^{2})}^{\frac{1}{2}}}{c}) + P^{Los} η^{Los} + (1 - P^{Los}) η^{NLos}, \end{matrix}$

(1)

where $P^{L o s}$ denotes the probability of a line-of-sight (LoS) link between the UAV and the ground devices(GDs); h is the UAV altitude and r is the horizontal distance to the sensor; $η^{L o s}$ and $η^{N L o s}$ are the additional losses associated with LoS and non-LoS (NLoS) conditions [36]. $f_{c}$ is the carrier frequency, and ccc is the speed of light. According to [37], the values in remote areas are $(η^{L o s}, η^{N L o s}) = 0.1, 2.1$ .
We assume that the communication link between the sensor and the UAV operates in the C-band spectrum. The maximum transmission rate for the ground sensor $(R_{I})$ and the UAV $(R_{U})$ is given by

$\{\begin{matrix} R_{I} = B_{I} {log}_{2} (1 + \frac{P_{I} \cdot 10^{\frac{P^{loss}}{10}}}{σ^{2}}) \\ R_{U} = B_{U} {log}_{2} (1 + \frac{P_{U} \cdot 10^{\frac{P^{loss}}{10}}}{σ^{2}}), \end{matrix}$

(2)

where $B_{I}$ and $B_{U}$ are the bandwidths of the ground sensor and the UAV, respectively. It is worth noting that device bandwidths are not fixed and may vary dynamically [38]. $P_{I}$ and $P_{U}$ represent the transmit power of the sensor and the UAV, respectively, while $σ^{2}$ denotes Gaussian noise power.
(2): Communication Between UAVs and the Cloud Server: The cloud server is also deployed on the ground, so its communication model with UAVs is similar to that between UAVs and sensors. The main difference is that the cloud server not only receives task data but also sends the processed results back to the UAV, which then delivers them to GD. Therefore, the maximum transmission rate $R_{c}$ between the UAV and the cloud server is given by

$R_{c} = B_{c} {log}_{2} (1 + \frac{P_{c} 10^{\frac{P^{loss}}{10}}}{σ^{2}}),$

(3)

where $B_{c}$ is the bandwidth of the cloud server, and $P_{c}$ is its transmit power.
(3): Communication between UAV and ground sensors and satellites: The communication links between UAV and satellites are mainly based on clear line-of-sight (LoS) links, supplemented by a small number of NLoS links. We model the communication channel between UAV and GD and satellites as a Rician channel [39,40]; therefore, the channel gains of LoS and NLoS are integrated, so the channel coefficients between UAV and ground sensors and satellites $ξ = \sqrt{\frac{F}{1 + F}} ξ^{L o S} + \sqrt{\frac{{(d_{s, u})}^{- α}}{1 + F}} ξ^{N L o S}$ , where F is the Rician factor, $α$ denotes the distance attenuation factor, and $ξ^{L o S}$ and $ξ^{N L o S}$ denote the LoS and NLoS channel gains between satellites and communication devices, respectively. Finally, the maximum transmission rates $R_{U, S}$ and $R_{I, S}$ from UAVs and ground sensors to the satellite, as well as the satellite’s maximum transmission rate $R_{S}$ , can be determined as follows:

$\{\begin{matrix} R_{U, S} & = B_{U} l o g_{2} (1 + \frac{P_{U} G_{0} {| ξ |}^{2}}{N_{0} B_{U}}), \\ R_{I, S} & = B_{I} l o g_{2} (1 + \frac{P_{I} G_{0} {| ξ |}^{2}}{N_{0} B_{U}}), \\ R_{S} & = B_{S} l o g_{2} (1 + \frac{P_{S} G_{0} {| ξ |}^{2}}{N_{0} B_{S}}), \end{matrix}$

(4)

where $B_{S}$ is the available bandwidth of the satellite, $G_{0}$ is the fixed antenna gain. $P_{s}$ denotes the transmit power of the satellite, and $N_{0}$ is the spectral density of the additional white Gaussian noise (AWGN).

3.4. Satellite Coverage Model

Considering the dynamic characteristics of low Earth orbit (LEO) satellites, sensors and UAVs can communicate with satellites only when they are within the satellite’s coverage area. Therefore, it is essential to model the coverage time of these satellites as a fundamental reference for making informed task offloading decisions. Figure 2 illustrates the geometric relationship between LEO satellites and GD, based on which the LEO satellite coverage time model is derived. It is worth noting that the satellites are positioned at an altitude of 300 km above the Earth’s surface, while UAVs operate at altitudes below 100 m. Given that the UAVs’ altitude is negligible compared to that of the satellites, the UAV height is disregarded in the satellite coverage model. Thus, UAVs and sensors share the same satellite coverage model, and in this context, both UAVs and sensors are collectively referred to as GD.

In Figure 2,

d_{E}

represents the Earth’s radius,

d_{o}

denotes the satellite’s orbital altitude, and

d_{G S}

is the distance between the GD and the satellite. According to [41], the elevation angle

θ_{c}

of communication between ground equipment and the satellite can be expressed as follows:

θ_{G} = arccos (\frac{d_{E} + d_{o}}{d_{G S}} sin θ_{c}),

(5)

where

θ_{c}

is the coverage angle of the LEO, denoted as follows:

θ_{c} = arccos (\frac{d_{E}}{d_{E} + d_{O}} c o s θ_{G}) - θ_{G} .

(6)

From this, we can calculate the coverage arc length of the satellite

L_{S}

as

L_{S} = 2 (d_{E} + d_{o}) θ_{c} .

(7)

Therefore, the coverage arc length of the satellite can be calculated, and then the coverage time of the satellite

T_{S}

can be calculated:

T_{S} = \frac{L_{S}}{V_{S}},

(8)

where

V_{S}

is the satellite’s flight speed.

3.5. Computational Model

In the operational environment considered in this study, tasks can be processed by UAVs, satellites, or cloud servers, while sensors are not responsible for task processing. For a given task mmm generated by a sensor, we define

m = \{φ_{m}, ρ_{m}\}

, where

φ_{m}

represents the data size of the task, and

ρ_{m}

denotes the computational complexity, i.e., the number of CPU cycles required to process one bit of data. Normal tasks are first collected by UAVs and then processed based on the UAVs’ offloading decisions. The offloading decision of UAV u is defined as

(μ_{(u, u)}, μ_{(u, C)}, μ_{(u, s)}), μ_{(u, u)} + μ_{(u, C)} + μ_{(u, s)} = 1

. Here,

μ_{(u, u)}

represents the proportion of tasks computed locally by UAV u,

μ_{(u, C)}

denotes the proportion of tasks offloaded to the cloud server, and

μ_{(u, s)}

represents the proportion of tasks offloaded to the satellite. In contrast, urgent tasks are directly offloaded by the sensors to the satellites for processing, with the results subsequently transmitted back to the sensors by the satellites.

The following section presents an analysis of the computational models for the processing devices within the network, including UAVs, satellites, and cloud servers.

(1): UAV computation: Each UAV collects tasks offloaded from multiple sensors. However, since UAVs have limited computational resources, offloading a portion of tasks is necessary. Let $N_{i}^{n}$ denote the total number of normal tasks generated by sensor iii; thus, the total number of tasks collected by UAV u is $N_{u} = \sum_{i}^{I} N_{i}^{n}$ . The transmission time for UAV u to collect all tasks is given by

$T_{u}^{t r a n} = \frac{b \sum_{x}^{N_{u}} φ_{x}^{n}}{R_{I}}$

(9)

where $b > 1$ is the transmission overhead factor [42]. The total transmission energy consumption for UAV u to collect all normal tasks is as follows:

$E_{u}^{t r a n} = P_{I} T_{u}^{t r a n} .$

(10)

According to the above equations, the number of tasks computed locally by UAV u can be represented as $⌊μ_{(u, u)} N_{u}⌋$ , the total local computational latency of UAV u is

$T_{u}^{c o m p} = \frac{\sum_{x}^{⌊ μ_{(u, u)} N_{u} ⌋} φ_{x}^{n} ρ_{x}^{n}}{f_{u}},$

(11)

where $φ_{x}^{n}$ and $ρ_{x}^{n}$ represent the data size and computational complexity of the x-th task, respectively. $f_{u}$ denotes the computational capability of UAV u. The total local computational energy consumption of UAV u is:

$E_{u}^{c o m p} = τ f_{u}^{2} \sum_{x}^{⌊ μ^{(u, u)} N_{u} ⌋} φ_{x}^{n} ρ_{x}^{n}$

(12)

where $τ$ is the energy coefficient and depends on the CPU structure of the UAV u. When the calculation is complete, the UAV u sends the resultant data back to the sensor with a return time of

$T_{u, i}^{b a c k} = \frac{b φ_{u}^{r e}}{R_{u, i}},$

(13)

where $φ_{u}^{r e}$ denotes the total data volume of the UAV u result data. The energy consumption of the UAV u return data is as follows:

$E_{u}^{b a c k} = P_{U} T_{u}^{b a c k} .$

(14)
(2): Cloud server computation: The cloud server processes tasks offloaded from multiple UAVs. Since the cloud server’s computational resources are limited, the total number of tasks received by the cloud server is $N_{C} = \sum_{u}^{U} μ_{(u, C)} N_{u}$ . The transmission time and the energy consumption for the cloud server to receive UAV-offloaded tasks are denoted as $T_{c}^{T}$ and $E_{c}^{T}$ , respectively:

$T_{c}^{t r a n} = \frac{b \sum_{x}^{N_{C}} φ_{x}^{n}}{R_{U}},$

(15)

$E_{C}^{t r a n} = P_{U} T_{C}^{t r a n} .$

(16)

The computational latency $T_{C}^{C}$ and computational energy consumption $E_{C}^{C}$ of the cloud server are

$T_{c}^{c o m p} = \frac{\sum_{x}^{N_{C}} φ_{x}^{n} ρ_{x}^{n}}{f_{c}},$

(17)

$E_{u}^{c o m p} = τ f_{C}^{2} \sum_{x}^{N_{C}} φ_{x}^{n} ρ_{x}^{n},$

(18)

where $f_{C}$ is the computational capability of the cloud server. The cloud server needs to return the computation result of the task to the UAV, and the return delay $T_{C}^{B a c k}$ and the energy consumption $E_{C}^{B a c k}$ of the cloud server will be

$T_{C}^{B a c k} = \frac{b φ_{C}^{r e}}{R_{C}},$

(19)

$E_{C}^{B a c k} = P_{C} T_{C}^{B a c k} .$

(20)
(3): Satellite computation: The satellite not only has to deal with the tasks offloaded by the UAV locally, but also has to deal with the urgent tasks uploaded by the sensors. When satellites receive offloaded tasks, urgent tasks are prioritized to ensure timely processing. The total number of normal tasks received by satellite s is $N_{s} = \sum_{u}^{U} μ_{(u, s)} N_{u}$ , where $N_{i}^{u r}$ represents the total number of urgent tasks generated by sensor iii. The transmission time and energy consumption for satellite s to receive both urgent and normal tasks are represented by $T_{s}^{t r a n}$ and $E_{s}^{t r a n}$ , respectively.

$\begin{matrix} T_{s}^{t r a n} = T_{s}^{t r a n, u r} + T_{s}^{t r a n, n} = \frac{b \sum_{i}^{I} \sum_{x}^{N_{i}^{u r}} φ_{x}^{u r}}{R_{I, S}} + \frac{b \sum_{x}^{N_{s}} φ_{x}^{n}}{R_{U, S}} \end{matrix}$

(21)

$E_{i, s, m}^{t r a n, u r} = P_{i, s} T_{s}^{t r a n, u r} + P_{u, s} T_{s}^{t r a n, n} .$

(22)

The computation delay and energy consumption of satellite s for processing urgent and normal tasks are denoted as $T_{s}^{c o m p}$ and $E_{s}^{c o m p}$ , respectively:

$T_{s}^{c o m p} = \frac{\sum_{l}^{I} \sum_{x}^{N_{l}^{u r}} φ_{x}^{u r} ρ_{x}^{u r} + \sum_{x}^{N_{s}} φ_{x}^{n} ρ_{x}^{n}}{f_{s}}$

(23)

$E_{s}^{c o m p} = τ f_{s}^{2} (\sum_{i}^{I} \sum_{x}^{N_{i}^{u r}} φ_{x}^{u r} ρ_{x}^{u r} + \sum_{x}^{N_{s}} φ_{x}^{n} ρ_{x}^{n})$

(24)

where $f_{s}$ represents the computational capability of satellite s. The satellite directly returns the computation results to the sensors, and the backhaul delay and energy consumption of satellite s are expressed as $T_{s}^{b a c k}$ and $E_{s}^{b a c k}$ , respectively:

$T_{s}^{b a c k} = \frac{b φ_{s}^{r e}}{R_{s}}$

(25)

$E_{s}^{b a c k} = P_{s} T_{s}^{b a c k}$

(26)

where $φ_{s}^{r e}$ denotes the amount of result data returned by satellite s.

3.6. Reliability Mechanisms

A task is deemed to have been offloaded successfully only when every single bit has been uploaded without fail. However, given that the coverage time of a satellite is limited, there is a possibility that task offloading to a satellite might end in failure. Should this happen to an urgent task, the consequences would be nothing short of catastrophic.

To address this, a two-pronged approach is proposed. First, a lower bound is established for the maximum transmission rate of the device, which can be determined through the satellite coverage model. Second, when each device proceeds to offload tasks to the satellite, it first calculates the maximum number of tasks that can be offloaded. In this process, priority is given to offloading urgent tasks.

From Section 3.4, we can determine the satellite coverage time

T_{S}

. The total transmission time for satellite mission data should be less than the satellite coverage time, i.e.,

T_{s}^{t r a n} = \frac{φ_{s}^{t r a n}}{R_{I, S}} < T_{s}

(27)

where

φ_{S}^{t r a n}

is the total data volume transmitted to the satellite; thus, it can be concluded that the limiting condition for the maximum amount of data transmitted by the UE is

φ_{s}^{t r a n} < R_{I, S} T_{s} .

(28)

4. Problem Analysis

This study considers both delay and energy consumption for normal and urgent tasks in SAGIN. For urgent tasks, the primary concerns are whether the task can be completed and how quickly it can be completed. In contrast, normal tasks focus on minimizing energy consumption and delay, provided that the task is successfully completed. According to the proposed task offloading mechanism, urgent tasks are directly offloaded to satellites for processing, where energy consumption and delay are disregarded, and only the task completion is considered. Moreover, for both urgent and normal tasks, successful offloading to the designated processing node is deemed sufficient for task completion. Therefore, the analysis focuses on the offloading success rate of all tasks. The offloading success rate

δ

can be expressed as follows:

δ = \frac{N^{s u}}{N^{u p}},

(29)

Here,

N^{s u}

and

N^{u p}

represent the total number of successfully offloaded tasks and the total number of offloaded tasks, respectively. Based on the above, the task offloading optimization problem can be formulated as follows: under the constraint of ensuring a high completion rate for urgent task offloading, the goal is to improve the completion rate of normal tasks while minimizing their energy consumption and delay. The optimization problem is formulated as Problem P1 :

\begin{matrix} (P 1) : & \underset{μ}{m i n} (E + T) a n d m a x (δ^{n}) \\ s . t . & C 1 : B_{m i n} \leq B \leq B_{m a x} \\ C 2 : φ < R_{u, s} T_{s} \\ C 3 : f_{m i n} \leq f \leq f_{m a x} \\ C 4 : 0 \leq μ_{(u, u)} \leq 1 \\ C 5 : 0 \leq μ_{(u, C)} \leq 1 \\ C 6 : 0 \leq μ_{(u, s)} \leq 1 \\ C 7 : μ_{(u, u)} + μ_{(u, C)} + μ_{(u, s)} = 1 \\ C 8 : δ^{u r} = 1 \\ C 9 : φ_{m i n}^{n} \leq φ^{n} \leq φ_{m a x}^{n} \\ C 10 : ρ_{m i n}^{n} \leq ρ^{n} \leq ρ_{m a x}^{n} \end{matrix}

(30)

where

φ^{n}

indicates the success rate of offloading common tasks.

C 1

denotes that the bandwidth resources of all the devices in the network are uncertain, but are within the limits.

C 2

denotes that the maximum amount of mission data from the UAV to the satellite should satisfy the reliability constraint. Constraint

C 3

specifies that the computational capacities of all computing devices in the network are uncertain but within a limited range. Constraints

C 4

–

C 6

ensure that the offloading ratios in the offloading strategy remain within (0, 1), while

C 7

requires that the sum of each UAV’s offloading decisions equals 1, noting that UAVs may offload computation tasks to multiple satellites. Constraint

C 8

guarantees a 100% success rate for urgent task offloading, and Constraints

C 9

–

C 10

ensure that the task size and complexity of normal tasks remain within their respective minimum and maximum bounds.

5. D-MAPPO Algorithm Design

To address the computation offloading optimization problem described above, we propose a reinforcement learning algorithm based on the Dirichlet probability distribution and Multi-Agent Proximal Policy Optimization (MAPPO). This algorithm demonstrates strong adaptability to continuous and dynamic environments, making it particularly well-suited for the SAGIN scenario considered in this study.

5.1. MDP Design

The Markov Decision Process (MDP) is a mathematical framework commonly used to model sequential decision-making problems and is widely applied in reinforcement learning (RL). An MDP characterizes the interaction between an agent and its environment through states, actions, state transitions, and a reward function, enabling the agent to learn an optimal decision-making policy [Reinforcement Learning: An Introduction]. In the context of computation offloading in space–air–ground integrated networks (SAGINs), the problem can be formulated as an MDP to optimize offloading strategies, thereby improving computational efficiency while reducing energy consumption and delay. In this study, we define the total operation time of the SAGIN system as T, which is further divided into t discrete time slots.

(1): State Space: The state space defines all possible states of the environment and serves to describe the current condition of the system, thereby providing a basis for the agent’s decision-making. In the context of the computation offloading problem, the state space should adequately represent the availability of computational resources and the execution status of tasks, enabling the agent to select an appropriate offloading strategy accordingly. In this study, each UAV is modeled as an individual agent. The state $s_{t}$ of each agent at time slot t is defined as follows:

$\begin{matrix} s_{t} = (f_{C}, B_{C}, ψ, f_{s}, B_{s}, θ_{c, s}, \dots, f_{S}, B_{S}, θ_{c, S}, f_{u}, B_{u}, N^{n}, N^{u r}, ρ, φ), \end{matrix}$

(31)

where $f_{C}$ is the computational resource of the cloud server, $B_{C}$ is the bandwidth resource of the cloud server, and $ψ$ is the number of tasks that the cloud server can handle. $f_{s}$ is the computational resource of the satellite s, $B_{s}$ is the bandwidth resource of the satellite s, and $θ_{(c, s)}$ is the coverage angle of the satellites, $s \in S$ ; note that there are multiple satellites in the network frame in the paper, and the status $s_{t}$ records the status of all satellites. $f_{u}$ is the computational resource of UAV u and $B_{u}$ is the bandwidth resource of UAV u. $N^{n}$ is the number of common tasks, $N^{u r}$ is the number of urgent tasks, $ρ$ is the computational complexity of the tasks, and $φ$ is the task size of a single task.
(2): Action Space: the action set defines the actions that an intelligent can perform in each state. In the computational offloading problem, the action determines the allocation ratio of tasks on different computational nodes. So, in this paper, the unloading ratio $μ_{(u, u)}, μ_{(u, C)}, μ_{(u, s)}, \dots, μ_{(u, S)}$ is taken as the action set, i.e., the action $a_{t} = (μ_{(u, u)}, μ_{(u, C)}, μ_{(u, s)}, \dots, μ_{(u, S)})$ of each intelligent body in the time slot t, where the unloading ratio in the action set $a_{t}$ has to satisfy the constraint $d \sim g$ in the problem $P 1$ .
(3): State Transition: State transition describes the process by which the system evolves from one state to another after an action is taken. In the computation offloading problem, state transitions depend on how offloading decisions affect resource availability and task status. In this study, state transitions are influenced by the following factors: Changes in the available computational resources of each device after task execution. The arrival of new tasks follows the completion of previous ones. The state transition process can be described by the probability model $P (s_{t + 1} | s_{t}, a_{t})$ , which captures the impact of offloading decisions on the future state of the system.
(4): Reward Function: The reward function evaluates the quality of offloading decisions and serves as the foundation for policy optimization in reinforcement learning. The objective of this study is to ensure a high completion rate for urgent task offloading while improving the completion rate of normal tasks and minimizing their energy consumption and delay. Accordingly, the reward function $r s_{t}, a_{t}$ is defined as follows:

$r (s_{t}, a_{t}) = \{\begin{matrix} - [(1 - k) E + k T] & , δ^{u r} = 1 a n d δ^{n} = 1 \\ - [(1 - k) E + k T] - ι (1 - δ^{n}) N^{n} & , δ^{u r} = 1 a n d δ^{n} \neq 1 \\ - [(1 - k) E + k T] - ι (1 - δ^{n}) N^{n} - ι (1 - δ^{u r}) N^{u r} & , δ^{u r} \neq 1 a n d δ^{n} \neq 1 \end{matrix}$

(32)

Among them, k is the weight coefficient between energy consumption and delay, which can be adjusted according to different user requirements, $ι$ indicates the penalty coefficient, which is penalized for every uninstalled task, and the total penalty will be reduced with the improvement of task uninstallation success rate, $(1 - δ^{n}) N^{n}$ indicates the number of uninstalled tasks in common tasks, and $(1 - δ^{u r}) N^{u r}$ indicates the number of uninstalled tasks in urgent tasks. In this paper, we set $k = 0.5, ι = 7$ .
(5): Policy: denoted by $π$ , is a static mapping from a state $s_{t}$ to an action $a_{t}$ . In other words, when the system is in state $s_{t}$ , it selects the corresponding action $a_{t}$ as prescribed by the policy.

Accordingly, the optimization problem of improving the success rate of general task offloading, while jointly minimizing latency and energy consumption based on an MDP framework, can be formulated as Problem

P 2

:

\begin{matrix} (P 2) : & min_{π} \underset{T \to \infty}{l i m} E [\frac{1}{T} \sum_{t}^{T} r (s_{t}, a_{t}) π] \\ s . t . & C 1 \sim C 1 \end{matrix}

(33)

Here,

(33)

denotes the expected average reward. The original Problem

P 1

is thereby transformed into Problem

P 2

, which aims to identify an optimal policy

π

associated with the reward

r (s_{t}, a_{t})

that chooses an action a in state s to minimize the expected average cost. This formulation of Problem

P 2

is a constrained Markov Decision Process (CMDP), representing a classic MDP with additional constraints. Solving such a CMDP in the presence of uncertainty is highly challenging.

On the one hand, standard MDPs have been extensively studied, and deterministic policies can be derived using iterative methods such as policy iteration or value iteration [43]. However, due to the simultaneous consideration of constraints and objectives, these methods are not applicable to CMDPs. On the other hand, while CMDPs with known transition probabilities can be solved via linear programming, this approach is not suitable for CMDPs with uncertainty, as the number of arriving tasks is not deterministic.

5.2. Dirichlet Probability Distribution and Representation of Offloading Decisions

In the computation offloading environment described in this paper, UAVs act as intelligent agents that allocate computation tasks among local processing, cloud servers, and satellites. The constraints on the action set

a_{t}

are defined in Formula (30) and

C 4

–

C 7

, where each variable lies between 0 and 1, and their sum equals 1.

Since the action space is continuous and constrained by a simplex, traditional Gaussian or Beta distributions cannot satisfy these conditions. Using such distributions would require additional normalization or regularization, increasing computational complexity, wasting resources, and slowing optimization.

To address this, the Dirichlet distribution is adopted for policy modeling, as it naturally satisfies the above constraints. The Dirichlet distribution is defined as follows:

P (x_{1}, \dots, x_{J} | α_{1}, \dots, α_{J}) = \frac{1}{B (α)} \prod_{j = 1}^{J} x_{j}^{α_{j} - 1}

(34)

where

x_{J}

represents the offloading ratios satisfying

\sum_{j}^{J} x_{j} = 1

, and

α_{J}

are the parameters of the Dirichlet distribution.

B (α)

denotes the normalization term of the Beta function. When all

α_{J}

are equal, the distribution is approximately uniform; when some

α_{J}

are larger, the distribution skews toward the corresponding dimensions.

In the Actor network, the state set

s_{t}

is input into a multilayer perceptron (MLP) to generate the Dirichlet distribution parameters:

α_{t} = softplus ({MLP}_{θ} (s_{t})) + ϵ

(35)

where

softplus (x) = l n (1 + e^{x})

ensures

α_{j} > 0

, and

ϵ > 0

is a small constant for numerical stability. The offloading ratios are obtained by sampling

μ_{t} \sim Dirichlet (α_{t}) .

(36)

5.3. MAPPO Algorithm Design

D-MAPPO is an extension of MAPPO [44] that incorporates a Dirichlet-based policy. It follows the centralized training and decentralized execution (CTDE) architecture, where each UAV maintains an independent Actor network for decision-making, while the Critic network shares parameters and uses global state information for value estimation.

In the multi-UAV computation offloading scenario considered in this paper, task offloading ratios must be dynamically allocated among local computation, cloud servers, and satellites. The decisions of different UAVs are interdependent, exhibiting strong multi-agent cooperation characteristics. To achieve efficient joint optimization, a multi-agent MAPPO framework is employed. This CTDE-based method enables global coordination while preserving distributed decision-making flexibility. Additionally, PPO’s clipping mechanism enhances policy convergence stability [45]. Combined with Dirichlet-based continuous action modeling, MAPPO effectively learns the optimal offloading ratio policy for UAVs, minimizing overall system latency and improving task success rates.

The key steps of the MAPPO algorithm are as follows:

(1): Network Structure Design: Both the Actor and Critic networks adopt three fully connected layers with tanh activation functions to enhance nonlinearity and expressive capacity. The Actor network includes an additional fully connected layer to output the Dirichlet distribution parameters, providing a probabilistic representation for continuous action selection. The Critic network outputs a scalar value representing the state value. Both networks are trained using the Adam optimizer.
(2): Advantage Estimation: Advantage estimation measures how much better a specific action is compared to the average. It is a critical component of MAPPO, helping reduce variance and improve learning efficiency. MAPPO employs Generalized Advantage Estimation (GAE), defined as follows:

$δ_{t} = r_{t} + γ V_{ϕ} (s_{t + 1}) - V_{ϕ} (s_{t})$

(37)

${\hat{A}}_{t} = \sum_{l = 0}^{T - t - 1} {(γ λ)}^{l} δ_{t + 1}$

(38)

where $δ_{t}$ is the Temporal Difference (TD) error, $r_{t}$ is the immediate reward after UAV action at time slot t, $V_{ϕ} (s_{t})$ is the Critic’s estimated value of state $s_{t}$ , $ϕ$ denotes Critic parameters, $γ$ is the discount factor, and $λ$ controls the bias–variance trade-off.
(3): Policy Update: The log-probability of the Dirichlet policy is given by

$log π_{θ} (μ_{t} | s_{t}) = \sum_{j = 1}^{3} (α_{t, j} - 1) log μ_{u, j} - log B (α_{t})$

(39)

where $π_{θ}$ denotes the current policy. The PPO clipping loss is defined as follows:

$r_{t} (θ) = \frac{π_{θ} (μ_{t} | s_{t})}{π_{θ_{o l d}} (μ_{t} | s_{t})}$

(40)

$L^{c l i p} (θ) = E_{t} [min (r_{t} (θ) {\hat{A}}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})]$

(41)

where $r_{t} (θ)$ is the probability ratio used to control the update step size, $π_{θ_{o l d}}$ represents the previous policy, and $c l i p ()$ limits update magnitude to prevent gradient explosion. $ϵ$ is a hyperparameter. Incorporating Dirichlet entropy regularization enhances exploration, and the Actor loss is defined as follows:

$L^{π} (θ) = - L^{c l i p} (θ) + c_{H} E_{t} [H (Dir (α_{t}))]$

(42)

where $c_{H}$ is the entropy coefficient encouraging exploration, and $H (D i r (α_{t})$ denotes the entropy of the Dirichlet distribution.
(4): Value Function Update: The Critic network is optimized using the mean squared error (MSE) loss:

$L^{V F} (ϕ) = E_{t} [{(V_{ϕ} (s_{t}) - R_{t})}^{2}], R_{t} = {\hat{A}}_{t} + V_{ϕ} (s_{t})$

(43)

where $R_{t}$ is the immediate discounted return serving as the target for value function learning.
(5): Joint Parameter Optimization: The total loss function is expressed as

$L_{total} = L^{π} (θ) + c_{V} L^{V F} (ϕ)$

(44)

where $c_{V}$ is the weighting coefficient for the value loss. The Actor and Critic parameters are updated separately using the Adam optimizer.

In this study, the D-MAPPO framework is adopted to optimize UAV offloading strategies. Specifically, each UAV’s offloading decision is represented as

a_{t} = (μ_{(u, u)}, μ_{(u, C)}, μ_{(u, s)}, \dots, μ_{(u, S)})

, corresponding to the proportions of local computation, offloading to the cloud server, and offloading to multiple satellites. During training, the Actor network, based on the current observation

s_{t}

, outputs the Dirichlet parameters

α_{t}

to sample feasible offloading ratios

μ_{t}

, ensuring actions always satisfy the constraints. The Critic network evaluates the global state value for advantage estimation and policy updating. Following the CTDE framework, agents interact with the environment, receive rewards and next states, compute advantages

{\hat{A}}_{t}

via GAE, update the Actor using PPO clipping loss, and refine the Critic with MSE loss. Through iterative training, the joint optimization of Actor and Critic parameters progressively enhances policy performance until convergence to a near-optimal offloading strategy capable of efficiently allocating computation tasks in dynamic multi-agent environments.

The pseudo-code of the MAPPO algorithm is shown in Algorithm 1:

Algorithm 1 Multi-Agent Proximal Policy Optimization (MAPPO)

1:: Initialize policy network $π_{θ}$ and value network $V_{ϕ}$
2:: Initialize replay buffer $D$
3:: for episode $= 1$ to Max_Episodes do
4:: Reset environment and get initial state $s_{0}$
5:: for timestep $= 1$ to Max_Timesteps do
6:: for each agent i do
7:: Get observation $o_{i}$ from state s
8:: Sample action $a_{i} \sim π_{θ} (o_{i})$
9:: end for
10:: Execute joint action $(a_{1}, \dots, a_{N})$ in environment
11:: Receive next state $s^{'}$ , rewards $(r_{1}, \dots, r_{N})$ , and done flag
12:: Store $(s, o_{1}, a_{1}, r_{1}, \dots, o_{N}, a_{N}, r_{N}, s^{'}, done)$ in $D$
13:: Update state: $s \leftarrow s^{'}$
14:: end for
15:: for each agent i do
16:: Compute TD-error $δ_{t} = r_{t} + γ V_{ϕ} (o_{i}^{'}) - V_{ϕ} (o_{i})$
17:: Compute advantage estimate ${\hat{A}}_{t}$ using GAE
18:: end for
19:: for each optimization step do
20:: Sample mini-batch from $D$
21:: Compute PPO policy loss:

$L^{clip} (θ) = E [min (r_{t} (θ) {\hat{A}}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t}]$
22:: Compute value function loss:

$L^{VF} (ϕ) = {(V_{ϕ} (o_{i}) - R_{t})}^{2}$
23:: Update $θ$ using gradient ascent on $L^{clip} (θ)$
24:: Update $ϕ$ using gradient descent on $L^{VF} (ϕ)$
25:: end for
26:: Clear replay buffer $D$
27:: end for

6. Experiment Analysis

This subsection presents the results of our simulation experiments. We begin by detailing the parameter settings used in the experiments. Then, we compare the performance of the D-MAPPO algorithm under different hyperparameter configurations to select the optimal ones. Finally, we compare our proposed algorithm with several baseline algorithms to demonstrate its superiority.

6.1. Parameter Settings

In the simulations, we implemented the simulation environment and optimization algorithms using Python 3.8.10 and PyTorch 1.9.1. The environment was configured with (U = 5) UAVs, (S = 3) low Earth orbit (LEO) satellites, and (C = 1) cloud server. The total number of normal tasks was approximately 236, and the total number of urgent tasks was approximately 36, accounting for about 13% of all tasks. Unless otherwise noted, all parameter values are taken from the references [41,46,47], and the parameter settings are listed in Table 2. Table 3 lists the hyperparameter values used for the D-MAPPO algorithm.

Based on the above parameter settings, we optimized the task offloading strategies using our proposed algorithm along with five baseline algorithms under identical environmental conditions. The six algorithms compared are as follows:

Beta-MAPPO [48]: A MAPPO algorithm that models the action space using a Beta distribution. It requires training to adapt to the continuous action space constraints in this environment and is suitable for multi-agent offloading problems.

PPO [49]: A single-agent reinforcement learning algorithm. Given the 20-dimensional joint action space of the five UAVs, Dirichlet or Beta distributions are not applicable; hence, a discrete action space is used.

Local: Each UAV processes its collected tasks locally without offloading.

Offloading: All tasks are offloaded to cloud servers and satellites for execution.

Random: Actions are selected completely at random, without considering environmental states or policies, serving as a lower-bound baseline.

6.2. D-MAPPO Algorithm Convergence

To verify the convergence of the algorithm and to select optimal hyperparameters for achieving the best performance of the D-MAPPO algorithm, we first compare the convergence behavior under different hyperparameter settings, including the learning rate (lr), discount factor (gamma), and the number of policy update epochs (n_epochs). The comparison results are shown in Figure 3.

Figure 3 shows the performance of the D-MAPPO algorithm during training under different hyperparameter settings. In Figure 3a, with a learning rate of 0.0003 (red line), the algorithm converges the fastest, stabilizing and reaching the highest average reward within about 200 episodes, demonstrating the best learning efficiency. With a learning rate of 0.003 (blue line), convergence is slower, but the algorithm still achieves high performance after 400 episodes and maintains good stability. When the learning rate is set to 0.00003 (green line), the algorithm exhibits significant fluctuations, with the average reward significantly lower than the first two cases, and it fails to effectively converge. This indicates that a too-small learning rate limits policy updates, negatively affecting learning performance.

Figure 3b shows the results for different discount factors: 0.8, 0.95, and 0.99. With a discount factor of 0.95 (red line), the algorithm converges quickly and stabilizes at a high reward, indicating that long-term rewards are beneficial for policy learning. With a discount factor of 0.99 (blue line), although the optimization performance is poor before 700 episodes, there is an improvement later, though it still does not outperform the 0.95 case. With a discount factor of 0.8 (green line), the algorithm converges faster initially, but the rewards decrease later, suggesting that short-sighted reward estimates are detrimental to long-term optimization.

Figure 3c illustrates the effects of different numbers of policy update epochs (5, 7, and 10). When

n_{e p o c h s i s}

is set to 5 or 7 (red and green lines), the algorithm performs steadily and converges quickly, avoiding overfitting. When

n_{e p o c h s}

is set to 10 (blue line), the algorithm performs well initially, but the rewards decrease later, suggesting that excessive updates may lead to overfitting and reduce generalization performance.

Based on the comparison of the three hyperparameters, we find that the D-MAPPO algorithm achieves better optimization performance when lr = 0.0003, gamma = 0.95, and n_epochs = 5. Next, we will compare the above values with other benchmark algorithms.

After selecting appropriate hyperparameters, we compare the optimization performance of the proposed algorithm with five other baseline algorithms. During the optimization process, tasks that are not successfully offloaded are regarded as timeout tasks, and fixed delay and energy consumption values are assigned to these timeout cases.

6.3. Comparison of Reward Values Among Different Algorithms

As shown in Figure 4, the figure presents the training performance of six algorithms under the same environment: D-MAPPO (red line), Beta-MAPPO (green line), PPO (blue line), Local (black line), Offloading (orange line), and Random (sky blue line). The horizontal axis represents training episodes, and the vertical axis denotes average reward. D-MAPPO performs the best, with the average reward rising rapidly in a short time and stabilizing within approximately 200 episodes. The final converged reward is −73, significantly higher than the other algorithms, indicating strong convergence speed and policy quality. This demonstrates its notable advantages in handling dynamic action spaces and multi-agent coordination. Beta-MAPPO shows very low rewards in the initial stages, even below the Random strategy, indicating significant instability. However, it rebounds after about 150 episodes and continues to improve, eventually stabilizing at −105. Its performance surpasses PPO and Local but remains inferior to D-MAPPO. This “lagging” behavior stems from the unstable convergence of action sampling using the Beta distribution. Although its long-term performance is acceptable, its training efficiency is lower than that of D-MAPPO. The PPO algorithm shows a relatively smooth convergence process and eventually converges around −90. While it performs better than the Local, Offloading, Random, and Beta-MAPPO strategies, it is clearly outperformed by D-MAPPO. This suggests that single-agent strategies struggle to capture coordination in multi-agent environments, limiting adaptability. The Local strategy stabilizes around −126. Since all normal tasks are computed locally without involving offloading, no penalty is incurred. In contrast, the Offloading strategy offloads all tasks. Given the limited capacity of the cloud server and satellite, many tasks cannot be successfully offloaded, resulting in accumulated penalties and consistently low reward values around −500. The reward values of the Random strategy fluctuate greatly due to the highly stochastic nature of its offloading decisions, resulting in large performance variations.

6.4. Delay Optimization Comparison

Figure 5 compares the ordinary task delay optimization performance of the local, offloading (full), and random algorithms with D-MAPPO and Beta-MAPPO. D-MAPPO shows remarkable delay reduction from the early training stages, stably converging to the minimum delay (<0.6 s) after about 200 rounds, surpassing other algorithms. Beta-MAPPO, despite high initial delays, rapidly optimizes to stabilize between 0.8 s and 0.9 s, underperforming compared to D-MAPPO and PPO. PPO converges more slowly, with a final delay of approximately 0.95 s, second only to D-MAPPO. The local and random strategies stabilize at around 1.0 s and 1.1 s, respectively, highlighting the limitations of lacking dynamic scheduling capabilities. The offloading strategy remains at high delay (>3 s) without significant optimization. In summary, D-MAPPO demonstrates exceptional learning ability and decision-making efficiency in delay optimization, confirming its superiority in complex multi-agent offloading scenarios.

Figure 6 presents a performance comparison of the D-MAPPO algorithm with Beta-MAPPO, PPO, local, offloading, and random algorithms in terms of the minimum delay for normal tasks and the relative minimum delay change rate compared to the local algorithm. The analysis shows that D-MAPPO achieves the best performance in minimizing delay, with a value of 0.52 s, representing improvements of 38% and 31% over MAPPO and PPO, 47% over the local algorithm, and 50% over the fully random algorithm. Regarding the minimum delay change rate, the offloading algorithm exhibits the greatest change relative to the local algorithm, approximately 1.25, while D-MAPPO shows the least change at −0.47, demonstrating its significant advantage in delay optimization.

6.5. Energy Consumption Optimization Comparison

Figure 7 compares the performance of several algorithms in optimizing energy consumption for normal tasks. The D-MAPPO algorithm shows a rapid drop in energy consumption during the early training phase and stabilizes after approximately 100 episodes, eventually reaching around 0.07 KJ, achieving the best optimization performance. The energy consumption of the Beta-MAPPO algorithm decreases at a slower pace, stabilizing around 0.072 KJ after about 200 episodes. The PPO algorithm’s energy consumption gradually drops from 0.7 KJ to about 0.2 KJ, with considerable fluctuations, indicating unstable optimization performance. The energy consumption of the Local algorithm consistently remains at 0.025 KJ. This is because wireless data transmission requires significant transmission power, especially for long distances such as satellite communication. The Local algorithm avoids such communication, resulting in the lowest and most stable energy consumption, though at the cost of flexibility. The Offloading algorithm consumes as much as 2.7 KJ and shows no change throughout the training, indicating the worst performance. The Random algorithm exhibits significant fluctuations in energy consumption, but it remains generally stable between 0.5 and 0.6 KJ. Overall, both D-MAPPO and Beta-MAPPO have clear advantages in energy optimization and effectively reduce energy consumption, with D-MAPPO outperforming Beta-MAPPO in terms of optimization performance.

Figure 8 illustrates, in a bar-line chart, the comparison of different algorithms in terms of minimum energy consumption for normal tasks and the relative minimum energy change rate compared to the local algorithm. The x-axis represents the algorithm, the left y-axis indicates average energy consumption (KJ), and the right y-axis indicates the minimum energy change rate. D-MAPPO achieves an energy consumption of 0.072 KJ with a change rate close to 1.68, performing the best. Beta-MAPPO and PPO show slightly higher energy consumption and higher change rates of 1.87 and 4.14, respectively. The local algorithm consumes approximately 0.025 KJ, as it does not perform task offloading. The offloading algorithm has the highest energy consumption at 1.83 KJ and a change rate of 71.8%, performing the worst. Random achieves 0.48 KJ with a change rate of 18.27. Overall, D-MAPPO demonstrates the lowest and most stable energy consumption, with improvements of 7% and 48% over MAPPO and PPO, respectively.

6.6. Optimization Comparison of Task Offloading Success Rate

Figure 9 compares the optimization performance of D-MAPPO and five other algorithms in terms of the task offloading success rate for normal tasks. The D-MAPPO algorithm rapidly improves the offloading success rate in the early training phase, reaching 97% after around 100 episodes and continuing to rise slowly until it stabilizes at 100%, demonstrating excellent convergence and efficiency. The Beta-MAPPO algorithm shows a low success rate at the beginning of training but rises significantly after about 200 episodes, eventually stabilizing between 98% and 99%, performing slightly worse than D-MAPPO. The PPO algorithm’s offloading success rate increases slowly from 85–90% to 93–95%, reflecting a moderate level of performance. The Local algorithm consistently maintains a 100% success rate, representing an extreme strategy that always processes tasks locally. The Random algorithm’s success rate fluctuates between 90% and 93%, showing unstable performance. Due to the limited capacity of cloud servers and satellites to receive tasks, the Offloading algorithm exhibits very poor success rates. Overall, the D-MAPPO algorithm achieves the best performance in optimizing task offloading success rates, followed by Beta-MAPPO, while PPO and Random perform relatively worse. The Local and Offloading strategies lack flexibility. This indicates that reinforcement learning-based algorithms such as D-MAPPO and Beta-MAPPO have significant advantages in optimizing task offloading success rates and can effectively enhance system performance.

Figure 10 compares the offloading success rates for urgent tasks, normal tasks, and all tasks under each algorithm at their respective peak reward performance. The figure shows that the D-MAPPO algorithm achieves a 100% offloading success rate across all task types, indicating its ability to efficiently handle offloading under varying task conditions. Beta-MAPPO and Local algorithms also achieve 100% success rates for both urgent and normal tasks. The PPO algorithm performs slightly worse but still maintains a relatively high success rate. In contrast, the Offloading algorithm yields significantly lower success rates across all task types, ranging from approximately 0.7 to 0.75. The Random algorithm achieves a more consistent success rate of around 0.85, but still underperforms compared to reinforcement learning-based algorithms. This highlights the significant advantage of the D-MAPPO algorithm in handling various task types and effectively improving offloading success rates.

7. Conclusions

This paper aims to enhance the quality of service in network communications for environmental monitoring in remote areas by proposing a network framework based on SAGIN. A systematic introduction of the framework is provided, along with precise modeling of the communication model, task model, satellite coverage model, computation model, and reliability model. To handle different types of computational tasks, they are categorized into urgent and normal tasks. Urgent tasks are prioritized for offloading, with the success rate as the sole optimization objective, while normal tasks are optimized based on a combination of success rate, delay, and energy consumption. Based on the above models, an optimization problem is formulated to ensure the offloading success rate of urgent tasks while improving the success rate and reducing the delay and energy consumption of normal tasks. To solve the optimization problem, this paper proposes the D-MAPPO algorithm, which integrates the Dirichlet distribution with the MAPPO method. The feasibility of the proposed algorithm is validated through simulation experiments, and comparative results with benchmark algorithms further demonstrate the superiority of D-MAPPO in optimization performance.

Author Contributions

Conceptualization, Y.C. and Y.T.; methodology, Y.C. and Y.T.; software, Y.C.; validation, Y.C.; formal analysis, Y.C.; investigation, Y.C.; resources, Y.C.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, Y.C.; visualization, Y.C.; supervision, Y.C. and Y.T.; project administration, Y.T.; funding acquisition, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Qinghai Province Applied Basic Research Program (Grant No. 2023-ZJ-713).

Data Availability Statement

The system code in the article has been released to https://gitee.com/weidiao/SAGIN(accessed on 5 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

6G	Sixth-Generation
SAGIN	Space–Air–Ground Integrated Network
IoT	Internet of Things
MDP	Markov Decision Process
MAPPO	Multi-Agent Proximal Policy Optimization
UAVs	Unmanned Aerial Vehicles
LEO	Low Earth Orbit
MEO	Medium Earth Orbit
GEO	Geostationary Earth Orbit
MEC	Mobile Edge Computing
QoS	Overall Quality of Service
DRL	Deep Reinforcement Learning
JDACO	Joint Data Aggregation and Computation Offloading
DQN	Deep Q Network
PPO	Proximal Policy Optimization
MARL	Multi-Agent Reinforcement Learning
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
PSO	Particle Swarm Optimization
GA	Genetic Algorithm
LoS	Line-of-Sight
NLoS	Non-LoS
AWGN	Additional White Gaussian Noise
UE	User Equipment
RL	Reinforcement Learning
CMDP	Constrained Markov Decision Process
CTDE	Centralized Training and Decentralized Execution
SMAC	StarCraft Multi-Agent Challenge
POEs	Partially Observable Environments
GRUs	Gated Recurrent Units
GAE	Generalized Advantage Estimation
TD Error	Temporal Difference Error
MSE Loss	Mean Square Error

References

Saad, W.; Bennis, M.; Chen, M. A Vision of 6G Wireless Systems: Applications, Trends, Technologies, and Open Research Problems. IEEE Netw. 2020, 34, 134–142. [Google Scholar] [CrossRef]
Zhang, Z.; Xiao, Y.; Ma, Z.; Xiao, M.; Fan, P. 6G wireless networks: Vision, requirements, architecture, and key technologies. IEEE Veh. Technol. Mag. 2019, 14, 28–41. [Google Scholar] [CrossRef]
Kodheli, O.; Maturo, N.; Chatzinotas, S.; Andrenacci, S.; Zimmer, F. NB-IoT via LEO Satellites: An Efficient Resource Allocation Strategy for Uplink Data Transmission. IEEE Internet Things J. 2022, 9, 5094–5107. [Google Scholar] [CrossRef]
Li, B.; Fei, Z.; Chu, Z.; Zhou, F.; Wong, K.K.; Xiao, P. Robust Chance-Constrained Secure Transmission for Cognitive Satellite–Terrestrial Networks. IEEE Trans. Veh. Technol. 2018, 67, 4208–4219. [Google Scholar] [CrossRef]
Xie, R.; Tang, Q.; Wang, Q.; Liu, X.; Yu, F.R.; Huang, T. Satellite-Terrestrial Integrated Edge Computing Networks: Architecture, Challenges, and Open Issues. IEEE Netw. 2020, 34, 224–231. [Google Scholar] [CrossRef]
Jia, Z.; Sheng, M.; Li, J.; Han, Z. Toward Data Collection and Transmission in 6G Space–Air–Ground Integrated Networks: Cooperative HAP and LEO Satellite Schemes. IEEE Internet Things J. 2022, 9, 10516–10528. [Google Scholar] [CrossRef]
Liu, Z.; Weng, J.; Guo, J.; Ma, J.; Huang, F.; Sun, H. PPTM: A Privacy-Preserving Trust Management Scheme for Emergency Message Dissemination in Space–Air–Ground-Integrated Vehicular Networks. IEEE Internet Things J. 2022, 9, 5943–5956. [Google Scholar] [CrossRef]
Pachler, N.; del Portillo, I.; Crawley, E.F.; Cameron, B.G. An Updated Comparison of Four Low Earth Orbit Satellite Constellation Systems to Provide Global Broadband. In Proceedings of the 2021 IEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, 14–23 June 2021; pp. 1–7. [Google Scholar] [CrossRef]
Li, J.; Lu, H.; Xue, K.; Zhang, Y. Temporal Netgrid Model-Based Dynamic Routing in Large-Scale Small Satellite Networks. IEEE Trans. Veh. Technol. 2019, 68, 6009–6021. [Google Scholar] [CrossRef]
Di, B.; Zhang, H.; Song, L.; Li, Y.; Li, G.Y. Ultra-Dense LEO: Integrating Terrestrial-Satellite Networks Into 5G and Beyond for Data Offloading. IEEE Trans. Wirel. Commun. 2019, 18, 47–62. [Google Scholar] [CrossRef]
You, X.; Wang, C.-X.; Huang, J.; Gao, X.; Zhang, Z.; Wang, M.; Huang, Y.; Zhang, C.; Jiang, Y.; Wang, J.; et al. Towards 6G wireless communication networks: Vision, enabling technologies, and new paradigm shifts. Sci. China Inf. Sci. 2020, 64, 110301. [Google Scholar] [CrossRef]
Giordani, M.; Zorzi, M. Non-Terrestrial Networks in the 6G Era: Challenges and Opportunities. IEEE Netw. 2021, 35, 244–251. [Google Scholar] [CrossRef]
Chen, S.; Sun, S.; Kang, S. System integration of terrestrial mobile communication and satellite communication—The trends, challenges and key technologies in B5G and 6G. China Commun. 2020, 17, 156–171. [Google Scholar] [CrossRef]
Zhu, X.; Jiang, C. Integrated Satellite-Terrestrial Networks Toward 6G: Architectures, Applications, and Challenges. IEEE Internet Things J. 2022, 9, 437–461. [Google Scholar] [CrossRef]
Tran, T.X.; Hajisami, A.; Pandey, P.; Pompili, D. Collaborative Mobile Edge Computing in 5G Networks: New Paradigms, Scenarios, and Challenges. IEEE Commun. Mag. 2017, 55, 54–61. [Google Scholar] [CrossRef]
Li, Z.; Wang, Y.; Liu, M.; Sun, R.; Chen, Y.; Yuan, J. Energy Efficient Resource Allocation for UAV-Assisted Space-Air-Ground Internet of Remote Things Networks. IEEE Access 2019, 12, 145348–145362. [Google Scholar] [CrossRef]
Zhou, H.; Wang, Z.; Zheng, H.; He, S.; Dong, M. Cost Minimization-Oriented Computation Offloading and Service Caching in Mobile Cloud-Edge Computing: An A3C-Based Approach. IEEE Trans. Netw. Sci. Eng. 2023, 10, 1326–1338. [Google Scholar] [CrossRef]
Li, K.; Wang, X.; He, Q.; Ni, Q.; Yang, M.; Dustdar, S. Computation Offloading for Tasks With Bound Constraints in Multiaccess Edge Computing. IEEE Internet Things J. 2023, 10, 15526–15536. [Google Scholar] [CrossRef]
Dai, X.; Xiao, Z.; Jiang, H.; Lui, J.C.S. UAV-Assisted Task Offloading in Vehicular Edge Computing Networks. IEEE Trans. Mob. Comput. 2024, 23, 2520–2534. [Google Scholar] [CrossRef]
Subburaj, B.; Jayachandran, U.M.; Arumugham, V.; Suthanthira Amalraj, M.J.A. A Self-Adaptive Trajectory Optimization Algorithm Using Fuzzy Logic for Mobile Edge Computing System Assisted by Unmanned Aerial Vehicle. Drones 2023, 7, 266. [Google Scholar] [CrossRef]
Fengjun, S.; Guo, J. An Energy-efficient Task Offloading Model based on Trust Mechanism and Multi-agent Reinforcement Learning. Res. Sq. 2024, 1, 1–19. [Google Scholar] [CrossRef]
Jia, Z.; Sheng, M.; Li, J.; Niyato, D.; Han, Z. LEO-Satellite-Assisted UAV: Joint Trajectory and Data Collection for Internet of Remote Things in 6G Aerial Access Networks. IEEE Internet Things J. 2021, 8, 9814–9826. [Google Scholar] [CrossRef]
Gao, Y.; Ye, Z.; Yu, H. Cost-Efficient Computation Offloading in SAGIN: A Deep Reinforcement Learning and Perception-Aided Approach. IEEE J. Sel. Areas Commun. 2024, 42, 3462–3476. [Google Scholar] [CrossRef]
Liu, Y.; Jiang, L.; Qi, Q.; Xie, S. Energy-Efficient Space–Air–Ground Integrated Edge Computing for Internet of Remote Things: A Federated DRL Approach. IEEE Internet Things J. 2023, 10, 4845–4856. [Google Scholar] [CrossRef]
Xie, J.; Jia, Q.; Chen, Y.; Wang, W. Computation Offloading and Resource Allocation in Satellite-Terrestrial Integrated Networks: A Deep Reinforcement Learning Approach. IEEE Access 2024, 12, 97184–97195. [Google Scholar] [CrossRef]
Sadiki, A.; Bentahar, J.; Dssouli, R.; En-Nouaary, A.; Otrok, H. Deep reinforcement learning for the computation offloading in MIMO-based Edge Computing. Ad Hoc Netw. 2023, 141, 103080. [Google Scholar] [CrossRef]
Xu, Y.; Deng, F.; Zhang, J. UDCO-SAGiMEC: Joint UAV Deployment and Computation Offloading for Space–Air–Ground Integrated Mobile Edge Computing. Mathematics 2023, 11, 4014. [Google Scholar] [CrossRef]
Raivi, A.M.; Moh, S. JDACO: Joint Data Aggregation and Computation Offloading in UAV-Enabled Internet of Things for Post-Disaster Scenarios. IEEE Internet Things J. 2024, 11, 16529–16544. [Google Scholar] [CrossRef]
Zhou, C.; Wu, W.; He, H.; Yang, P.; Lyu, F.; Cheng, N. Deep Reinforcement Learning for Delay-Oriented IoT Task Scheduling in Space-Air-Ground Integrated Network. IEEE Trans. Wirel. Commun. 2020, 20, 911–925. [Google Scholar] [CrossRef]
Chen, Z.; Zhou, H.; Du, S.; Liu, J.; Zhang, L.; Liu, Q. Reinforcement Learning-Based Resource Allocation and Energy Efficiency Optimization for a Space–Air–Ground-Integrated Network. Electronics 2024, 13, 1792. [Google Scholar] [CrossRef]
Chen, C.; Zhao, M. Orbital collaborative learning in 6G space-air-ground integrated networks. Neurocomputing 2022, 497, 94–109. [Google Scholar] [CrossRef]
Dao, P.N.; Duc, H.A.N. Reinforcement-learning-based control framework for leader-following cascade formation of multiple perturbed surface vehicles. Syst. Control Lett. 2025, 200, 106077. [Google Scholar] [CrossRef]
Chen, Y.; Ai, B.; Niu, Y.; Zhang, H.; Han, Z. Energy-constrained computation offloading in space-air-ground integrated networks using distributionally robust optimization. IEEE Trans. Veh. Technol. 2021, 70, 12113–12125. [Google Scholar] [CrossRef]
Chandrasekharan, S.; Gomez, K.; Al-Hourani, A.; Kandeepan, S.; Rasheed, T.; Goratti, L. Designing and implementing future aerial communication networks. IEEE Commun. Mag. 2016, 54, 26–34. [Google Scholar] [CrossRef]
Seid, A.M.; Boateng, G.O.; Mareri, B.; Sun, G.; Jiang, W. Multi-Agent DRL for Task Offloading and Resource Allocation in Multi-UAV Ena-bled IoT Edge Network. IEEE Trans. Netw. Serv. Manag. 2021, 18, 4531–4547. [Google Scholar] [CrossRef]
Al-Hourani, A.; Kandeepan, S.; Lardner, S. Optimal LAP Altitude for Maximum Coverage. IEEE Wirel. Commun. Lett. 2014, 3, 569–572. [Google Scholar] [CrossRef]
Cheng, N.; Lyu, F.; Quan, W.; Zhou, C.; He, H.; Shi, W.; Shen, X. Space/Aerial-Assisted Computing Offloading for IoT Applications: A Learning-Based Approach. IEEE J. Sel. Areas Commun. 2019, 37, 1117–1129. [Google Scholar] [CrossRef]
Chakour, I.; Daoui, C.; Baslam, M.; Sainz-De-Abajo, B.; Garcia-Zapirain, B. Strategic Bandwidth Allocation for QoS in IoT Gateway: Predicting Future Needs Based on IoT Device Habits. IEEE Access 2024, 12, 6590–6603. [Google Scholar] [CrossRef]
Abdi, A.; Lau, W.C.; Alouini, M.S.; Kaveh, M. A new simple model for land mobile satellite channels: First- and second-order statistics. IEEE Trans. Wirel. Commun. 2003, 2, 519–528. [Google Scholar] [CrossRef]
Di, B.; Song, L.; Li, Y.; Poor, H.V. Ultra-Dense LEO: Integration of Satellite Access Networks into 5G and Beyond. IEEE Wirel. Commun. 2019, 26, 62–69. [Google Scholar] [CrossRef]
Huang, C.; Chen, G.; Xiao, P.; Xiao, Y.; Han, Z.; Chambers, J.A. Joint Offloading and Resource Allocation for Hybrid Cloud and Edge Computing in SAGINs: A Decision Assisted Hybrid Action Space Deep Reinforcement Learning Approach. IEEE J. Sel. Areas Commun. 2024, 42, 1029–1043. [Google Scholar] [CrossRef]
Ding, C.; Wang, J.B.; Zhang, H.; Lin, M.; Li, G.Y. Joint Optimization of Transmission and Computation Resources for Satellite and High Altitude Platform Assisted Edge Computing. IEEE Trans. Wirel. Commun. 2022, 21, 1362–1377. [Google Scholar] [CrossRef]
He, H.; Shan, H.; Huang, A.; Ye, Q.; Zhuang, W. Edge-Aided Computing and Transmission Scheduling for LTE-U-Enabled IoT. IEEE Trans. Wirel. Commun. 2020, 19, 7881–7896. [Google Scholar] [CrossRef]
Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24611–24624. [Google Scholar]
Song, F.; Xing, H.; Wang, X.; Luo, S.; Dai, P.; Xiao, Z. Evolutionary Multi-Objective Reinforcement Learning Based Trajectory Control and Task Offloading in UAV-Assisted Mobile Edge Computing. IEEE Trans. Mob. Comput. 2023, 22, 7387–7405. [Google Scholar] [CrossRef]
Zhou, C.; Wu, W.; He, H.; Yang, P.; Lyu, F.; Cheng, N. Delay-Aware IoT Task Scheduling in Space-Air-Ground Integrated Network. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [Google Scholar] [CrossRef]
Ebadi, F.; Shah-Mansouri, V. Online and Utility-Power Efficient Task Scheduling in Homogeneous Fog Networks. Technical report. arXiv 2024, arXiv:2409.18675. [Google Scholar] [CrossRef]
Li, B.; Liu, W.; Xie, W.; Zhang, N.; Zhang, Y. Adaptive Digital Twin for UAV-Assisted Integrated Sensing, Communication, and Computation Networks. IEEE Trans. Green Commun. Netw. 2023, 7, 1996–2009. [Google Scholar] [CrossRef]
Lan, W.; Chen, K.; Cao, J.; Li, Y.; Li, N.; Chen, Q. Security-Sensitive Task Offloading in Integrated Satellite-Terrestrial Networks. IEEE Trans. Mob. Comput. 2025, 24, 2220–2233. [Google Scholar] [CrossRef]

Figure 1. Framework for an integrated air and space network.

Figure 2. Satellite coverage model.

Figure 3. Convergence rewards of the D-MAPPO algorithm for different learning rates, gamma and n_epochs, respectively.

Figure 4. Convergence of reward values for different algorithms.

Figure 5. Comparison of different algorithms for optimization of latency for common tasks.

Figure 6. Comparison of the minimum latency of common tasks optimized by different algorithms, and the variation of the minimum latency of the remaining algorithms based on the LOCAL algorithm.

Figure 7. Comparison of different algorithms for optimizing energy consumption for common tasks.

Figure 8. Comparison of the minimum energy consumption of common tasks optimized by different algorithms, and the variation of the minimum energy consumption of the remaining algorithms based on the LOCAL algorithm.

Figure 9. Comparison of different algorithms for optimizing the success rate of unloading common tasks.

Figure 10. Comparison of offloading success rates for different tasks with different algorithms at the highest rewards.

Table 1. Summary of Notations in Our System.

Notation	Description	Notation	Description
$I, U, S$	Sets of sensors, UAVs, and satellites	$N_{0}$	Spectral density of added white Gaussian noise (AWGN)
$i, u, s$	Numbers of sensors, UAVs, and satellites	$θ_{G}$	Communication between ground
C	Cloud server	$d_{E}$	Earth’s radius
G	Ground equipment (including ground sensors, UAVs, and cloud servers)	$d_{o}$	Satellite orbit altitude
M	Set of tasks	$d_{G S}$	Distance between GD and satellite
m	Number of tasks	$θ_{c}$	Coverage angle of low Earth orbit (LEO) satellites
$φ$	Data amount of a task	$L_{S}$	Coverage arc length of the satellite
$ρ$	Complexity of a task	$T_{S}$	Coverage time of the satellite
n	Normal task	$V_{S}$	Satellite speed
$u r$	Urgent task	$μ_{(u, u)}$	Proportion of local computation done by UAV u
$p^{l o s s}$	Average path loss of the air-to-ground channel	$μ_{(u, C)}$	Proportion of task offloaded to the cloud server by UAV u
$p^{l o s}$	Line-of-sight (LOS) between ground equipment and UAV	$μ_{(u, s)}$	Proportion of task offloaded to the satellite by UAV u
r	Horizontal distance between UAV and ground equipment	$τ$	Energy coefficient
h	Flight altitude of the UAV	$N_{i}^{n}$	Total number of normal tasks generated by sensor i
$η^{L o S}$ , $η^{N L o S}$	Additional losses in LOS and non-LOS links based on free-space path loss	$N_{i}^{u r}$	Total number of urgent tasks generated by sensor i
$f_{c}$	Carrier frequency	$N_{u}$ , $N_{C}$ , $N_{s}$	Total number of normal tasks received by UAV u, cloud server, and satellite s
c	Speed of light	$f_{u}$ , $f_{C}$ , $f_{s}$	Computational capacities of UAV u, cloud server, and satellite s
$R_{I}, R_{U}, R_{S}, R_{C}$	Maximum transmission rates of ground sensors, UAVs, satellites, and cloud servers	$T_{u}^{t r a n}, T_{C}^{t r a n}, T_{s}^{t r a n}$	Transmission delays for task collection by UAV u, cloud server, and satellite s
$B_{I}, B_{U}, B_{S}, B_{C}$	Bandwidth of ground sensors, UAVs, satellites, and cloud servers	$E_{u}^{t r a n}, E_{C}^{t r a n}, E_{s}^{t r a n}$	Transmission energy consumption for task collection by UAV u, cloud server, and satellite s
$P_{I}, P_{U}, P_{S}, P_{C}$	Transmission powers of ground sensors, UAVs, satellites, and cloud servers	$T_{u}^{c o m p}, T_{C}^{c o m p}, T_{s}^{c o m p}$	Computation delays for UAV u, cloud server, and satellite s
$σ^{2}$	Gaussian noise power	$E_{u}^{c o m p}, E_{C}^{c o m p}, E_{s}^{c o m p}$	Computation energy consumption for UAV u, cloud server, and satellite s
$ξ$	Channel coefficient	$T_{u}^{b a c k}, T_{C}^{b a c k}, T_{s}^{b a c k}$	Backhaul delays for UAV u, cloud server, and satellite s
F	Rician factor	$E_{u}^{b a c k}, E_{C}^{b a c k}, E_{s}^{b a c k}$	Backhaul energy consumption for UAV u, cloud server, and satellite s
$α$	Distance attenuation factor	b	Overhead coefficient
$ξ^{L o S}, ξ^{N L o S}$	LOS and non-LOS channel gains between satellites and communication devices	$T_{S}^{t r a n}$	Total time for transmitting tasks to the satellite
$R_{I, S}, R_{U, S}$	Maximum transmission rates of ground sensors and UAVs to satellites	$φ_{S}^{t r a n}$	Total data transmitted to the satellite
$G_{0}$	Fixed antenna gain

Table 2. Parameter Settings.

Parameter	Value	Parameter	Value
$P_{i}$	$0.1 W$	$θ_{G}$	$40^{\circ}$
$P_{u}, P_{c}, P_{s}$	$1 W$	$V_{s}$	$7.8 km / s$
$B_{s}, B_{c}$	≈ $10 MHz$	$τ$	$10^{- 25}$
$B_{u}$	≈ $5 MHz$	$φ_{m}$	$[0.6, 1.2] M$
$B_{i}$	≈ $1 MHz$	$ρ_{m}$	≈ $500 cycles / bit$
$f_{u}$	≈ $0.5 GHz$	$f_{s}, f_{C}$	≈ $1 GHz$

Table 3. Hyperparameter Settings of D-MAPPO.

Parameter	Value	Parameter	Value
Number of agents	5	Actor network dim	(128, 128, 64, 5)
entropy coefficient	0.01	Critic network dim	(128, 128, 64, 1)
Clipping parameter	0.2	Discount factor	0.95
GAE parameter	0.95	batch size	128
rollout length	75	number of training epochs per update	128

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, Y.; Tong, Y. Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms. Future Internet 2025, 17, 542. https://doi.org/10.3390/fi17120542

AMA Style

Chen Y, Tong Y. Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms. Future Internet. 2025; 17(12):542. https://doi.org/10.3390/fi17120542

Chicago/Turabian Style

Chen, Yitian, and Yinghua Tong. 2025. "Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms" Future Internet 17, no. 12: 542. https://doi.org/10.3390/fi17120542

APA Style

Chen, Y., & Tong, Y. (2025). Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms. Future Internet, 17(12), 542. https://doi.org/10.3390/fi17120542

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Computation Offloading in Space–Air–Ground Integrated Networks for Diverse Task Requirements with Integrated Reliability Mechanisms

Abstract

1. Introduction

2. Related Work

2.1. Task Offloading in SAGIN

2.2. Optimization Algorithms

3. System Model

3.1. Network Architecture

3.2. Task Model

3.3. Communication Model

3.4. Satellite Coverage Model

3.5. Computational Model

3.6. Reliability Mechanisms

4. Problem Analysis

5. D-MAPPO Algorithm Design

5.1. MDP Design

5.2. Dirichlet Probability Distribution and Representation of Offloading Decisions

5.3. MAPPO Algorithm Design

6. Experiment Analysis

6.1. Parameter Settings

6.2. D-MAPPO Algorithm Convergence

6.3. Comparison of Reward Values Among Different Algorithms

6.4. Delay Optimization Comparison

6.5. Energy Consumption Optimization Comparison

6.6. Optimization Comparison of Task Offloading Success Rate

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI