1. Introduction
The escalating global demand for seamless communication and the proliferation of Internet of Things (IoT) applications have rendered terrestrial networks inadequate for the stringent real-time requirements of users, especially in remote environments such as deserts, oceans, and disaster-stricken areas where terrestrial infrastructure is either sparse or entirely absent [1,2]. These scenarios underscore the critical need for next-generation wireless networks that prioritize ultra-low latency and ultra-high reliability to support mission-critical IoT deployments [3]. In this context, the integration of satellite networks—particularly low-Earth orbit (LEO) constellations—into terrestrial ecosystems emerges as a transformative strategy to achieve ubiquitous connectivity, enabling continuous data acquisition and dissemination across underserved regions [4]. By leveraging the unique attributes of LEO satellites, including their low propagation delays, wide coverage footprints, and inherent resilience against ground-based disruptions, integrated satellite–terrestrial networks (ISTNs) can effectively bridge the digital divide, ensuring robust IoT services even in the most challenging operational theaters.
The rapid expansion of remote sensing satellite systems has transformed them into a multifaceted ecosystem encompassing communication, navigation, Earth observation, and experimental satellites, enabling unprecedented integration with cutting-edge technologies like artificial intelligence (AI), advanced networking, and IoT. This convergence has dramatically expanded the utility of remote sensing data, rendering it indispensable across critical domains such as land resource management, economic development, disaster mitigation, meteorology, and environmental monitoring. As these fields increasingly rely on real-time, context-aware insights, the demand for remote sensing resources has undergone a paradigm shift, necessitating a departure from conventional query-based and subscription-based service models [5]. Existing paradigms often falter in addressing IoT-driven requirements due to their high technical barriers, limited contextual relevance, and inflexibility in dynamic scenarios, impeding efficient access to high-value, domain-specific data. This gap highlights the urgent need for innovative, proactive data acquisition strategies capable of delivering timely, tailored remote sensing resources to meet the evolving demands of smart applications, particularly within ISTNs where seamless cross-modal data fusion and adaptive decision making are paramount.
To address this critical need, there is a pressing demand for a remote sensing information service that prioritizes speed, accuracy, and adaptability, particularly for IoT applications requiring dynamic real-time data to enhance situational awareness and enable informed decision making. Low-orbit remote sensing satellite networks have emerged as the preferred infrastructure for real-time data acquisition and dissemination, leveraging their inherent strengths in global coverage, cost-effective deployment, and minimal propagation delays [6,7]. These networks are pivotal for IoT-centric applications such as weather forecasting, disaster monitoring, and remote sensing mapping, where rapid access to high-resolution spatial data is non-negotiable. Recent advancements have yielded successful implementations of low-orbit satellite networks, demonstrating their potential to revolutionize data delivery [8,9,10]. However, traditional LEO satellite systems often face constraints due to limited onboard processing capabilities. Fortunately, breakthroughs in onboard hardware technology have introduced a new generation of processors that deliver high performance with minimal power consumption, enabling efficient onboard compression and slicing of remote sensing data [11]. Additionally, the adoption of relay-assisted access strategies within the space–air–ground integrated network (SAGIN) framework enhances connectivity for diverse IoT devices, ensuring seamless data flow across heterogeneous network infrastructures [12,13]. Looking ahead, intelligent remote sensing satellites are poised to function as dynamic content service providers within the emerging satellite Internet ecosystem. By seamlessly integrating remote sensing capabilities with communication satellite networks, it becomes feasible to establish real-time communication channels between satellites and IoT users, thereby optimizing resource allocation and significantly improving responsiveness to evolving user demands [14].
As user applications and data demands expand, the volume and diversity of data generated by remote sensing satellites have surged, outstripping the load capacity of existing data transmission systems [15]. This disparity has made it increasingly challenging for users to promptly access relevant remote sensing resources from the vast datasets available, thereby underscoring the need for real-time data transmission and intelligent push capabilities. To tackle this issue, an intelligent push strategy that proactively delivers potentially relevant remote sensing data from satellite networks to IoT users can significantly reduce content request delays. Such a strategy is crucial for maximizing the utilization of satellite and user cache resources and enhancing the overall performance of integrated satellite–terrestrial IoT networks. Deep reinforcement learning (DRL) has proven effective in addressing optimization challenges within dynamic network environments, as it leverages its perception and decision-making capabilities to discover optimal solutions through agent–environment interactions. Recently, several studies have applied DRL to optimize resource allocation in satellite networks [16,17].
In this work, we focus on the integration of large-scale remote sensing constellations with communication satellite networks and propose a multi-modal multi-agent proximal policy optimization (MAPPO) algorithm, termed multi-modal-MAPPO, to achieve the real-time intelligent push of multi-modal remote sensing data. The primary contributions of this paper are as follows:
Firstly, to address the distinctive challenges of link intermittency and resource heterogeneity in satellite–terrestrial integrated networks, we introduce a link state-aware proactive delivery algorithm tailored for satellite networks. This algorithm innovatively constructs a multi-granularity network state monitoring model to dynamically capture the real-time fluctuations and outage characteristics of both satellite/user cache resources and link states. By framing the dynamic delivery problem as a Markov decision process (MDP), we provide a robust theoretical foundation for intelligent decision making.
Secondly, our research comprehensively incorporates the spatio-temporal characteristics of remote sensing data that are vital for IoT applications. These characteristics encompass content generation time, geographic parameters (such as longitude and latitude ranges), payload type, resolution, and product level. To the best of our knowledge, this study represents the first application of multi-agent DRL to optimize an in-orbit proactive push for remote sensing data within a large-scale integrated satellite–terrestrial IoT network. This novel approach enhances data availability and access efficiency for a wide range of satellite applications.
Thirdly, to achieve the active delivery of multi-modal remote sensing data in ISTNs, we propose a multi-modal collaborative multi-agent deep reinforcement learning framework. This framework is the first application of MAPPO to in-orbit remote sensing data delivery scenarios. The framework adopts a two-branch architecture: a MobileNetV3-Small pre-trained model extracts deep features from remote sensing images and generates semantic feature vectors representing the image content, while a heterogeneous feature processing channel integrates an MLP branch (which encodes attribute features such as content generation time, geographic location range, and payload type, as well as the satellite cache state) and an LSTM branch (which models the temporal evolution of user-cached content). On this basis, a dynamically weighted attention mechanism adaptively fuses the outputs of the dual branches with the image feature vectors to build a multi-modal correlation model connecting image content, attribute data, and user requirements.
Finally, simulation experiments demonstrate that the proposed multi-modal-MAPPO framework significantly enhances data access efficiency and delivery timeliness. This is achieved through the joint optimization of multi-modal features and agent collaboration mechanisms, which exhibit robust adaptive capabilities and scalability in dynamic topological environments. Consequently, our work establishes a novel paradigm for satellite–terrestrial collaborative services.
The rest of this article is structured as follows. Section 2 discusses current research on remote sensing data push technology in ISTNs. Section 3 models the network scenario and formulates the optimization problem. Section 5 presents the proposed multi-modal-MAPPO intelligent push strategy, Section 7 analyzes the experimental simulation results, and the final section concludes this article.
2. Related Work
As satellite remote sensing technology advances, various satellite systems are being developed, significantly enhancing the comprehensive observation capabilities of Earth. The systems for acquiring remote sensing data are evolving steadily, leading to an exponential increase in the quantity and diversity of data, along with improvements in technical parameters like spatial, temporal, and spectral resolution [18]. At the same time, the frequency of updates for remote sensing data is increasing, enhancing their timeliness. Hence, it is urgent to provide users with intelligent and real-time remote sensing data services [19].
At present, the majority of existing remote sensing information services operate on query-based and subscription-based service models. In query-based services, users typically input criteria such as their area of interest and time to perform a search. The system subsequently returns retrieval results, enabling users to download the desired data. Conversely, in subscription-based services, users encapsulate their query conditions into resource orders and submit these to the remote sensing information system. When relevant remote sensing resources become available, the system automatically pushes them to the users. Although these two data service models facilitate the discovery of remote sensing resources to some extent, they are not without limitations, including passivity, latency, and a reliance on specialized knowledge. Moreover, there are increasing demands for enhanced precision and timeliness from various entities, including national agencies, industry-specific clients, and personalized users. Consequently, satellite information services are transitioning from reliance on large ground stations to vehicle-mounted, shipborne, and handheld terminals for data reception and transformation. It is evident that traditional processing and application models for satellite remote sensing data are insufficient to meet the current requirements for high timeliness, personalized data collection, and the delivery of information products.
In recent years, research on intelligent distribution services for spatial information has made some progress. To fully explore users' potential demand for remote sensing images, the authors in [20] divided the remote sensing image coverage area into grid cells based on a unified grid system and proposed a spatio-temporal recommendation method built on a probabilistic latent topic model, the spatio-temporal periodic task model (STPT). In STPT, user retrieval behaviors for remote sensing images are represented as a mix of latent tasks that act as a link between users and images, and each task is associated with a joint probability distribution over spatial, temporal, and image features. However, because the model adopts the minimum bounding rectangle as the filtering condition for spatial features and assumes that all user acquisitions of images are periodic, it tends to return data conforming to the periodicity rule, which reduces accuracy and makes it unsuitable for diverse users. The authors in [5] proposed the spatio-temporal embedded topic model (STET), which adopts the latent Dirichlet allocation (LDA) model to transform all of a user's data records into documents through spatial, temporal, and content embedding, and constructs a document topic model with spatio-temporal embedding to simulate user preferences, thereby fully exploiting spatial and temporal continuity to recommend appropriate remote sensing images. The authors in [18] proposed a multi-modal knowledge graph recommendation method based on a graph convolutional network with a depth-aware graph attention mechanism: it first constructs a multi-modal knowledge graph of remote sensing images that integrates their various attributes and visual information, then performs information aggregation based on depth-relational attention, enriching the node representations of remote sensing images with multi-modal information and higher-order collaborative signals to significantly improve recommendation accuracy.
The above studies have investigated users' potential interests in depth, significantly improving the accuracy of remote sensing image recommendation. However, they did not consider the real-time network environment of end users, so the time cost for users to acquire remote sensing images remains high and cannot meet real-time demands. To alleviate this time cost, the authors in [21] combined stream push technology with a recommendation algorithm based on attention and multi-view fusion: they constructed a value assessment function from multiple perspectives (remote sensing users, data, and services), defined multi-view heuristic strategies to support resource discovery, and fused these strategies through an attention network to realize the active, precise push of massive multi-source remote sensing resources. The authors in [22] proposed a remote sensing resource service recommendation model that combines a time-aware bidirectional long short-term memory (LSTM) neural network with similarity graph learning on top of stream push technology. A sequence of historical interaction behaviors is first constructed from the user's resource search history; a graph of category similarity relationships between remote sensing resource categories is then built from a cosine similarity matrix, with an LSTM representing the history sequence and a graph convolutional network representing the graph structure, and the combination is used to explore precise similarity relationships. In [19], the authors proposed an efficient remote sensing image data retrieval and distribution scheme based on spatio-temporal features: it uses the spatio-temporal attributes of remote sensing images as content markers, introduces an in-orbit remote sensing image data segmentation algorithm driven by user needs, and adopts multi-path transmission to overcome transmission resource limitations such as link capacity and connection duration, thereby realizing the rapid retrieval and distribution of in-orbit remote sensing image data.
With the development of in-orbit processing and artificial intelligence technologies, future intelligent remote sensing satellite systems will make in-orbit instant information push a popular research direction in remote sensing information recommendation. Beyond the characteristics of the remote sensing data themselves, satellite topology changes, network performance, satellite service resources, user cache resources, user locations, and other factors will then have to be taken into account to realize the proactive, real-time, intelligent in-orbit push of remote sensing data.
In recent years, to improve the quality of service (QoS) for users in ISTNs, there have been several studies on network resource optimization problems [23,24,25], but active content distribution in ISTNs is still at a preliminary exploration stage, mainly focusing on in-network caching techniques to improve the efficiency of content distribution in satellite networks. In the study of collaborative caching algorithms for dynamic network environments, multi-agent deep reinforcement learning (MARL) has demonstrated significant potential. The authors in [26] proposed the C-MAAC algorithm based on the multi-agent actor–critic (MAAC) framework, which innovatively formulates the collaborative caching problem as a multi-agent Markov decision process (MDP). By adopting a centralized training with decentralized execution (CTDE) framework, this method effectively addresses the instability in training environments caused by decentralized multi-agent decision making, and the introduction of a Gumbel-Softmax sampler to generate differentiable action spaces significantly enhances the stability of the training process. Experimental results demonstrate that, under dynamic content popularity scenarios, C-MAAC achieves notable improvements in cache hit rates compared to traditional methods (LFU, LRU) and deep reinforcement learning benchmarks (I-DDPG and D-DQN). In contrast, the authors in [27] approached the problem from the perspective of partially observable environments, proposing a multi-agent deep reinforcement learning framework based on partially observable Markov decision process (POMDP) modeling. Their algorithm integrates a communication module to enable the interactive sharing of state-action information among base stations, employs variational recurrent neural networks for content request state estimation, and incorporates an LSTM to model the temporal characteristics of user access patterns. Experimental results indicate that, compared to existing methods such as LRU and MAAC, this algorithm achieves significant enhancements in both caching rewards and hit rates. The work in [28] addresses the collaborative edge caching problem in LEO satellite networks, aiming to reduce traffic congestion and request latency by caching popular content. It proposes the SEC-MAPPO method, which is based on the multi-agent proximal policy optimization (MAPPO) algorithm and models the problem as a POMDP to tackle the challenges of brief satellite overhead times and dynamic network topology. The experimental results indicate that SEC-MAPPO significantly outperforms baseline algorithms such as RANDOM, LFU, LRU, and the deep Q-learning network (DQN) in reducing the average video request latency, demonstrating its effectiveness in practical LEO satellite network scenarios.
Current research on the remote sensing data push in satellite–terrestrial integrated networks centers on three pivotal dimensions: user demand modeling, dynamic network adaptation, and collaborative distribution mechanisms. While significant progress has been made in each area, critical challenges persist. In user demand modeling, innovative approaches such as multi-attribute decision frameworks leveraging quadruple user profiling, spatio-temporal topic models like STPT and STET, and advanced deep learning architectures including multi-modal knowledge graphs and attention fusion networks have demonstrably enhanced demand prediction accuracy [29,30]. However, these methodologies exhibit notable limitations in capturing the intricate heterogeneity of spatio-temporal correlations and cross-modal feature interactions inherent in remote sensing data, often relying on overly coarse spatial representations and static feature weighting schemes that fail to adequately resolve complex semantic patterns. Transitioning to dynamic network adaptation, existing strategies employing density-based partitioning and spatio-temporal feature tagging attempt to mitigate the challenges posed by intermittent connectivity and resource constraints in hybrid networks. Many of these solutions remain anchored to static network assumptions or rely on approximation algorithms, rendering them ill-equipped to handle the high-frequency topological variations characteristic of large-scale LEO constellations. Furthermore, the precision of coupling models linking remote sensing data semantics with dynamic network states requires substantial refinement. In the realm of collaborative distribution mechanisms, while cache optimization and multi-path transmission techniques have improved distribution efficiency, the absence of global joint scheduling frameworks for computational, storage, and communication resources across satellite–terrestrial nodes represents a critical gap. Centralized decision-making paradigms dominate current implementations, introducing scalability limitations and compromising real-time responsiveness in dynamic environments. Collectively, these challenges underscore the need for more sophisticated models that can unify fine-grained demand understanding with adaptive network management and distributed resource orchestration to achieve truly efficient and intelligent remote sensing data dissemination in next-generation integrated networks.
3. System Model Description
Figure 1 illustrates the integrated satellite–terrestrial remote sensing satellite network architecture, containing $N$ satellites, $M$ users, and a ground station. The ground station is denoted by $G$, the set of users by $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$, and the location of user $i$ by $loc_i$. The remote sensing satellite set is denoted by $S = \{s_1, s_2, \ldots, s_{N_S}\}$, the set of communication satellites by $B = \{b_1, b_2, \ldots, b_{N_B}\}$, and $N = N_S + N_B$. The remote sensing satellite set is divided into $K$ clusters, with the satellite set of the $k$-th cluster denoted as $S_k$ and $|S_k|$ representing the number of remote sensing satellites in the $k$-th cluster. Remote sensing satellites generate real-time remote sensing data with spatio-temporal characteristics. We consider remote sensing data that include both imagery and attribute data, where the attribute data format is (shooting time, latitude starting point, latitude ending point, longitude starting point, longitude ending point, payload type, resolution level, product level). Among the remote sensing satellites, $P$ are equipped with caching modules, while communication satellites are solely responsible for data transmission. Users communicate directly with satellites via satellite terminals to access data, ensuring visibility to at least one communication satellite at any given time. Satellites can relay content to ground stations via feeder links; however, ground stations are typically located in remote areas, incurring high transmission costs and significant service delays. Therefore, proactively pushing content of potential interest to users can substantially reduce request latency. On the other hand, limited user cache resources necessitate intelligent push strategies to conserve cache usage. Thus, the core challenge lies in designing an intelligent delivery strategy that dynamically pushes newly generated remote sensing data to relevant users, aiming to minimize the average content delivery latency. The mathematical model of this system can be described as follows.
3.1. Channel Model
Satellite-to-user link: The channel between satellites and users is represented by the Weibull channel model, mainly considering rainfall attenuation, which has been utilized in numerous existing works [31,32]. The distance between them can be regarded as the height $H$ of the LEO satellite. Due to channel attenuation, the power attenuation of the satellite-to-user link is expressed as

$$ h_{su} = G_s G_u \left( \frac{\lambda}{4\pi H} \right)^2 \frac{1}{\delta_{rain}}, \qquad (1) $$

where $\lambda$ is the wavelength of the carrier wave, $G_s$ and $G_u$ denote the antenna gains of the LEO satellite and the user satellite terminal, and $\delta_{rain}$ represents the rainfall attenuation following the Weibull distribution [33]. Based on the power attenuation of the channel, the Shannon equation yields the maximum transmission rate between LEO satellites and users as

$$ R_{su} = W_{su} \log_2 \left( 1 + \frac{P_s h_{su}}{\sigma^2} \right), \qquad (2) $$

where $P_s$ is the transmission power of the LEO satellite, $W_{su}$ is the bandwidth allocated to LEO satellites, and $\sigma^2$ denotes the background noise power.
Inter-satellite link: Since there is little scattering in the space environment, the large-scale fading of electromagnetic waves in propagation is mainly caused by the free-space propagation effect [34]. Hence, the power attenuation of the link between satellites can be expressed as

$$ h_{ss} = G_t G_r \eta_t \eta_r \left( \frac{\lambda}{4\pi d} \right)^2, \qquad (3) $$

where $G_t$ and $G_r$ are the transmitter and receiver gains of the LEO satellite, respectively; $\eta_t$ is the optics efficiency of the transmitter, $\eta_r$ is the optics efficiency of the receiver, and $d$ is the inter-satellite distance. Utilizing the power attenuation of the channel, the maximum transmission rate between LEO satellites can be calculated as

$$ R_{ss} = W_{ss} \log_2 \left( 1 + \frac{P_s h_{ss}}{\sigma^2} \right). \qquad (4) $$
It is assumed that the link rates of the satellite-to-user and inter-satellite links are time-varying functions denoted by $R_{su}(t)$ and $R_{ss}(t)$, respectively, which are associated with Equations (2) and (4). The link rates $R_{su}(t)$ and $R_{ss}(t)$ are randomly generated within the ranges $[R_{su}^{\min}, R_{su}^{\max}]$ and $[R_{ss}^{\min}, R_{ss}^{\max}]$, respectively. In addition, temporary interruptions may occur on each link, at which time the link rate is 0.
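To make the channel model concrete, the short Python sketch below evaluates Equations (1) and (2) for a single satellite-to-user link; the orbit height, carrier frequency, antenna gains, noise power, and Weibull parameters are illustrative assumptions rather than values from this paper.

```python
import math
import random

def satellite_user_rate(h_leo_m, f_c_hz, g_s_db, g_u_db, p_tx_w, bw_hz, noise_w,
                        weibull_shape=2.0, weibull_scale=1.5):
    """Maximum satellite-to-user rate per Equations (1)-(2).

    Rain attenuation is drawn from a Weibull distribution (in dB), following
    the channel model of Section 3.1; all numeric values are illustrative.
    """
    lam = 3e8 / f_c_hz                                    # carrier wavelength (m)
    fspl = (lam / (4 * math.pi * h_leo_m)) ** 2           # free-space power gain
    rain_db = random.weibullvariate(weibull_scale, weibull_shape)
    gains = 10 ** ((g_s_db + g_u_db - rain_db) / 10)      # antenna gains minus rain fade
    h_su = gains * fspl                                   # total power attenuation
    return bw_hz * math.log2(1 + p_tx_w * h_su / noise_w) # Shannon capacity (bit/s)

# Example: 550 km LEO, 20 GHz carrier, 10 MHz of allocated bandwidth.
rate = satellite_user_rate(550e3, 20e9, 35.0, 25.0, 10.0, 10e6, 1e-13)
print(f"link rate: {rate / 1e6:.2f} Mbit/s")
```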
3.2. Push Model
To address the challenge of minimizing user-perceived latency in accessing time-sensitive remote sensing data, we propose a dynamic intelligent dissemination framework that leverages real-time contextual awareness and adaptive network resource allocation. Recognizing that conventional batch-processing methodologies inherently introduce delays through periodic data aggregation and post-acquisition analysis, our strategy emphasizes proactive data delivery by continuously monitoring evolving user requirements and operational environments. By capitalizing on the temporal dynamics of satellite-derived data streams, the system performs in-transit analytics to predict user needs and pre-emptively routes relevant information across dynamically reconfigurable network topologies. This approach ensures the optimal utilization of intermittent connectivity windows while maintaining service continuity across time-slot transitions, where network configurations can be instantaneously adjusted based on prevailing conditions.
The core innovation lies in synchronizing data generation cycles with user decision workflows through the predictive modeling of demand patterns and network state variables, thereby transforming reactive data retrieval into an anticipatory service paradigm that significantly reduces latency without compromising data integrity or relevance. It is assumed that the network topology is fixed during each time slot but can change instantaneously at time slot transitions. The set of time slots is defined as $\mathcal{T} = \{1, 2, \ldots, T\}$, and multiple remote sensing data items are randomly generated in the remote sensing satellite network during each time slot, with $\mathcal{F}$ representing all remote sensing data generated by the network. The data-generating satellite transmits content summary information to the cluster head node, which then performs intelligent analysis to determine the optimal caching and dissemination strategy. Instead of caching randomly, the cluster head dynamically selects appropriate remote sensing satellites within the cluster to store the content based on real-time demand forecasts and network conditions. Simultaneously, it employs predictive algorithms to match content relevance with user profiles, scheduling proactive pushes during low-traffic time slots to ensure the timely delivery of high-priority data while maintaining the efficient utilization of user terminal cache resources through coordinated content rotation and priority management.
We establish a unified framework in which all remote sensing data items have equal size $s_f$. Cache-enabled satellites and user terminals possess storage capacities of $C_s$ and $C_u$, respectively, with both significantly smaller than the total data size ($C_s \ll |\mathcal{F}| s_f$, $C_u \ll |\mathcal{F}| s_f$). This configuration allows each satellite to store $C_s / s_f$ data items, while user devices can hold $C_u / s_f$ contents. When storage reaches capacity, the first-in-first-out (FIFO) policy is applied for cache replacement. Content dissemination decision making is centralized at the cluster head of the originating satellite's cluster. For cluster $k$, the binary decision variable $a_{k,f}^{j,l} \in \{0, 1\}$ determines whether content $f$ generated by cluster $k$ is pushed to the $j$-th user at time slot $l$; if $a_{k,f}^{j,l} = 0$, the content $f$ is not pushed to the $j$-th user. To enforce efficient cache utilization, each content item can only be actively pushed once across all time slots, governed by the constraint

$$ \sum_{l \in \mathcal{T}} a_{k,f}^{j,l} \leq 1, \quad \forall k, f, j. $$
3.3. Content Delivery Model
In the satellite network system, user-requested content can be served directly at the user's terminal, by a remote sensing satellite, or by a ground station, depending on whether the content has been actively pushed and where it is cached. Because queuing and processing delays are short, this article considers only propagation and transmission delays as the content delivery delay.

Direct delivery mode: If the requested content has been cached in the user terminal, the terminal responds directly, and the delay for user $u$ to obtain content $f$ is 0.
Access delivery mode: If the requested content has been cached in the access remote sensing satellite $s_a$, the content is directly transmitted to the user terminal. The delay $D_{acc}$ for user $u$ to obtain content $f$ can be calculated as

$$ D_{acc} = \frac{d(u, s_a)}{v_{sg}} + \frac{s_f}{R_{s_a, u}(t)}, $$

where $d(u, s_a)$ represents the Euclidean distance from user $u$ to access satellite $s_a$, $v_{sg}$ is the propagation rate of the satellite–ground link, and $R_{s_a, u}(t)$ is the transmission rate between access satellite $s_a$ and user $u$ at time slot $t$.
Collaborative delivery mode: If the requested content is not cached in the access remote sensing satellite $s_a$ but another remote sensing satellite $s_c$ has cached it, the user's request is transmitted to $s_c$ through the communication satellite network, and the content is distributed from $s_c$ through inter-satellite links. Due to the dynamically varying link rates, the content delivery delay varies across the multiple transmission paths between $s_a$ and $s_c$. We use the Dijkstra algorithm to find the shortest path with minimum content delivery delay.
The propagation delay for user $u$ to obtain content $f$ is denoted by $D_{col}^{pro}$, which can be calculated as

$$ D_{col}^{pro} = \frac{d(u, s_a)}{v_{sg}} + \sum_{(m,n) \in \rho(s_a, s_c)} \frac{d(m, n)}{v_{ss}}, $$

where $\rho(s_a, s_c)$ denotes the content delivery path from access satellite $s_a$ to satellite $s_c$, $d(m, n)$ is the distance between adjacent satellites $m$ and $n$ on the path, and $v_{ss}$ is the inter-satellite link propagation rate. The content transmission delay for user $u$ to obtain content $f$ is denoted by $D_{col}^{tra}$, which can be calculated as

$$ D_{col}^{tra} = \frac{s_f}{R_{s_a, u}(t)} + \sum_{(m,n) \in \rho(s_a, s_c)} \frac{s_f}{R_{m,n}(t)}, $$

where $R_{m,n}(t)$ is the transmission rate between satellites $m$ and $n$ at time slot $t$. Then, the delay $D_{col}$ for user $u$ to obtain content $f$ is

$$ D_{col} = D_{col}^{pro} + D_{col}^{tra}. $$
Ground station delivery mode: If the requested content is not cached in the remote sensing satellite network, the content is distributed from the ground station through inter-satellite links. Because of the distance between the ground station and the user, users can obtain content from the ground station only through satellite networks, resulting in significant delay. This paper uses a large constant to represent the delay of users requesting content from ground stations.
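The following Python sketch illustrates one way to realize the minimum-delay path search of the collaborative delivery mode with Dijkstra's algorithm; the graph encoding, link distances, and rates are illustrative assumptions, with an edge cost equal to propagation plus transmission delay as in the equations above.

```python
import heapq

def min_delay_path(links, src, dst, size_bits, v_prop=3e8):
    """Dijkstra over inter-satellite links; edge cost = propagation + transmission delay.

    `links[u]` yields (neighbor, distance_m, rate_bps) tuples for the current
    time slot; a rate of 0 models a temporarily interrupted link.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, dist_m, rate in links.get(node, []):
            if rate <= 0:               # link outage in this slot
                continue
            cost = dist_m / v_prop + size_bits / rate
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                heapq.heappush(heap, (d + cost, nxt))
    return float("inf")                 # content unreachable via the constellation

links = {"sA": [("s2", 2.0e6, 1e9)], "s2": [("sC", 2.0e6, 5e8)]}
print(min_delay_path(links, "sA", "sC", size_bits=8e9))  # delay in seconds
```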
5. Algorithm Design
Targeting large-scale satellite–terrestrial integrated network scenarios, we propose a multi-modal deep reinforcement learning-based proactive remote sensing data delivery algorithm (Multi-modal-MAPPO). By integrating multi-modal data—including remote sensing imagery, attribute data, and user/satellite network cache states—and combining multi-agent collaboration mechanisms, the algorithm achieves high-precision and low-latency data distribution in dynamic network environments. The core design is elaborated through three key aspects: multi-modal feature modeling, cross-modal dynamic fusion, and collaborative decision optimization.
5.1. Multi-Modal Feature Modeling
We address the proactive push task for remote sensing data by categorizing input data into two modalities—image and attributes—for multi-modal feature modeling. For remote sensing image data, we consider visible-light images and SAR images. To address the heterogeneity of multi-source remote sensing data and onboard computing constraints in the integrated satellite–terrestrial networks, we propose dual-modal adaptation optimizations based on the lightweight MobileNetV3-Small architecture:
Heterogeneous input layer adaptation: To address the data discrepancy between single-channel SAR images and three-channel visible-light images, the input layer’s convolution kernel dimensions are restructured—adjusting the original 3 × 3 × 3 kernel to a 3 × 3 × 1 configuration. This reduces the number of parameters in the initial layer by 66.7% while retaining the lightweight characteristics of depthwise separable convolutions, ensuring channel compatibility for SAR data feature extraction.
Pre-training strategies: Distinct approaches are adopted for visible-light and SAR modalities. For visible-light image processing, ImageNet pre-trained weights are directly loaded, with the first three convolutional layers frozen to leverage the transfer learning for inheriting general visual semantic representations, while only the deep feature extraction layers (Layers 4–16) and the classification head are fine-tuned to accelerate convergence. For single-channel SAR images, a three-stage adaptation is implemented: (1) the input layer adaptation averages the weights of the original three-channel convolution kernel along the channel dimension to align with the SAR’s single-channel input; (2) Intermediate layers reuse pre-trained weights, exploiting the cross-modal generalization capability of depthwise separable convolutions; and (3) Kaiming normal initialization is applied to the classification head to mitigate the channel dimension mismatch and gradient vanishing issues. This strategy ensures efficient cross-modal compatibility while preserving computational efficiency.
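As an illustration of the SAR-branch adaptation described above, the following PyTorch sketch (assuming torchvision ≥ 0.13 for the pretrained-weights enum) averages the stem kernel across channels, reuses the pretrained blocks, and Kaiming-initializes a fresh feature head; the 128-dimensional output size is an assumed placeholder, not a value from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small, MobileNet_V3_Small_Weights

def build_sar_backbone(feature_dim: int = 128) -> nn.Module:
    """MobileNetV3-Small adapted to 1-channel SAR input (Section 5.1).

    (1) average the pretrained 3-channel stem kernel over the channel axis,
    (2) reuse the remaining pretrained depthwise-separable blocks,
    (3) Kaiming-initialize a fresh feature head.
    """
    net = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1)

    old_stem = net.features[0][0]                 # Conv2d(3, 16, 3, stride=2)
    new_stem = nn.Conv2d(1, old_stem.out_channels, kernel_size=3,
                         stride=2, padding=1, bias=False)
    with torch.no_grad():                         # 3x3x3 kernel -> 3x3x1
        new_stem.weight.copy_(old_stem.weight.mean(dim=1, keepdim=True))
    net.features[0][0] = new_stem

    head = nn.Linear(net.classifier[0].in_features, feature_dim)
    nn.init.kaiming_normal_(head.weight, nonlinearity="relu")
    nn.init.zeros_(head.bias)
    net.classifier = head                         # emits the semantic feature vector
    return net

feats = build_sar_backbone()(torch.randn(2, 1, 224, 224))
print(feats.shape)                                # torch.Size([2, 128])
```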
The attribute data include remote sensing attribute metadata, satellite cache states, and user cache states from the previous $T$ time slots. We employ a three-layer MLP to jointly encode the remote sensing attribute data and satellite cache states:

$$ f_{attr} = W_3 \, \sigma\big( W_2 \, \sigma( W_1 [x_{attr}; c^{sat}] + b_1 ) + b_2 \big) + b_3, $$

where $x_{attr}$ denotes the remote sensing attribute data, $c^{sat}$ represents the satellite cache state, and $W_1$, $W_2$, $W_3$ are the weight matrices of each layer, with $b_1$, $b_2$, $b_3$ being the corresponding bias terms.

An LSTM module is employed to model the temporal evolution of user cache states over the previous $T$ time slots:

$$ (H, h_T, c_T) = \mathrm{LSTM}\big( c^{user}_{t-T+1}, \ldots, c^{user}_t \big), $$

where $H$ contains the hidden states across the $T$ time steps, and $h_T$ and $c_T$ represent the final hidden and cell states. The last time-step hidden state $h_T$ is extracted as the temporal feature of user cache dynamics. The MLP-extracted attribute features $f_{attr}$ and LSTM temporal features $h_T$ are fused via the residual addition

$$ f_{fused} = f_{attr} + h_T. $$
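A minimal PyTorch sketch of this attribute branch follows: a three-layer MLP for the attribute/satellite-cache vector, an LSTM over the user-cache history, and residual addition of the two features; all dimensions are assumed placeholders.

```python
import torch
import torch.nn as nn

class AttributeBranch(nn.Module):
    """Encodes attribute metadata + satellite cache (MLP) and user-cache history
    (LSTM), then fuses both by residual addition, as in Section 5.1."""
    def __init__(self, attr_dim=32, user_cache_dim=20, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(                 # three-layer attribute encoder
            nn.Linear(attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.lstm = nn.LSTM(user_cache_dim, hidden, batch_first=True)

    def forward(self, attr_and_sat_cache, user_cache_seq):
        f_attr = self.mlp(attr_and_sat_cache)     # (B, hidden)
        out, _ = self.lstm(user_cache_seq)        # (B, T, hidden)
        h_last = out[:, -1, :]                    # last time-step hidden state
        return f_attr + h_last                    # residual fusion

branch = AttributeBranch()
fused = branch(torch.randn(4, 32), torch.randn(4, 10, 20))
print(fused.shape)                                # torch.Size([4, 128])
```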
5.2. Multi-Modal Fusion
To enable fine-grained interaction between the image and attribute modalities, a gated attention mechanism is designed to dynamically allocate modality weights. The image feature $f_{img}$ and fused attribute feature $f_{fused}$ are concatenated and processed through a fully connected network to generate the attention weights

$$ [\alpha_{img}, \alpha_{attr}] = \mathrm{softmax}\big( W_o \, \sigma( W_h [f_{img}; f_{fused}] + b_h ) + b_o \big), $$

where $W_h$ and $W_o$ are the hidden-layer and output-layer weight matrices, respectively, with $b_h$ and $b_o$ as biases. The attention weights satisfy $\alpha_{img} + \alpha_{attr} = 1$. The weighted fusion of the two modalities is then performed based on the attention weights:

$$ f_{multi} = \alpha_{img} f_{img} + \alpha_{attr} f_{fused}. $$

This mechanism adaptively prioritizes the dominant modality while preserving complementary information for robust decision making.
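The gated attention fusion can be sketched in a few lines of PyTorch; the hidden width is an assumed placeholder, and the softmax guarantees that the two modality weights sum to one.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Generates two softmax weights from the concatenated image/attribute
    features and mixes the modalities accordingly (Section 5.2)."""
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # one logit per modality
        )

    def forward(self, f_img, f_attr):
        w = torch.softmax(self.gate(torch.cat([f_img, f_attr], dim=-1)), dim=-1)
        # weights sum to 1: adaptive trade-off between the two modalities
        return w[..., :1] * f_img + w[..., 1:] * f_attr

fusion = GatedAttentionFusion()
f_multi = fusion(torch.randn(4, 128), torch.randn(4, 128))
print(f_multi.shape)                              # torch.Size([4, 128])
```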
5.3. Multi-Agent Collaborative Decision-Making Optimization
The major advantage of DRL is that, through the continuous interaction between agents and the wireless environment, the dynamics of wireless networks, such as satellite mobility and link status, can be resolved, especially in multi-agent systems that can collaborate to solve problems through shared environments [35].
Markov Decision Process

This subsection constructs a Markov decision process (MDP), chosen for its ability to formalize sequential decision-making problems in dynamic environments: it provides structured modeling for an agent to optimize long-term cumulative gains through a mathematical framework of states, actions, rewards, and state transition probabilities. At the same time, the Markov property (the next state depends only on the current state and action) drastically simplifies policy learning in complex environments, enabling reinforcement learning methods to efficiently solve for optimal decision-making policies, as follows.
State space: At time slot $t$, we define the state of the $k$-th agent as

$$ s_t^k = \big\{ F_t, \; C_t^{sat}, \; C_{t-1}^{user}, \; C_t^{user} \big\}, $$

where $F_t$ represents the attribute characteristics of the contents; $C_t^{sat}$ represents the cache status of the remote sensing satellites with cache capabilities, with the cache status of remote sensing satellite $i$ denoted as $c_{i,t}^{sat}$; and $C_{t-1}^{user}$ and $C_t^{user}$ represent the cache status of the users in the previous and current time slots, respectively, where $J$ denotes the number of users involved in a single decision and the cache status of user $i$ in the previous time slot is denoted as $c_{i,t-1}^{user}$.
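For clarity, the sketch below shows one plausible way to flatten these four state components into a single observation vector for an agent; all array shapes are illustrative assumptions.

```python
import numpy as np

def build_agent_state(content_attrs, sat_cache, user_cache_prev, user_cache_now):
    """Flattens the four state components of Section 5.3 into one observation.

    `content_attrs`: (num_contents, 8) array of the attribute tuple
    (time, lat/lon bounds, payload, resolution, product level); the other
    arguments are binary cache-occupancy matrices. Shapes are assumptions.
    """
    return np.concatenate([
        np.asarray(content_attrs, dtype=np.float32).ravel(),
        np.asarray(sat_cache, dtype=np.float32).ravel(),
        np.asarray(user_cache_prev, dtype=np.float32).ravel(),
        np.asarray(user_cache_now, dtype=np.float32).ravel(),
    ])

state = build_agent_state(np.zeros((15, 8)), np.zeros((4, 10)),
                          np.zeros((3, 5)), np.zeros((3, 5)))
print(state.shape)                                # (190,)
```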
Action space: We define a multi-dimensional discrete action space $\mathcal{A}$ to represent the push decision actions, and the action vector can be represented as

$$ a_t = \big\{ a_{f,j} \mid f \in F_t, \; j \in \{1, \ldots, J\} \big\}, $$

where $a_{f,j} \in \{0, 1, \ldots, T\}$, with $a_{f,j} = l$ ($l \geq 1$) indicating that content $f$ is pushed to user $j$ at time slot $l$, and otherwise $a_{f,j} = 0$. If $a_{f,j} = 0$, the content is not pushed to user $j$.
Reward function: To minimize the total content delivery delay, we define the following reward function:

$$ r_t = -D_{total}(t), $$

with

$$ D_{total}(t) = \sum_{j=1}^{J} \sum_{f \in F_t} D_{j,f}(t), $$

where $D_{total}(t)$ denotes the actual latency generated by the content distribution within time slot $t$.
5.4. Proposed Multi-Modal-MAPPO Algorithm
PPO is a robust policy gradient algorithm that addresses the stability and sample-efficiency problems of traditional policy gradient methods through careful clipping of policy network updates and the use of a surrogate objective function, and it has shown excellent performance in various reinforcement learning tasks [36]. Beyond the scope of single-agent DRL methods, MAPPO [37] significantly expands the applicability of PPO. It excels in scenarios where multiple agents must interact and make decisions autonomously, demonstrating superior performance in environments that require stability and sample efficiency within discrete action spaces. Its robust handling of discrete action spaces and overall resilience in multi-agent systems make MAPPO an ideal choice for the dynamic active push problem explored in this study. The MAPPO algorithm employs a centralized training with decentralized execution (CTDE) framework, which suits the fully cooperative relationship in satellite collaborative decision-making scenarios. In this paper, at time slot $t$, agent $i$ can observe the global state of the environment through communication between cluster heads, represented as $s_t$, and then, based on its own actor network $\pi_{\theta_i}$, take action $a_t^i$. Therefore, the actions of all agents can be represented as $a_t = \{a_t^1, a_t^2, \ldots, a_t^K\}$. Because the agents in this research are fully cooperative, all agents share the same reward $r_t$. The critic network then uses the collected global $s_t$ and $a_t$ to generate a Q-value function that maximizes the reward function. The Q-value function is defined as

$$ Q_\pi(s_t, a_t) = \mathbb{E}_\pi \left[ \sum_{n=0}^{\infty} \gamma^n r_{t+n} \,\Big|\, s_t, a_t \right], $$

where $\gamma \in [0, 1)$ is the discount factor. Therefore, the transition buffer can be expressed as

$$ D = \big\{ (s_t, a_t, r_t, s_{t+1}, A_t) \big\}, $$

where $A_t$ is an advantage function that evaluates whether the behavior of the new policy is better than that of the old policy, defined as

$$ A_t = Q_\pi(s_t, a_t) - V_{\phi_i}(s_t), $$

where $V_{\phi_i}(s_t)$ is the state-value function output by the value network of each agent $i$. Therefore, the loss function of the policy network can be expressed as

$$ L(\theta_i) = \mathbb{E}_t \left[ \min \left( \rho_t(\theta_i) A_t, \; \mathrm{clip}\big( \rho_t(\theta_i), 1 - \epsilon, 1 + \epsilon \big) A_t \right) \right], \qquad \rho_t(\theta_i) = \frac{\pi_{\theta_i}(a_t^i \mid s_t)}{\pi_{\theta_i^{old}}(a_t^i \mid s_t)}, $$

where $\mathrm{clip}(\cdot)$ limits the change between the old and new policies to the range $[1 - \epsilon, 1 + \epsilon]$. The parameter $\theta_i$ is then updated using gradient ascent, i.e., $\theta_i \leftarrow \theta_i + \alpha_\theta \nabla_{\theta_i} L(\theta_i)$, where $\alpha_\theta$ is the learning rate of the policy network. The loss function of the value network $V_{\phi_i}$ is a mean square error function, defined as

$$ L(\phi_i) = \mathbb{E}_t \left[ \big( V_{\phi_i}(s_t) - \hat{R}_t \big)^2 \right], $$

where $\hat{R}_t$ is the discounted return. The parameter $\phi_i$ is updated by gradient descent, i.e., $\phi_i \leftarrow \phi_i - \alpha_\phi \nabla_{\phi_i} L(\phi_i)$.
To address the challenge of high-dimensional decision making in this data push scenario, we redesign the actor network of MAPPO by adopting a multi-dimensional discrete action space. Specifically, the original action space for recommending 15 content items across 3 users (with 19 candidate time slots each) would naively require a single softmax output of size $19^{45}$, an intractable number of dimensions. Instead, we decompose the action space into $15 \times 3 = 45$ independent sub-policies, each implemented as a softmax head with 19 dimensions. This multi-dimensional discretization reduces the effective action space to $45 \times 19 = 855$ dimensions, mitigating the "curse of dimensionality" while preserving decision granularity. Hence, with the factored policy $\pi_{\theta_i}(a_t^i \mid s_t) = \prod_{m=1}^{45} \pi_{\theta_i}^m(a_t^{i,m} \mid s_t)$, the gradient of the actor network is expressed by

$$ \nabla_{\theta_i} L(\theta_i) = \mathbb{E}_t \left[ \sum_{m=1}^{45} \nabla_{\theta_i} \log \pi_{\theta_i}^m\big( a_t^{i,m} \mid s_t \big) \, A_t \right]. $$
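The factored actor can be sketched as a shared trunk with 45 independent 19-way softmax heads, sampling one categorical action per (content, user) pair; the layer sizes are assumed placeholders.

```python
import torch
import torch.nn as nn

class MultiHeadActor(nn.Module):
    """Factorized actor: one 19-way softmax head per (content, user) pair,
    i.e. 15 x 3 = 45 sub-policies instead of a joint 19^45 softmax (Section 5.4)."""
    def __init__(self, state_dim=190, hidden=256, n_heads=45, n_slots=19):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_slots) for _ in range(n_heads)])

    def forward(self, state):
        z = self.trunk(state)
        # one independent categorical distribution per push decision
        return [torch.distributions.Categorical(logits=h(z)) for h in self.heads]

actor = MultiHeadActor()
dists = actor(torch.randn(1, 190))
actions = torch.stack([d.sample() for d in dists], dim=-1)       # (1, 45)
log_prob = sum(d.log_prob(a) for d, a in zip(dists, actions.unbind(-1)))
print(actions.shape, log_prob.shape)
```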
Moreover, we introduce a policy entropy term into the actor's loss $L(\theta_i)$, multiplied by a coefficient $\beta$, to prevent the agent from getting trapped in a sub-optimal state. Consequently, the final loss function for the updated policy network can be expressed as

$$ L^{total}(\theta_i) = L(\theta_i) + \beta \, \mathbb{E}_t \big[ \mathcal{H}\big( \pi_{\theta_i}(\cdot \mid s_t) \big) \big], $$

where $\mathcal{H}(\cdot)$ denotes the entropy of the policy distribution.
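A compact PyTorch sketch of the resulting clipped surrogate with the entropy bonus follows; the sign convention (a loss to minimize) and the clip/entropy coefficients are conventional defaults, not values reported in this paper.

```python
import torch

def actor_loss(new_logp, old_logp, adv, entropy, clip_eps=0.2, beta=0.01):
    """PPO clipped surrogate with an entropy bonus, per Section 5.4.

    `new_logp`/`old_logp` are the summed log-probabilities of the factored
    action under the new/old policies; `beta` weights the entropy term.
    """
    ratio = torch.exp(new_logp - old_logp)                  # pi_new / pi_old
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return -(surrogate + beta * entropy).mean()             # negate for minimization

loss = actor_loss(new_logp=torch.tensor([-1.2]), old_logp=torch.tensor([-1.3]),
                  adv=torch.tensor([0.5]), entropy=torch.tensor([2.0]))
print(loss.item())
```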
LSTM [38] is a specialized type of recurrent neural network created to address issues related to long-term dependencies. Since the data push decision problem in this paper is inherently a long-term dynamic decision challenge, we incorporate an LSTM layer into the actor network to better capture the temporal correlation of the state information of different agents. The sequence of the shared state space within a range of $T_s$ timesteps starting from time $t$ is defined as $\{s_t, s_{t+1}, \ldots, s_{t+T_s-1}\}$.
Figure 2 shows the multi-modal-MAPPO algorithm framework, and the training process pseudo-code for the proposed multi-modal-MAPPO is summarized in Algorithm 1.
Algorithm 1 Proposed Multi-modal-MAPPO Algorithm
1:  Initialize the parameters $\theta$, $\phi$, the transition buffer $D$, and the environment;
2:  for each episode do
3:      Initiate the state of all agents;
4:      for each time slot $t$ do
5:          for each agent $k$ do
6:              Each agent obtains the local action $a_t^k$, reward $r_t$, and next state $s_{t+1}$;
7:          end for
8:      end for
9:      Collect the sample trajectory of each agent;
10:     Compute the advantage $A_t$ and return $\hat{R}_t$;
11:     Store the sample trajectory in the transition buffer $D$;
12:     for each training epoch do
13:         Reshuffle the data order and reorder the sequence;
14:         Randomly choose data from $D$;
15:         Update $\theta$ and $\phi$;
16:     end for
17:     Clear the transition buffer $D$;
18: end for
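For orientation, the following deliberately simplified Python skeleton mirrors the episode/time-slot/epoch structure of Algorithm 1; the random environment, linear network stubs, one-step advantage, and unclipped policy-gradient update are stand-ins for the full multi-modal-MAPPO machinery, not the paper's implementation.

```python
import torch

# Minimal CTDE-style training skeleton mirroring Algorithm 1.
K, STATE_DIM, N_HEADS, N_SLOTS = 2, 16, 4, 19

actors = [torch.nn.Linear(STATE_DIM, N_HEADS * N_SLOTS) for _ in range(K)]
critic = torch.nn.Linear(STATE_DIM, 1)
opt = torch.optim.Adam([p for a in actors for p in a.parameters()]
                       + list(critic.parameters()), lr=3e-4)

for episode in range(2):
    buffer = []                                   # cleared every episode (line 17)
    state = torch.randn(STATE_DIM)
    for t in range(8):                            # time slots (line 4)
        dists = [torch.distributions.Categorical(
            logits=a(state).view(N_HEADS, N_SLOTS)) for a in actors]
        acts = [d.sample() for d in dists]        # local actions (line 6)
        reward = -torch.rand(1)                   # shared negative-delay reward
        next_state = torch.randn(STATE_DIM)
        buffer.append((state, acts, reward, next_state))
        state = next_state
    for epoch in range(3):                        # training epochs (line 12)
        s, acts, r, s2 = buffer[torch.randint(len(buffer), (1,)).item()]
        adv = (r + critic(s2) - critic(s)).detach()   # one-step advantage
        logp = sum(torch.distributions.Categorical(
            logits=a(s).view(N_HEADS, N_SLOTS)).log_prob(act).sum()
            for a, act in zip(actors, acts))
        loss = -(logp * adv).mean() + (critic(s) - r).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()  # update theta and phi (line 15)
```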
5.5. Complexity Analysis
The training of models is usually performed offline, while the computational complexity of each agent during execution depends on the combination of neural network forward computations. The proposed multi-modal-MAPPO framework involves multiple neural networks. For the image processing branch, the computational complexity of each CNN unit is denoted as
, where
and
represent the dimensions of the output feature map and convolutional kernel size, and
and
denote the number of input and output channels, respectively. For the attribute processing branch, the three-layer MLP used for encoding attribute features has a computational complexity of
for each fully connected layer, where
and
are the input and output sizes of the layer. The LSTM model used to capture temporal patterns in user cache states has a computational complexity of
, where
is the input feature dimension and
is the hidden state dimension. The attention mechanism involves matrix operations to compute attention weights. For input vectors of dimension
, the complexity of the fully connected layers in the attention mechanism is
, and the softmax operation adds
complexity. Overall, the attention mechanism has a complexity of
. The overall complexity of the MAPPO algorithm depends on the number of agents
K and the complexity of the actor and critic networks. The critic network is a three-layer MLP, with each fully connected layer having a complexity of
, where
and
are the input and output sizes of the layer. Therefore, the overall computational complexity of the proposed multi-modal-MAPPO framework can be summarized as
where
V is the number of layers in the CNN networks, and
Y and
Z are the number of fully connected networks in the actor network and critic network, respectively.
7. Analysis
To validate the advantages of the multi-modal-MAPPO algorithm in multi-modal dynamic environments, this section analyzes its convergence and final performance through comparative experiments. First, Figure 5a presents the reward training curves of the three algorithms, showing that their rewards gradually converge as training progresses. The multi-modal-MAPPO algorithm achieves the highest reward, while MAPPO attains suboptimal performance, demonstrating that the agents can effectively learn and perceive changes in the dynamic system. Compared to MAPPO, multi-modal-MAPPO not only achieves higher rewards but also exhibits faster convergence and a lower standard deviation, indicating a more stable training process. This validates that the introduced multi-modal dynamic attention mechanism enables the refined fusion of image-attribute features, capturing strong correlations between user demands and network states to generate high-value decisions. In contrast, MAPPO's lack of multi-modal feature interaction limits its adaptability to partially observable scenarios, resulting in constrained long-term gains.
The evaluation results presented in Figure 5b demonstrate the superiority of the intelligent push algorithm framework in reducing content delivery delays. By leveraging predictive capabilities for user demands and network states, both multi-modal-MAPPO and MAPPO achieve significant latency improvements over the MAAC baseline. Specifically, multi-modal-MAPPO demonstrates the lowest average delay of approximately 0.086 s, a 53.55% reduction compared to MAAC, while MAPPO achieves 0.1321 s, a 29.55% decrease. These findings underscore the effectiveness of the proposed multi-modal architecture in optimizing proactive content caching strategies.
To analyze system adaptability under varying operational conditions, we conducted parametric studies on cache capacities and user loads. As visualized in Figure 6a, increasing cache sizes from 2000 MB to 4500 MB universally reduces delivery delays due to improved content availability. Notably, multi-modal-MAPPO maintains consistently superior performance, achieving sub-0.1 s latency across all configurations. At the minimum cache capacity of 2000 MB, it outperforms MAPPO by 0.053 s and MAAC by 0.116 s. The scalability of the proposed algorithm is further validated in Figure 6b, where multi-modal-MAPPO achieves the lowest delays under all tested user counts (up to 18 users), demonstrating 23.17% and 38.58% improvements over MAPPO and MAAC, respectively. These results confirm the algorithm's robustness in dynamic environments. Real-time implementation feasibility is verified in Table 3, which shows that the decision-making overhead of multi-modal-MAPPO (0.08654 s per decision) meets the operational requirements of remote satellite caching systems.
Proactive caching effectiveness is further evaluated using the push hit rates in Figure 7. Multi-modal-MAPPO achieves the highest hit rate of 0.7148, surpassing MAPPO (0.543) by 0.1718 and MAAC (0.29) by 0.4248. This substantial improvement highlights the algorithm's enhanced ability to accurately predict user demands. The learning dynamics analysis in Figure 8 reveals critical insights into parameter optimization. While a high critic learning rate accelerates the initial convergence, it risks suboptimal final performance. Conversely, low learning rates for both the actor and critic networks yield slower initial progress but achieve better ultimate delays. Notably, keeping the critic learning rate above the actor learning rate improves the convergence efficiency, whereas inverting this ratio causes training instability and degraded performance. These findings provide valuable guidelines for balancing exploration–exploitation trade-offs in reinforcement learning-based caching systems.