Article

DSEPGAN: A Dual-Stream Enhanced Pyramid Based on Generative Adversarial Network for Spatiotemporal Image Fusion

School of Geophysics and Geomatics, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(24), 4050; https://doi.org/10.3390/rs17244050
Submission received: 14 October 2025 / Revised: 5 December 2025 / Accepted: 15 December 2025 / Published: 17 December 2025

Highlights

What are the main findings?
  • A Dual-Stream Enhanced Pyramid based on GAN (DSEPGAN) is proposed for spatiotemporal fusion of remote sensing images.
  • The model integrates reversible detail preservation and large-kernel feature reconstruction to enhance fine spatial details.
What are the implications of the main finding?
  • DSEPGAN significantly improves detail and edge restoration in regions with pronounced phenological and land-cover changes, ensuring high-fidelity spatiotemporal reconstruction.
  • The dual-stream reversible pyramid design provides a new framework for multi-modal image fusion and change analysis.

Abstract

Many deep learning-based spatiotemporal fusion (STF) methods have been proven to achieve high accuracy and robustness. Due to the variable shapes and sizes of objects in remote sensing images, pyramid networks are generally introduced to extract multi-scale features. However, the down-sampling operation in the pyramid structure may lead to the loss of image detail information, affecting the model’s ability to reconstruct fine-grained targets. To address this issue, we propose a novel Dual-Stream Enhanced Pyramid based on Generative Adversarial Network (DSEPGAN) for the spatiotemporal fusion of remote sensing images. The network adopts a dual-stream architecture to separately process coarse and fine images, tailoring feature extraction to their respective characteristics: coarse images provide temporal dynamics, while fine images contain rich spatial details. A reversible feature transformation is embedded in the pyramid feature extraction stage to preserve high-frequency information, and a fusion module employing large-kernel and depthwise separable convolutions captures long-range dependencies across inputs. To further enhance realism and detail fidelity, adversarial training encourages the network to generate sharper and more visually convincing fusion results. The proposed DSEPGAN is compared with widely used and state-of-the-art STF models in three publicly available datasets. The results illustrate that DSEPGAN achieves superior performance across various evaluation metrics, highlighting its notable advantages for predicting seasonal variations in highly heterogeneous regions and abrupt changes in land use.

1. Introduction

Spatiotemporal fusion (STF) [1,2] aims to integrate multi-source remote sensing images with complementary temporal and spatial resolutions to generate high-spatiotemporal-resolution products for diverse applications, including surface urban heat islands [3], evapotranspiration estimation [4], grassland biomass estimation [5], crop phenology analysis [6], and other fields. STF typically relies on coarse images that provide rich temporal dynamics and fine images that contain abundant spatial details, making effective modeling of their heterogeneous information a central challenge.
Traditional STF methods, including unmixing-based [7,8], weight function-based [9,10,11], Bayesian [12,13], and machine learning-based methods [14,15], often depend on assumptions such as linear spectral mixing or linear temporal change, which are difficult to satisfy in complex scenes. Deep learning (DL)-based methods [16,17,18] alleviate these issues by learning nonlinear relationships directly from data. The Convolutional Neural Network (CNN) is the most commonly used DL architecture. Early CNN-based models (e.g., STFDCNN [19]) incorporate hand-crafted modulation strategies, while recent end-to-end frameworks (e.g., EDCSTFN [20]) directly extract and fuse spatiotemporal features. However, fixed-size convolutional kernels of typically 3 × 3 may not adapt well to the irregular shapes and varying sizes of features in remote sensing images, potentially leading to insufficient local feature extraction, information loss, distortion, and blurring.
Therefore, the multiscale mechanism in DL has been introduced into STF. Recent work introduces multi-branch kernels [21], dilated convolutions [22], and pyramid-based designs [23,24]. Pyramid structures, named for their pyramid-like shape, in which the bottom is larger in scale and the top is smaller, can capture information at different resolutions. Multiple downsampling operations using stride-2 convolutions are applied, progressively reducing the size of the feature maps. Although the pyramid framework is effective for capturing multi-scale features, it often leads to the degradation of high-frequency spatial details. This issue is primarily attributable to two compounded factors: the progressive downsampling operations, which introduce an irreversible information bottleneck, and the resulting deep-layer features, which lack the fidelity required for precise texture reconstruction due to high compression. This structural fidelity loss, when combined with the inherent smoothing effect of the Mean Squared Error (MSE) loss function, ultimately results in severe degradation of texture details and a lack of realism in the fused imagery. Generative Adversarial Network (GAN)-based STF models (e.g., GAN-STFM [25], MLFF-GAN [23]) leverage adversarial learning to enhance perceptual sharpness. However, these approaches typically apply the uniform feature extractor to fine and coarse inputs, neglecting their different information characteristics—coarse images emphasize temporal dynamics, while fine images provide spatial structure. This uniform treatment can hinder feature modeling and limit fusion performance. Thus, there is a clear need for an advanced pyramid architecture that can both mitigate the high-frequency detail loss from downsampling and apply tailored, asymmetric extraction strategies to the distinct coarse and fine inputs. This motivates our work on a dual-stream, detail-preserving pyramid structure.
Effectively fusing the multiscale information extracted by the pyramid architecture constitutes another key challenge. Most existing models resort to simple concatenation or element-wise addition followed by standard 1 × 1 or 3 × 3 convolutions, which have limited receptive fields. Reconstructing fine pixel-level details, however, often requires contextual information from a substantially wider spatial area, particularly in highly heterogeneous regions characterized by complex and fragmented landscapes. Transformer-based STF models [26,27], such as STF-Trans [28], address this limitation but at a substantially higher computational cost, thereby limiting their scalability to large remote sensing scenes. Recent advances in large-kernel CNNs [29,30] provide a more efficient alternative, substantially expanding the receptive field while preserving computational practicality.
Building upon the aforementioned analysis, we propose a Dual-Stream Enhanced Pyramid based on Generative Adversarial Network (DSEPGAN) for the Spatiotemporal Fusion of remote sensing images. The primary methodological innovations and contributions of this work are summarized as follows:
  • A Dual-Stream Decoupling Framework: We introduce the Pyramid Time Change Extractor (PTCE) and the Pyramid Space Detail Extractor (PSDE), creating a specialized architecture that explicitly decouples the extraction of temporal dynamics from coarse imagery and high-frequency spatial structure from fine imagery. This asymmetric approach ensures dedicated modeling tailored to the heterogeneous nature of multi-sensor inputs.
  • Structurally Lossless Spatial Detail Preservation: To fundamentally address the irreversible information bottleneck caused by downsampling, the PSDE is customized with a novel detail-preserving strategy that integrates Affine Coupling Layers and Patch Merging operations. This combination maximizes the fidelity of extracted features, which is crucial for reconstructing sharp texture and edge details in STF products.
  • Hierarchical Long-Range Feature Aggregation (HLRFA): We designed the HLRFA module, incorporating advanced feature fusion techniques such as Large Kernel Fusion Blocks. This mechanism efficiently combines segregated temporal and spatial features across multiple scales, substantially expanding the receptive field to capture necessary long-range contextual information without incurring the heavy computational burden associated with Transformer-based models.
  • Superior Fusion Performance: Through comprehensive experiments on three benchmark datasets, DSEPGAN demonstrates significant improvements over existing state-of-the-art models, achieving superior performance in both spatial fidelity (sharpness and texture preservation) and spectral consistency (radiometric accuracy), validated by both objective metrics and visual quality assessment.
The remainder of the article is structured as follows: Section 2 reviews the related work. Section 3 introduces the proposed DSEPGAN method in detail. Section 4 evaluates the performance of the proposed approach in predicting both seasonal variations and abrupt changes. Section 5 discusses model efficiency, ablation experiments, and parameter analysis. Finally, Section 6 concludes the study.

2. Related Work

2.1. Multiscale Mechanisms in STF

Multiscale mechanisms used in DL-based STF models can be divided into three categories: (1) Parallel usage of convolutional kernels of different sizes to flexibly perceive multiscale information from feature maps. For example, DMNet [21] simultaneously employs convolutional kernels of sizes 3, 5, and 7. However, increasing the size of the kernels also leads to an increase in parameters and computational load. (2) Using dilated convolutions with different dilation rates can effectively increase the network’s receptive field without introducing additional parameters, such as MANet [22], PDCNN [31], and STFMCNN [32]. (3) Using a pyramid structure, which progressively reduces spatial resolution to expand the receptive field and significantly reduces computational cost. For example, the encoders of MLFF-GAN [23], PSTAF-GAN [33], and DCDGAN-STF [34] employ a cascade of residual blocks and downsampling layers. However, this reliance on standard downsampling operations introduces an irreversible information bottleneck, severely compromising the fidelity of high-frequency spatial details in the generated features. This deficiency motivates the subtle combination of a reversible feature transformation strategy and the pyramid structure.

2.2. Long-Range Dependency Modeling in STF

Long-range spatial dependency modeling is essential in STF, and Transformer architectures have been introduced into STF for this purpose. For instance, SwinSTFM [35] achieves robust spatiotemporal feature modeling by integrating a window-based self-attention mechanism [36] with linear unmixing theory. STF-Trans [28] adopts a two-stage strategy where CNNs handle initial (shallow) feature extraction, while a Vision Transformer module is subsequently employed for deeper modeling. CTSTFM [37] focuses on efficient feature processing by incorporating spatial and channel attention modules alongside cross-attention mechanisms to enhance both extraction and fusion stages. However, their high computational cost stems primarily from the quadratic scaling of self-attention. Furthermore, while focusing on the global context, these models can sometimes neglect the preservation of fine local details. Consequently, a novel solution is urgently required to effectively model long-range dependencies while keeping the computational cost manageable.

3. Proposed Methods

3.1. Pyramid Generator

The core difficulty in STF lies in addressing the asymmetry in input data (a high-resolution image is available only at the reference time) and ensuring maximal preservation of spatial details. To tackle these challenges, we designed the DSEPGAN generator based on three core principles: Decoupling, Preservation, and Efficiency.
As shown in Figure 1, first, a Dual-Stream Encoder is employed to decouple feature extraction, ensuring that time change information and spatial detail are handled by dedicated streams. The Pyramid Time Change Extractor (PTCE) is dedicated to capturing the time change from the coarse images ($C_0$ and $C_1$), while the Pyramid Space Detail Extractor (PSDE) is optimized for extracting and preserving the high-frequency spatial information from the fine image ($F_0$). Second, Invertible Neural Networks (INNs) are introduced into the PSDE to enable theoretically lossless feature transformation, which is crucial for maximizing the fidelity of limited high-resolution spatial details. Third, feature aggregation is performed by the Hierarchical Long-Range Feature Aggregation (HLRFA) module, which efficiently models necessary contextual dependencies across multiple scales using large kernels, avoiding the high computational cost of global attention.
The overall fusion process involves four stages: (1) Decoupled Extraction: PTCE processes $C_0$ and $C_1$ to extract four multi-scale features $C_0^i$ and $C_1^i$ ($i = 1, 2, 3, 4$). PSDE processes $F_0$ to extract four detailed features $D^i$ ($i = 1, 2, 3, 4$). (2) Initial Fusion: The deepest features $C_0^4$, $C_1^4$, and $D^4$ first enter the Large Kernel Fusion Block for initial fusion. (3) Hierarchical Aggregation: These fused features, along with the temporal change features $C_0^i$, $C_1^i$ and spatial detail features $D^i$ from the remaining stages, sequentially undergo HLRFA fusion from stage 3 to stage 1. (4) Reconstruction: After the final aggregation stage, the features are recovered to the original input size and passed through three convolutional layers to reconstruct the final high-resolution fusion result $\hat{F}_1$.
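To make the data flow concrete, the following PyTorch-style skeleton sketches these four stages under stated assumptions: the submodule interfaces, the feature-list ordering (index 0 = finest scale, index 3 = deepest), and the names hlrfa_stages and reconstruct are illustrative and do not reproduce the released implementation:

import torch
import torch.nn as nn

class DSEPGANGenerator(nn.Module):
    """Skeleton of the four-stage fusion flow; submodule interfaces are illustrative."""

    def __init__(self, ptce, psde, lkfb, hlrfa_stages, reconstruct):
        super().__init__()
        self.ptce = ptce                           # Pyramid Time Change Extractor (Siamese)
        self.psde = psde                           # Pyramid Space Detail Extractor
        self.lkfb = lkfb                           # initial Large Kernel Fusion Block (deepest scale)
        self.hlrfa = nn.ModuleList(hlrfa_stages)   # HLRFA modules for stages 3, 2, 1
        self.reconstruct = reconstruct             # upsampling plus three conv layers

    def forward(self, c0, c1, f0):
        # (1) Decoupled extraction: lists of features at scales i = 1..4 (indices 0..3)
        c0_feats, c1_feats = self.ptce(c0), self.ptce(c1)   # shared weights for both dates
        d_feats = self.psde(f0)
        # (2) Initial fusion of the deepest features (stage 4)
        fused = self.lkfb(d_feats[3], c0_feats[3], c1_feats[3])
        # (3) Hierarchical aggregation from stage 3 down to stage 1
        for i, stage in zip((2, 1, 0), self.hlrfa):
            fused = stage(fused, d_feats[i], c0_feats[i], c1_feats[i])
        # (4) Reconstruction of the high-resolution fused image F1_hat
        return self.reconstruct(fused)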

3.1.1. Pyramid Time Change Extractor

Considering that single-scale features can hardly represent the comprehensive information contained in remote sensing images, a Pyramid Time Change Extractor is designed. Crucially, the PTCE employs a Siamese network structure with shared weights. This architectural choice ensures that the two coarse images are encoded in a consistent feature space, so that the discrepancies between their features directly reflect the temporal change information that occurred between the two time points.
As shown in Figure 2, the extractor consists of four convolutional blocks stacked together, and each convolutional block consists of a convolutional layer and a LeakyReLU activation function, where (3,3) denotes a 3 × 3 convolution kernel, and Stride denotes the stride of the convolution operation. To extract features at various scales, the stride is set to 2 starting from the second convolutional block, thereby creating a pyramidal structure for the features. The features of each layer are passed to the subsequent HLRFA modules. By adopting this structure, firstly, features of different scales and depths relevant to the temporal context can be extracted, which is particularly important for remote sensing images with complex scenes. Secondly, reducing the feature size also reduces the computational cost, enabling efficient extraction of multi-scale temporal features.
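A minimal sketch of the PTCE is given below, assuming PyTorch; the base channel width and its doubling at each level are illustrative assumptions, while the stride pattern and LeakyReLU activations follow the description above:

import torch
import torch.nn as nn

class PTCE(nn.Module):
    """Pyramid Time Change Extractor: four conv blocks, stride 2 from the second block onward."""

    def __init__(self, in_ch=6, base_ch=32):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for i in range(4):
            stride = 1 if i == 0 else 2               # downsample from the second block
            out_ch = base_ch * 2 ** i                  # channel doubling is an assumption
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, kernel_size=3, stride=stride, padding=1),
                nn.LeakyReLU(0.2, inplace=True)))
            ch = out_ch

    def forward(self, x):
        feats = []                                     # one feature map per pyramid level
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats                                   # scales: 1, 1/2, 1/4, 1/8

ptce = PTCE()
c0, c1 = torch.randn(1, 6, 256, 256), torch.randn(1, 6, 256, 256)
c0_feats, c1_feats = ptce(c0), ptce(c1)                # Siamese usage: shared weights for both dates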

3.1.2. Pyramid Space Detail Extractor

In STF, most high-frequency information—such as edges and textures—comes from the fine image at the reference time. Since these details are essential for improving the fidelity of the predicted fine image, accurately preserving them is crucial for STF. The invertible mapping design in Invertible Neural Networks (INNs) [38,39] enables theoretically lossless transformations of features at each layer, allowing the network to preserve information in feature representations better. They have been applied in image super-resolution [40], image denoising [41], image hiding [42], and multimodal image fusion [38,39].
The preservation capability is fundamentally critical for STF because the problem is ill-posed and information-deficient, relying solely on the single high-resolution input ($F_0$). Standard non-invertible convolutions, while effective for feature extraction, inherently lead to information degradation and detail blurring. Mathematically, the invertibility ensures that the Jacobian determinant of the transformation matrix is maintained close to unity ($\det J \approx 1$). This property guarantees the preservation of volume in the latent feature space, meaning the limited, high-frequency spatial information (such as edges and textures) is transferred across the deep network layers without being compressed or destroyed. This maximal fidelity is key to improving the structural quality of the final fused image.
The INN is invertible, meaning the network’s input and output can generate each other without any loss of information, making it particularly suitable for tasks requiring the preservation of image integrity. The invertible layers of INN perform specific transformation operations. The affine coupling layer [43,44] employed in this study first divides the input feature tensor $Z_l$ into two parts along the channel dimension, $Z_l^{1:c}$ and $Z_l^{c+1:C}$ (here, $C$ is the total number of channels, and $c$ is the split point). An additive transformation is applied to $Z_l^{c+1:C}$ and an affine transformation is applied to $Z_l^{1:c}$. Finally, the two transformed feature parts are concatenated along the channel dimension to form the output tensor $Z_{l+1}$. The forward propagation of this process is formulated as follows:
$Z_{l+1}^{c+1:C} = Z_l^{c+1:C} + \phi\left(Z_l^{1:c}\right)$
$Z_{l+1}^{1:c} = Z_l^{1:c} \odot \exp\left(\rho\left(Z_{l+1}^{c+1:C}\right)\right) + \eta\left(Z_{l+1}^{c+1:C}\right)$
$Z_{l+1} = \mathrm{cat}\left(Z_{l+1}^{1:c},\, Z_{l+1}^{c+1:C}\right)$
where $\odot$ is the Hadamard product, $\exp(\cdot)$ is the exponential function, and $\rho(\cdot)$, $\eta(\cdot)$, and $\phi(\cdot)$ are arbitrary mapping functions which are not necessarily invertible, so it is possible to implement them via a neural network. $\mathrm{cat}(\cdot)$ denotes the concatenation function along the feature dimension. The inverse propagation of this transformation, which reconstructs the original input $Z_l$, can be expressed as follows:
$Z_l^{1:c} = \left(Z_{l+1}^{1:c} - \eta\left(Z_{l+1}^{c+1:C}\right)\right) \odot \exp\left(-\rho\left(Z_{l+1}^{c+1:C}\right)\right)$
$Z_l^{c+1:C} = Z_{l+1}^{c+1:C} - \phi\left(Z_l^{1:c}\right)$
The PSDE architecture is expected to preserve information from fine images as much as possible. The input and output of invertible layers can be mutually generated, preventing information loss. Therefore, invertible layers have been introduced into PSDE.
The PSDE, illustrated in Figure 3, adopts a pyramid architecture similar to the PTCE. To preserve the extracted fine-grained information, a Detail Preservation Module composed of $N$ affine coupling layers is incorporated. Considering the balance between performance and computational cost, the Bottleneck Residual Block (BRB) from MobileNetV2 [45] is employed to implement the mapping functions $\rho(\cdot)$, $\eta(\cdot)$, and $\phi(\cdot)$. The depthwise separable convolutions in BRB significantly reduce computational load and the number of parameters. The overall process of the PSDE is as follows: shallow features are first extracted using a convolutional block, increasing the channel dimension of the input image to $C$. These shallow features then enter four Detail Preservation Modules. To construct the pyramid structure and progressively reduce spatial resolution, a Patch Merging (PM) operation adapted from Swin Transformer [36] is added before the Detail Preservation Module, halving the width and height of the feature map.
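The following PyTorch sketch illustrates one affine coupling layer implementing the forward and inverse transforms above. The plain convolutional stacks used here for $\phi(\cdot)$, $\rho(\cdot)$, and $\eta(\cdot)$ are simplified stand-ins for the Bottleneck Residual Blocks employed in the paper, and the equal channel split is an assumption; the final check verifies that the inverse reconstructs the input up to floating-point error:

import torch
import torch.nn as nn

def conv_map(in_ch, out_ch):
    # Stand-in for the BRB used in the paper to realize phi / rho / eta.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1))

class AffineCoupling(nn.Module):
    """Affine coupling layer: split channels, additive + affine transforms, concatenate."""

    def __init__(self, channels, split):
        super().__init__()
        self.split = split
        self.phi = conv_map(split, channels - split)
        self.rho = conv_map(channels - split, split)
        self.eta = conv_map(channels - split, split)

    def forward(self, z):
        z1, z2 = z[:, :self.split], z[:, self.split:]
        y2 = z2 + self.phi(z1)                                # additive transform
        y1 = z1 * torch.exp(self.rho(y2)) + self.eta(y2)      # affine transform
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.split], y[:, self.split:]
        z1 = (y1 - self.eta(y2)) * torch.exp(-self.rho(y2))   # invert the affine transform
        z2 = y2 - self.phi(z1)                                # invert the additive transform
        return torch.cat([z1, z2], dim=1)

layer = AffineCoupling(channels=32, split=16)
z = torch.randn(2, 32, 64, 64)
print(torch.allclose(layer.inverse(layer(z)), z, atol=1e-4))  # lossless up to fp error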

3.1.3. Hierarchical Long-Range Feature Aggregation

The design of the HLRFA module is motivated by the critical need for efficient long-range dependency modeling and multiscale feature integration in high-resolution STF. Recovering complex spatial structures requires a vast receptive field. However, global attention mechanisms, while effective at capturing long-range context, incur quadratic computational complexity ($O(N^2)$), making them computationally prohibitive for high-resolution input images.
To address this efficiency bottleneck, we design HLRFA, composed of three sequential components: Feature Upsampling Unit (FUU), Large-Kernel Fusion Block (LKFB), and Cross-Scale Fusion Unit (CSFU), as shown in Figure 4. These components collaboratively enable the aggregation of features across multiple hierarchical levels, capture long-range temporal and spatial dependencies, and maintain a scalable, linear computational complexity ($O(N)$).
The previous-stage output $F^{i+1}$ (from the coarser stage $i+1$) is first upsampled to match the spatial resolution of the current stage, ensuring that multiscale features are properly aligned for subsequent fusion. This operation, referred to as the FUU, employs a convolution to expand the channel dimension followed by pixel shuffle to expand the spatial dimensions, which can be formally expressed as follows:
$F_{\mathrm{up}}^{i+1} = \mathrm{PixShuffle}\left(\mathrm{Conv}\left(F^{i+1}\right)\right)$
where Conv denotes a convolution operation, and PixShuffle represents the pixel shuffle function used to rearrange and expand spatial dimensions.
To capture long-range dependencies within the current stage, D i , C 0 i and C 1 i are first concatenated along the feature dimension and reduced via 1 × 1 convolution. In the LKFB, large-kernel convolutions are employed to aggregate contextual information across an extended receptive field, inspired by the self-attention mechanism in Transformer architectures, which enables direct modeling of long-range spatial and temporal dependencies. To further enhance fine-grained information, the output of the large-kernel convolutions is combined with the original detail feature D i , providing residual guidance that preserves high-frequency structures while integrating long-range context. Depthwise and pointwise convolutions are then applied to maintain computational efficiency, and a 1 × 1 convolution-based Feed-Forward Network (ConvFFN) further refines the fused representation, yielding features that simultaneously capture global context and local details.
$F_{\mathrm{same}}^{i} = D^{i} + \mathrm{LKCNN}\left(\mathrm{cat}\left(D^{i},\, C_1^{i} - C_0^{i}\right)\right)$
The upsampled features from the previous stage, $F_{\mathrm{up}}^{i+1}$, are integrated with the refined current-stage features $F_{\mathrm{same}}^{i}$ and the latest temporal features $C_1^{i}$ within the CSFU. This module not only aggregates multiscale features from consecutive stages but also explicitly accounts for substantial land-cover changes between temporal observations. In regions where significant changes occur, the newly emerged information is primarily captured by $C_1^{i}$, ensuring that the fusion process incorporates these novel details. The combined features are then processed through two successive convolutional layers to produce the output of the current stage:
$F^{i} = \mathrm{Conv}\left(\mathrm{Conv}\left(\mathrm{cat}\left(F_{\mathrm{up}}^{i+1},\, F_{\mathrm{same}}^{i},\, C_1^{i}\right)\right)\right)$
This hierarchical design enables progressive aggregation of multiscale features, preserves fine temporal and spatial details, and ensures effective communication between consecutive fusion stages, which is crucial for high-fidelity STF in heterogeneous remote sensing scenes.
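As an illustration of the large-kernel fusion described above, the sketch below combines the three stage-$i$ inputs with a 1 × 1 reduction, a depthwise 13 × 13 convolution for long-range aggregation, residual detail guidance, and a 1 × 1 ConvFFN. Whether the two temporal features enter individually or as a difference, the expansion ratio, and the activation choice are assumptions rather than the released design:

import torch
import torch.nn as nn

class LargeKernelFusionBlock(nn.Module):
    """Large-kernel fusion: 1x1 reduce, depthwise 13x13 conv for long-range context,
    residual guidance from the detail feature, then a 1x1 ConvFFN refinement."""

    def __init__(self, channels, kernel_size=13):
        super().__init__()
        self.reduce = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.ffn = nn.Sequential(                          # ConvFFN built from 1x1 convs
            nn.Conv2d(channels, 4 * channels, 1), nn.GELU(),
            nn.Conv2d(4 * channels, channels, 1))

    def forward(self, d_i, c0_i, c1_i):
        x = self.reduce(torch.cat([d_i, c0_i, c1_i], dim=1))   # fuse the three inputs
        x = self.pointwise(self.depthwise(x))                  # long-range aggregation
        x = d_i + x                                            # residual detail guidance
        return x + self.ffn(x)

block = LargeKernelFusionBlock(channels=64)
d = torch.randn(1, 64, 64, 64)
f_same = block(d, torch.randn_like(d), torch.randn_like(d))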

3.2. PatchGAN-Based Discriminator

The discriminator architecture adopts PatchGAN [46,47]. Unlike a traditional discriminator that outputs a single value signifying whether the entire generated image is real or fake, PatchGAN produces a patch map of size $N \times N$ ($N$ is much smaller than the input image size), each element of which indicates whether the corresponding overlapping image patch is real or fake. This patch-level discriminator design requires fewer parameters than a full-image discriminator and enables handling images of arbitrary size in a fully convolutional manner.
As illustrated in Figure 5, the discriminative network is a straightforward CNN classification network comprising convolutional layers, Batch Normalization (BN), and a LeakyReLU activation function. To ensure the stability of model training, $C_1$ is introduced into the discriminator as a conditional label along with either the generated image or the real image. Specifically, during the training phase of the discriminator, when presented with inputs of real images $F_1$ and $C_1$, it is expected to produce a matrix with values of 1. Conversely, when the inputs are generated images $\hat{F}_1$ and $C_1$, the discriminator is expected to yield a matrix with values of 0.
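A minimal conditional PatchGAN discriminator consistent with this description is sketched below; the channel widths, the number of downsampling layers, and the omission of BN in the first block follow common PatchGAN practice and are assumptions rather than the exact configuration in Figure 5:

import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Conditional PatchGAN: the (generated or real) fine image is concatenated with the
    coarse image C1; the output is an N x N map of patch-level real/fake scores."""

    def __init__(self, in_ch=12, base_ch=64):            # e.g. 6-band image + 6-band condition
        super().__init__()
        layers, ch = [], in_ch
        for i, out_ch in enumerate([base_ch, base_ch * 2, base_ch * 4, base_ch * 8]):
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2 if i < 3 else 1, padding=1)]
            if i > 0:
                layers += [nn.BatchNorm2d(out_ch)]
            layers += [nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]   # patch score map
        self.net = nn.Sequential(*layers)

    def forward(self, fine, c1):
        return self.net(torch.cat([fine, c1], dim=1))

disc = PatchDiscriminator()
score_map = disc(torch.randn(1, 6, 256, 256), torch.randn(1, 6, 256, 256))
print(score_map.shape)   # torch.Size([1, 1, 30, 30])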

3.3. Compound Loss Function

The compound loss function consists of four components: the $l_1$ loss, the spectral loss $l_{\mathrm{spect}}$, the structural loss $l_{\mathrm{struct}}$, and the GAN loss $l_{\mathrm{GAN}}^{G}$. The total loss function of the generator network is as follows:
$l_G = \alpha\, l_{\mathrm{GAN}}^{G} + l_1 + l_{\mathrm{spect}} + l_{\mathrm{struct}}$
where $\alpha = 0.01$. The $l_1$ loss is used to calculate the difference between the values of the fused image and the true image at the same pixel position. To ensure the spectral and structural fidelity of the generated remote sensing images, the cosine similarity and multi-scale structural similarity (MS-SSIM) between the true image and the fused image are computed, respectively:
$l_1 = \frac{1}{K}\sum_{k=1}^{K}\left\lVert T_k - P_k \right\rVert_1$
$l_{\mathrm{spect}} = I - \frac{T \cdot P}{\lVert T \rVert_2\, \lVert P \rVert_2}$
$l_{\mathrm{struct}} = I - \left[l_M(T,P)\right]^{\upsilon_M} \cdot \prod_{j=1}^{M}\left[c_j(T,P)\right]^{\rho_j}\left[s_j(T,P)\right]^{\mu_j}$
where $T$ and $P$ denote the ground truth and fused images, respectively. $\lVert \cdot \rVert_p$ represents the $p$-norm, $K$ is the number of pixels in the image, $I$ is a tensor of ones, $M$ is the number of scales, and $l_j$, $c_j$, $s_j$ are the luminance, contrast, and structural comparisons at the $j$-th scale, respectively. $\upsilon_M$, $\rho_j$, $\mu_j$ are the corresponding weighting coefficients.
The GAN loss is the Least Squares GAN loss, specifically:
$l_{\mathrm{GAN}}^{G} = \frac{1}{2}\,\mathbb{E}_{F_0, C_0, C_1}\left[\left(D\left(G\left(F_0, C_0, C_1\right), C_1\right) - 1\right)^2\right]$
$l_{\mathrm{GAN}}^{D} = \frac{1}{2}\,\mathbb{E}_{F_1, C_1}\left[\left(D\left(F_1, C_1\right) - 1\right)^2\right] + \frac{1}{2}\,\mathbb{E}_{F_0, C_0, C_1}\left[\left(D\left(G\left(F_0, C_0, C_1\right), C_1\right)\right)^2\right]$
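A compact sketch of the compound objective is given below, assuming PyTorch. The $l_1$, spectral (cosine), and LSGAN terms follow the formulas above; the MS-SSIM structural term is passed in externally (e.g., computed with any MS-SSIM implementation) rather than re-implemented here:

import torch
import torch.nn.functional as F

def l1_loss(pred, target):
    return F.l1_loss(pred, target)                        # pixel-wise fidelity

def spectral_loss(pred, target, eps=1e-8):
    # 1 - cosine similarity along the band dimension, averaged over pixels
    cos = F.cosine_similarity(pred, target, dim=1, eps=eps)
    return (1.0 - cos).mean()

def lsgan_g_loss(d_fake):
    # the generator wants the discriminator's patch scores on fakes to approach 1
    return 0.5 * ((d_fake - 1.0) ** 2).mean()

def lsgan_d_loss(d_real, d_fake):
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def generator_loss(pred, target, d_fake, struct_loss, alpha=0.01):
    # struct_loss: 1 - MS-SSIM(pred, target), supplied by an external MS-SSIM routine
    return alpha * lsgan_g_loss(d_fake) + l1_loss(pred, target) \
        + spectral_loss(pred, target) + struct_loss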

4. Experiments and Results

4.1. Study Areas and Datasets

In the experiments, three public datasets from different locations are used in this study: the Coleambally Irrigation Area (CIA) [2], the Lower Gwydir Catchment (LGC) [2], and the Tianjin dataset [48]. The CIA study site is located in southern New South Wales, Australia. The dominant crop in the area is rice, grown under modern irrigation systems. The CIA dataset contains 17 cloud-free MODIS-Landsat data pairs from 2001 to 2002. Each image has a spatial size of 1720 × 2040 with six bands. Land cover was essentially unchanged during the period of the CIA dataset collection, but phenology varied considerably; the dataset can thus be used to verify the performance of DSEPGAN in predicting phenological change. The LGC study site is located in northern New South Wales and contains 14 cloud-free MODIS-Landsat data pairs of size 6 × 2720 × 3200 from 2004 to 2005. A major flood occurred in mid-December 2004, causing inundation of about 44% of the area. The LGC can therefore be recognized as a site where abrupt change scenarios exist and is well-suited to test the performance of DSEPGAN in land cover change prediction. The Tianjin study site is located in the northern part of China and encompasses 27 pairs of MODIS-Landsat data from 2013 to 2019, with a data size of 6 × 2100 × 1970. This dataset can be utilized to evaluate the accuracy of the STF methods in predicting phenological changes in urban areas.
All datasets are divided into training and testing parts. For the CIA dataset, images other than the MODIS-Landsat image pairs on 25 November 2001, 12 January 2002, and 22 February 2002 are used as training data. In the test phase, the image pair from 25 November 2001 and the coarse image from 12 January 2002 are used to predict the fine image from 12 January 2002. For the LGC dataset, the image pairs of other dates, excluding 26 November 2004, 12 December 2004, and 28 December 2004, are used as the training data, and in the testing phase, the image pair on 26 November 2004 and the coarse image on 12 December 2004 are used to predict the fine image on 12 December 2004. For the Tianjin dataset, excluding the images from 16 April 2015, 18 May 2015, and 4 May 2016, all other images are designated as training data. During the testing phase, the image pair from 16 April 2015 is used, along with the coarse image from 18 May 2015, to predict the fine image from 18 May 2015. It is worth noting that due to striping noise present in the last two bands of some Tianjin data, only the first four bands of the Tianjin dataset are used. Figure 6 presents three sets of test data, showcasing noticeable differences among them. Specifically, the CIA data exhibit a relatively similar land cover type at the two different times, yet significant phenological disparities are evident with noticeable color changes. The LGC data display considerable differences in imagery due to sudden floods. As for the Tianjin dataset, it not only captures phenological changes in urban areas but also reflects color variations stemming from data acquired by different sensors.

4.2. Experiment Design and Evaluation

Three traditional STF methods (STARFM [9], FSDAF [8], and Fit-FC [11]) and five DL-based methods (EDCSTFN [20], GAN-STFM [25], MLFF-GAN [23], STF-Trans [28], and CTSTFM [37]) are included for comparison. These methods cover representative architectures such as GAN-based models, pyramid structures, and CNN-Transformer hybrids. To clearly highlight what distinguishes DSEPGAN from existing approaches, Table 1 provides a structured comparison of DL-based methods in terms of network architecture, multiscale feature modeling, long-range dependency modeling, and loss function design.
For a fair comparison, similar experimental settings are used for the traditional and DL-based methods. For the three traditional algorithms, the size of the moving window for searching similar pixels is uniformly set to 41 × 41, and the number of similar pixels is 20. Meanwhile, for the five DL-based methods, the same training dataset generation method and data enhancement method are used. When generating the training set, for each prediction date, a pair of images from another date is randomly selected as reference images. The input images are 256 × 256 in size and are cropped from the original images with a step of 200. A total of 1260 sets of samples are obtained from the CIA, 2464 sets of samples are obtained from the LGC, and 2530 sets of samples are obtained from Tianjin. The parameter settings of EDCSTFN, GAN-STFM, and MLFF-GAN follow the original design. Due to the significant difference in size between STF-Trans and other DL-based models (even the smallest model, STF-Trans-Small, contains approximately 20 million parameters), STF-Trans-Small is selected as the comparison model, with the dimension set to 128. CTSTFM uses an embedding dimension of 64. For DSEPGAN, the initial learning rate is 2 × 10−4. All DL-based methods are trained on 1 NVIDIA RTX 3090 GPU using the data enhancement strategy of image random flip or rotation, and the training strategy of starting from scratch; no other tricks are used. For reproducibility purposes, the implementation of the proposed DSEPGAN is publicly available at https://github.com/ZhouDDCUG/DSEPGAN (accessed on 14 December 2025).
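For reference, the patch extraction used to build the training set can be sketched as a 256 × 256 sliding window with a step of 200; how image borders that do not fit a full window are handled (discarded here) is an assumption:

def crop_patches(image, patch=256, step=200):
    """Slide a patch x patch window with the given step over a (bands, H, W) array."""
    bands, h, w = image.shape
    patches = []
    for top in range(0, h - patch + 1, step):
        for left in range(0, w - patch + 1, step):
            patches.append(image[:, top:top + patch, left:left + patch])
    return patches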
Six different model evaluation metrics are used: Root Mean Square Error (RMSE), Structural Similarity (SSIM), Universal Image Quality Index (UIQI), Correlation Coefficient (CC), Spectral Angle Mapper (SAM), and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS). The ideal values of RMSE, SSIM, UIQI, and CC are 0, 1, 1, and 1, respectively. Smaller values of RMSE, SAM, and ERGAS and larger values of SSIM, UIQI, and CC indicate lower uncertainty of the fusion result. Additionally, for image visualization, due to the significant presence of vegetation in the image scenes, the NIR-Red-Green channels are selected as the RGB channels. Scatterplots are also utilized to compare the performance of various methods, depicting the correspondence between pixel values of predicted and actual images across different ranges. Data for generating scatterplots are derived from image band averages, with scatter points rendered in different colors based on scatter density. The average absolute difference (AAD) maps are used to observe the error distribution across the entire image.
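For clarity, minimal NumPy sketches of three of the metrics are given below; the resolution ratio used in ERGAS (here 30 m / 480 m) and the per-band normalization follow a common formulation and are assumptions, not necessarily the exact implementation used in this study:

import numpy as np

def rmse(t, p):
    return float(np.sqrt(np.mean((t - p) ** 2)))

def sam(t, p, eps=1e-8):
    # mean spectral angle (degrees) between per-pixel band vectors, arrays of shape (bands, H, W)
    dot = np.sum(t * p, axis=0)
    cos = dot / (np.linalg.norm(t, axis=0) * np.linalg.norm(p, axis=0) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def ergas(t, p, ratio=1 / 16):
    # ratio = fine pixel size / coarse pixel size (assumed 30 m Landsat vs. ~480 m MODIS-like)
    band_terms = [np.mean((t[b] - p[b]) ** 2) / (np.mean(t[b]) ** 2 + 1e-8)
                  for b in range(t.shape[0])]
    return float(100.0 * ratio * np.sqrt(np.mean(band_terms)))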

4.3. Experimental Results for CIA

The ground truth for the CIA data, along with the fusion results obtained by the nine methods and their corresponding AAD maps, is shown in Figure 7. It is worth pointing out that the lower-right corner of the CIA dataset is partially missing, so corresponding deletions have been made in the results for all methods. The image results obtained by all nine methods are greenish, illustrating that all methods correctly predicted the phenological changes to some extent, but there are obvious differences in the details.
Specifically, the results of STARFM, FSDAF, and Fit-FC all suffer from unpleasant color-mixing problems. Where the ground truth is white patches, the results obtained by STARFM and FSDAF show varying degrees of yellow, and the results obtained by Fit-FC are light red; where the ground truth is red patches, the results obtained by STARFM and FSDAF appear black, and the boundaries between patches are blurred, which is particularly serious in the STARFM method. The fusion results obtained by EDCSTFN do not have the above serious spectral distortion problem, but they are affected by the spatial blurring checkerboard effect, which makes the feature boundaries seriously distorted. In terms of appearance and texture, GAN-STFM, MLFF-GAN, STF-Trans, CTSTFM, and DSEPGAN outperform the other methods. From the AAD maps, it can be observed that the errors in all results are mostly concentrated in the detail-rich farmland areas, with the latter five methods outperforming the former four methods. The zoomed-in subregion in the black-boxed portion in Figure 7a is shown in Figure 8. Firstly, it is obvious that DSEPGAN predicts the farmland boundary accurately and clearly, which is significantly better than GAN-STFM, MLFF-GAN, STF-Trans, and CTSTFM. Secondly, the color of the circular object in the yellow box is also more similar to the ground truth. From the visual evaluation, it can be concluded that, by considering multi-scale features, detail preservation, and long-range feature aggregation, DSEPGAN achieves more accurate boundary prediction and stronger detail preservation than the other DL-based methods.
Table 2 shows the quantitative assessment metrics for the nine methods on CIA data, with the optimal values shown in bold. Compared with all eight comparison methods, DSEPGAN has excellent performance on the CIA data, achieving the best results in five of the six overall metrics (RMSE, SSIM, CC, ERGAS, and SAM). CTSTFM proves to be the strongest competitor, achieving the highest average UIQI value and securing the second-best performance in RMSE and CC. However, DSEPGAN still significantly surpasses CTSTFM in crucial structural (SSIM: 0.8557 vs. 0.8503) and spectral accuracy metrics (e.g., SAM: 8.1507 vs. 8.2494), solidifying its position as the optimal performer. From the table, it can also be found that the experimental results of the DL-based method EDCSTFN are poorer compared with the traditional methods (STARFM, FSDAF, Fit-FC), which may be because EDCSTFN, with its small number of parameters, cannot satisfy the demands of spatiotemporal fusion when more training data are available. The other five DL-based methods outperform the traditional methods. GAN-STFM, MLFF-GAN, and DSEPGAN all use the GAN-based STF framework, indicating that the generative adversarial approach is well-suited for STF tasks. Meanwhile, the strong results achieved by CTSTFM and STF-Trans highlight the critical importance of effective long-range dependency modeling in current SOTA fusion architectures. Figure 9 shows the scatter plot of predicted image values and ground truth values. The narrower the scatter region, the higher the pixel similarity. Intuitively, it can be seen that the images generated by DSEPGAN are more similar to the ground truth and have the highest coefficient of determination. Overall, the qualitative and quantitative results show that the proposed DSEPGAN is advantageous in capturing the characteristics of phenological changes.

4.4. Experimental Results for LGC

In the STF task, abrupt land cover changes are more challenging to capture than phenological changes. The ground truth of the LGC, along with the fusion results and AAD maps, is presented in Figure 10. As can be seen, while the eight comparison methods successfully predict the approximate coverage area of the sudden flooding portion, the hybrid model, CTSTFM, struggles with accurately reconstructing the flood’s contour. Figure 11 presents a zoomed-in subregion of the black-boxed portion in Figure 10a. Firstly, the prediction results from the farmland in the upper right corner indicate that DSEPGAN produces clearer boundaries and more uniform colors compared to other methods. Secondly, upon further zooming into the waters within the yellow box, it is evident that results from STARFM, FSDAF, Fit-FC, and EDCSTFN exhibit fuzzy edges, with the waters blending with surrounding features. GAN-STFM and MLFF-GAN fail to predict the shape of the waters accurately, with incorrect or fuzzy boundaries in the lower part. The result of STF-Trans is overall relatively blurry. CTSTFM’s predicted water body exhibits a notably dark tone, and its edges are significantly blurred. This suggests that DSEPGAN outperforms other methods in recovering the target image in scenes with complex changes.
A quantitative comparison of the fusion results for the LGC data is listed in Table 3. It is inferred that FSDAF is not necessarily better than STARFM with the same parameter settings. Meanwhile, unlike the CIA data, the gap between the DL-based methods and the traditional methods narrows in the first four bands of the LGC data, although the DL-based methods still achieve excellent performance in the last two bands. CTSTFM shows lower robustness on LGC. DSEPGAN demonstrates superior performance, achieving the best results in all six overall assessment metrics. Additionally, DSEPGAN’s SSIM (0.7814) metric is significantly better than that of all other methods (e.g., 0.0173 higher than the next best, MLFF-GAN’s 0.7641), indicating its superior ability to restore image structure in complex, changing scenes. The scatterplot of the averaged bands presented in Figure 12 also indicates that the predicted values of DSEPGAN are closer to the actual values than those of the other methods, with the highest coefficient of determination. Overall, the proposed method achieves the fusion results closest to the ground truth.

4.5. Experimental Results for Tianjin

The Tianjin dataset, originating from urban areas, exhibits richer detail features compared to the CIA and LGC. Figure 13 illustrates the ground truth of the Tianjin data, along with the fused images obtained by nine different methods and their corresponding AAD maps. Overall, all results show minimal influence from sensor disparities, essentially predicting urban phenological changes. However, there are some differences in detail among them. The image generated by STARFM displays erroneous yellow and green patches in the upper portion, while Fit-FC shows a yellow area resembling cloud or haze pollution on the left side. Due to pixel-level fusion, the images produced by STARFM, FSDAF, and Fit-FC demonstrate satisfactory detail effects. Color deviations in EDCSTFN are mainly observed in the water regions. The results of STF-Trans and CTSTFM exhibit significant issues with detail loss and blurriness, failing to reconstruct the complex urban textures accurately. From the AAD maps, it can be observed that EDCSTFN exhibits significant errors in the upper part, while DSEPGAN shows relatively lower overall errors.
Figure 14 depicts an enlarged sub-area within the black box of Figure 13a. It can be observed that the upper portion of the results from STARFM, FSDAF, and Fit-FC appear pinkish compared to the ground truth, while EDCSTFN displays a more pronounced color deviation. The MLFF-GAN fused image has a lighter shade of red in the lower portion. The results of STF-Trans exhibit issues with detail loss. Crucially, the result generated by CTSTFM in this magnified sub-area is noticeably blurry and exhibits severe green color distortion compared to the ground truth. Zooming into the magnified section within the yellow box, DSEPGAN appears more similar to the ground truth.
Table 4 presents a quantitative comparison of the fusion results for the Tianjin dataset. Figure 15 illustrates the average band scatter plots for the fusion results of all methods. It can be observed that significant deviations occur in the first and third bands of the results obtained from STARFM, as depicted in Figure 15a, where the scatter points exhibit a noticeable dispersion trend, indicating an uneven distribution with a relatively low coefficient of determination. Interestingly, the traditional methods FSDAF and Fit-FC exhibit strong quantitative indicators in this scenario, outperforming several DL-based methods in Avg RMSE and Avg SSIM; in the near-infrared band, FSDAF even outperforms all other methods, and its scatter points (Figure 15b) are densely distributed along the diagonal line. This suggests that conventional methods demonstrate comparable performance to DL-based methods in certain scenarios. GAN-STFM and MLFF-GAN show superior RMSE and SSIM metrics in the first three bands. STF-Trans performs worse on the Tianjin dataset compared to its performance on the CIA and LGC datasets. CTSTFM, similar to its performance on the LGC dataset, fails on this urban scene as well. However, DSEPGAN is the only method that delivers top-tier performance consistently across all bands and all evaluation criteria, demonstrating its adaptability in urban areas.

5. Discussion

5.1. Model Efficiency

The model size, computational complexity, training speed, and inference efficiency are tested to show the model’s efficiency. Model size is typically represented by the number of parameters in the model, which includes all learnable weights and biases. Computational complexity is denoted using floating-point operations (FLOPs) when the input size is (8, 6, 256, 256). FLOPs do not perfectly reflect the model’s training time; therefore, the average training iteration time, referred to as Batch-Time, is also included. Test-Time represents the average inference time required to generate one predicted image.
As listed in Table 5, the inference time for traditional models is significantly higher than that of DL methods, as they rely on computationally expensive, pixel-wise operations like extensive neighborhood searching and iterative optimization. We provide the parameters and FLOPs for both the generators and discriminators of the GAN-based STF methods. EDCSTFN has the fewest parameters and the shortest training time. Despite GAN-STFM having a relatively smaller generator, its computational workload is higher due to the network performing feature extraction at full resolution during forward propagation. This results in the longest training time among the compared models. STF-Trans has the highest computational load, requiring more resources during inference, but its Transformer-based design allows for competitive training speeds due to easier parallelized computation. Crucially, when comparing inference speed, STF-Trans is the fastest deep learning model (7.06 s). DSEPGAN shares the same discriminator as MLFF-GAN, with over 500,000 fewer generator parameters than MLFF-GAN. However, DSEPGAN’s use of INN and large convolutional kernels increases its computational load, resulting in longer training times compared to MLFF-GAN. DSEPGAN achieves a test time of 15.10 s, demonstrating that while it is not the fastest deep learning model, its efficiency is orders of magnitude better than traditional methods, successfully balancing superior accuracy with manageable computational overhead.

5.2. Ablation Study

Different models are designed to investigate the effects of the PTCE, PSDE, and HLRFA modules in the proposed method.
(a) DSEPGA-Diff: The input to the PTCE module is replaced from two coarse images to the difference in the coarse images Δ C = C 1 C 0 . This model is used to analyze the impact of temporal change features obtained in different ways on DSEPGAN.
(b) DSPGAN: The PTCE module is directly replaced with the PSDE module to verify the role of invertible layers in information retention.
(c) DSEPGAN-Conv: The Large Kernel CNN Fusion module in the HLRFA is replaced with a 3 × 3 convolution operator to validate the role of long-range feature modeling in DSEPGAN.
(d) DSEPGAN-Trans: The LKConv Block in the HLRFA is replaced with multi-head self-attention from the Vision Transformer [26]. The input to the self-attention module is a sequence of tokens, each generated from an 8 × 8 image patch.
(e) DSEPGAN w/o ConvFFN: The ConvFFN module in the HLRFA is directly removed to demonstrate its necessity. The ConvFFN follows the large kernel convolution to prevent information loss and enhance the feature representation capability of the model.
The results are shown in Table 6, with the best results highlighted in bold. RMSE, SSIM, and ERGAS are used to measure the model’s performance. The ablation results across both the CIA and LGC datasets consistently validate the necessity and effectiveness of the proposed modules. First, the reduced performance of DSEPGAN-Diff on both CIA and LGC suggests that directly inputting the difference between the coarse images may lead to the loss of essential spectral information from the predicted date’s coarse image. Second, the results of DSPGAN (w/o INN) clearly demonstrate the positive contribution of the INN. Third, DSEPGAN-Trans, despite its complexity (10.8 M parameters), did not surpass the full DSEPGAN model. Patch-based tokenization smooths local textures, unlike the Large-Kernel Convolution, which expands the receptive field densely at the pixel level. LKConv provides a more balanced and stable solution for STF by efficiently capturing long-range dependencies while preserving detailed textures. The removal of the ConvFFN module (DSEPGAN w/o ConvFFN) confirms its necessity for stabilizing local feature representation after long-range aggregation.
We observe that introducing the INN and ConvFFN modules increases the number of parameters and computational complexity, leading to longer training times. In particular, the addition of INN has a greater impact on training time due to its inherent high computational complexity and storage requirements. However, all these increases in computational cost remain within acceptable limits.

5.3. Parameter Analysis

5.3.1. Kernel Size

The impact of varying depthwise convolution kernel sizes (i.e., $K$ in Figure 4) on the experimental results is demonstrated in Table 7. The optimal indicators are highlighted in bold, while the sub-optimal indicators are underlined. It can be seen that the model results show a trend of becoming better as the convolutional kernel size increases, indicating that long-range feature aggregation has an optimizing effect on the model. Additionally, the effect of using convolutional kernels of varying sizes at different fusion stages is examined. The experiments show that such configurations are no better than when $K$ = 13. Moreover, it is observed that Parameters, FLOPs, and Batch-Time undergo minimal changes with the gradual increment of the convolution kernel size, attributed to the lightweight nature of depthwise convolution. Considering the model accuracy and computational cost, the size of the convolution kernel for each fusion stage is set to 13.

5.3.2. Number of Invertible Basic Units

The impact of varying the number of invertible basic units (i.e., N in Figure 3) within each stage of the PSDE on model performance and computational cost is demonstrated in Table 8. It shows that increasing the number of invertible modules results in a decrease in the mean error, an increase in structural similarity, and an overall improvement in accuracy. However, this enhancement comes at the cost of increased model parameters. Additionally, it can be observed that when N is increased to 4, the improvement in model performance is not as significant as before. Therefore, in this study, all other experiments are conducted with N set to 3.

5.3.3. Number of Stages

The impact of different numbers of stages on model performance and computational costs is illustrated in Table 9. As the number of stages increases, the model’s ability to preserve image structural information improves, leading to continuous advancements in reducing prediction errors and enhancing image quality. However, as the model becomes more complex and deeper, it also escalates the requirements for parameters and computational resources. To match the size of other comparative models, the feature extraction and fusion in DSEPGAN are configured with four stages, indicating that the entire model operates across four distinct scales.

6. Conclusions

In this study, a Dual-Stream Enhanced Pyramid based on Generative Adversarial Network (DSEPGAN) is proposed for spatiotemporal fusion of remote sensing images. The method introduces several technically distinctive designs aimed at improving detail preservation and temporal–spatial consistency in fused products. The framework employs a dual-stream pyramid to decouple temporal change features from spatial detail features and integrates invertible layers to minimize information loss and retain high-frequency details from fine-resolution imagery. A progressive fusion mechanism based on large-kernel convolution enhances long-range context modeling while maintaining computational efficiency through depthwise and pointwise operations. Moreover, adversarial learning improves the realism and statistical fidelity of the fused images by encouraging the generator to match radiometric and structural characteristics of real observations. Extensive experiments on three benchmark datasets demonstrate that DSEPGAN outperforms state-of-the-art STF methods, achieving superior spatial detail preservation, temporal consistency, and robustness. The framework highlights the theoretical contribution of integrating dual-stream pyramids, invertible layers, and long-range context modeling for unified capture of fine-grained details and global features.
Beyond theoretical contributions, DSEPGAN’s superior fidelity and robustness are vital for complex applications requiring high-precision spatiotemporal data, particularly in heterogeneous landscapes and high-frequency urban change monitoring. This advantage stems directly from our core architectural choices: The invertible detail-preserving mechanism ensures maximal fidelity of high-frequency features (e.g., sharp boundaries and fine structures), which is critical for accurately reconstructing urban infrastructure (road networks and building footprints) and achieving high precision in land cover classification at edges. Concurrently, efficient large-kernel long-range modeling enhances global contextual consistency, helping to resolve spectral mixing over large, fragmented patches—a key requirement for applications like precision agriculture management (crop yield forecasting) and rapid disaster assessment (evaluating damage extent) in complex areas.
Despite its overall effectiveness, the DSEPGAN framework presents inherent trade-offs. Specifically, the invertible modules introduce higher computational overhead, and the large-kernel convolutions, while efficient, are inherently limited in capturing truly global dependencies. Addressing these challenges, future work will focus on developing lighter, more streamlined invertible structures and exploring highly efficient global context modeling to further enhance fusion performance.

Author Contributions

Conceptualization, D.Z. and K.W.; data curation, L.X. and H.L.; formal analysis, D.Z. and M.J.; funding acquisition, L.X. and K.W.; methodology, D.Z. and K.W.; validation, M.J.; visualization, D.Z. and H.L.; writing—original draft, D.Z.; writing—review and editing, L.X., K.W., H.L. and M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. U21A2013 and 62071438), the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (Grant No. 2642022009), the Open Fund of Key Laboratory of Space Ocean Remote Sensing and Application, MNR (Grant No. 202401001), the Global Change and Air–Sea Interaction II (Grant No. GASI-01-DLYG-WIND0), the Open Fund of State Key Laboratory of Remote Sensing Science (Grant No. OFSLRSS202312), the Foundation of State Key Laboratory of Public Big Data (Grant No. PBD2023-28), and the Open Fund of Key Laboratory of Regional Development and Environmental Response (Grant No. 2023(A)003).

Data Availability Statement

The implementation of the proposed DSEPGAN is publicly available at https://github.com/ZhouDDCUG/DSEPGAN (accessed on 14 December 2025).

Acknowledgments

The authors express their gratitude to the scholars who produced and shared the codes of STARFM, FSDAF, Fit-FC, EDCSTFN, GAN-STFM, and MLFF-GAN models. The authors would like to thank the editors and anonymous reviewers for their insightful comments and suggestions that led to this improved version and clearer presentation of the technical content.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, X.; Cai, F.; Tian, J.; Williams, T. Spatiotemporal Fusion of Multisource Remote Sensing Data: Literature Survey, Taxonomy, Principles, Applications, and Future Directions. Remote Sens. 2018, 10, 527.
  2. Emelyanova, I.V.; McVicar, T.R.; Van Niel, T.G.; Li, L.T.; Van Dijk, A.I.J.M. Assessing the Accuracy of Blending Landsat–MODIS Surface Reflectances in Two Landscapes with Contrasting Spatial and Temporal Dynamics: A Framework for Algorithm Selection. Remote Sens. Environ. 2013, 133, 193–209.
  3. Pan, L.; Lu, L.; Fu, P.; Nitivattananon, V.; Guo, H.; Li, Q. Understanding Spatiotemporal Evolution of the Surface Urban Heat Island in the Bangkok Metropolitan Region from 2000 to 2020 Using Enhanced Land Surface Temperature. Geomat. Nat. Hazards Risk 2023, 14, 2174904.
  4. Mbabazi, D.; Mohanty, B.P.; Gaur, N. High Spatio-Temporal Resolution Evapotranspiration Estimates Within Large Agricultural Fields by Fusing Eddy Covariance and Landsat Based Data. Agric. For. Meteorol. 2023, 333, 109417.
  5. Zhou, Y.; Liu, T.; Batelaan, O.; Duan, L.; Wang, Y.; Li, X.; Li, M. Spatiotemporal Fusion of Multi-Source Remote Sensing Data for Estimating Aboveground Biomass of Grassland. Ecol. Indic. 2023, 146, 109892.
  6. Wu, W.; Liu, Y.; Li, K.; Yang, H.; Yang, L.; Chen, Z. STFCropNet: A Spatiotemporal Fusion Network for Crop Classification in Multiresolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4736–4750.
  7. Zhukov, B.; Oertel, D.; Lanzl, F.; Reinhackel, G. Unmixing-Based Multisensor Multiresolution Image Fusion. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1212–1226.
  8. Zhu, X.; Helmer, E.H.; Gao, F.; Liu, D.; Chen, J.; Lefsky, M.A. A Flexible Spatiotemporal Method for Fusing Satellite Images with Different Resolutions. Remote Sens. Environ. 2016, 172, 165–177.
  9. Gao, F.; Masek, J.; Schwaller, M.; Hall, F. On the Blending of the Landsat and MODIS Surface Reflectance: Predicting Daily Landsat Surface Reflectance. IEEE Trans. Geosci. Remote Sens. 2006, 44, 2207–2218.
  10. Zhu, X.; Chen, J.; Gao, F.; Chen, X.; Masek, J.G. An Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model for Complex Heterogeneous Regions. Remote Sens. Environ. 2010, 114, 2610–2623.
  11. Wang, Q.; Atkinson, P.M. Spatio-Temporal Fusion for Daily Sentinel-2 Images. Remote Sens. Environ. 2018, 204, 31–42.
  12. Liao, L.; Song, J.; Wang, J.; Xiao, Z.; Wang, J. Bayesian Method for Building Frequent Landsat-Like NDVI Datasets by Integrating MODIS and Landsat NDVI. Remote Sens. 2016, 8, 452.
  13. Xue, J.; Leung, Y.; Fung, T. A Bayesian Data Fusion Approach to Spatio-Temporal Fusion of Remotely Sensed Images. Remote Sens. 2017, 9, 1310.
  14. Huang, B.; Song, H. Spatiotemporal Reflectance Fusion via Sparse Representation. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3707–3716.
  15. Wei, J.; Wang, L.; Liu, P.; Chen, X.; Li, W.; Zomaya, A.Y. Spatiotemporal Fusion of MODIS and Landsat-7 Reflectance Images via Compressed Sensing. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7126–7139.
  16. Tan, Z.; Yue, P.; Di, L.; Tang, J. Deriving High Spatiotemporal Remote Sensing Images Using Deep Convolutional Network. Remote Sens. 2018, 10, 1066.
  17. Zhang, X.; Li, S.; Tan, Z.; Li, X. Enhanced Wavelet Based Spatiotemporal Fusion Networks Using Cross-Paired Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2024, 211, 281–297.
  18. Ren, K.; Sun, W.; Meng, X.; Yang, G. GCM-PDA: A Generative Compensation Model for Progressive Difference Attenuation in Spatiotemporal Fusion of Remote Sensing Images. IEEE Trans. Image Process. 2025, 34, 3817–3832.
  19. Song, H.; Liu, Q.; Wang, G.; Hang, R.; Huang, B. Spatiotemporal Satellite Image Fusion Using Deep Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 821–829.
  20. Tan, Z.; Di, L.; Zhang, M.; Guo, L.; Gao, M. An Enhanced Deep Convolutional Model for Spatiotemporal Image Fusion. Remote Sens. 2019, 11, 2898.
  21. Li, W.; Zhang, X.; Peng, Y.; Dong, M. DMNet: A Network Architecture Using Dilated Convolution and Multiscale Mechanisms for Spatiotemporal Fusion of Remote Sensing Images. IEEE Sens. J. 2020, 20, 12190–12202.
  22. Cao, H.; Luo, X.; Peng, Y.; Xie, T. MANet: A Network Architecture for Remote Sensing Spatiotemporal Fusion Based on Multiscale and Attention Mechanisms. Remote Sens. 2022, 14, 4600.
  23. Song, B.; Liu, P.; Li, J.; Wang, L.; Zhang, L.; He, G.; Chen, L.; Liu, J. MLFF-GAN: A Multilevel Feature Fusion with GAN for Spatiotemporal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16.
  24. Huang, Y.; Li, X.; Du, Z.; Shen, H. Spatiotemporal Enhancement and Interlevel Fusion Network for Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
  25. Tan, Z.; Gao, M.; Li, X.; Jiang, L. A Flexible Reference-Insensitive Spatiotemporal Fusion Model for Remote Sensing Images Using Conditional Generative Adversarial Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  27. Guan, F.; Zhao, N.; Wang, H.; Fang, Z.; Zhang, J.; Yu, Y.; Jiang, L.; Huang, H. Dual-Branch Transformer Framework with Gradient-Aware Weighting Feature Alignment for Robust Cross-View Geo-Localization. Inf. Fusion 2026, 127, 103808.
  27. Guan, F.; Zhao, N.; Wang, H.; Fang, Z.; Zhang, J.; Yu, Y.; Jiang, L.; Huang, H. Dual-Branch Transformer Framework with Gradient-Aware Weighting Feature Alignment for Robust Cross-View Geo-Localization. Inf. Fusion 2026, 127, 103808. [Google Scholar] [CrossRef]
  28. Benzenati, T.; Kallel, A.; Kessentini, Y. STF-Trans: A Two-Stream Spatiotemporal Fusion Transformer for Very High Resolution Satellites Images. Neurocomputing 2024, 563, 126868. [Google Scholar] [CrossRef]
  29. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 3 October 2023; IEEE Computer Soc: Los Alamitos, CA, USA, 2023; pp. 16748–16759. [Google Scholar]
  30. Guan, F.; Zhao, N.; Fang, Z.; Jiang, L.; Zhang, J.; Yu, Y.; Huang, H. Multi-Level Representation Learning via ConvNeXt-Based Network for Unaligned Cross-View Matching. Geo-Spat. Inf. Sci. 2025, 28, 2344–2357. [Google Scholar] [CrossRef]
  31. Li, W.; Yang, C.; Peng, Y.; Du, J. A Pseudo-Siamese Deep Convolutional Neural Network for Spatiotemporal Satellite Image Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1205–1220. [Google Scholar] [CrossRef]
  32. Chen, Y.; Shi, K.; Ge, Y.; Zhou, Y. Spatiotemporal Remote Sensing Image Fusion Using Multiscale Two-Stream Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  33. Liu, Q.; Meng, X.; Shao, F.; Li, S. PSTAF-GAN: Progressive Spatio-Temporal Attention Fusion Method Based on Generative Adversarial Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  34. Hu, C.; Ma, M.; Ma, X.; Zhang, H.; Wu, D.; Gao, G.; Zhang, W. STANet: Spatiotemporal Adaptive Network for Remote Sensing Images. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 9–12 October 2023; IEEE: New York, NY, USA, 2023; pp. 3429–3433. [Google Scholar]
  35. Chen, G.; Jiao, P.; Hu, Q.; Xiao, L.; Ye, Z. SwinSTFM: Remote Sensing Spatiotemporal Fusion Using Swin Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 9992–10002. [Google Scholar]
  37. Jiang, M.; Shao, H. A CNN-Transformer Combined Remote Sensing Imagery Spatiotemporal Fusion Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13995–14009. [Google Scholar] [CrossRef]
  38. Wang, W.; Deng, L.-J.; Ran, R.; Vivone, G. A General Paradigm with Detail-Preserving Conditional Invertible Network for Image Fusion. Int. J. Comput. Vis. 2024, 132, 1029–1054. [Google Scholar] [CrossRef]
  39. Wang, J.; Lu, T.; Huang, X.; Zhang, R.; Feng, X. Pan-Sharpening via Conditional Invertible Neural Network. Inf. Fusion 2024, 101, 101980. [Google Scholar] [CrossRef]
  40. Liu, H.; Shao, M.; Qiao, Y.; Wan, Y.; Meng, D. Unpaired Image Super-Resolution Using a Lightweight Invertible Neural Network. Pattern Recognit. 2023, 144, 109822. [Google Scholar] [CrossRef]
  41. Huang, J.-J.; Dragotti, P.L. WINNet: Wavelet-Inspired Invertible Network for Image Denoising. IEEE Trans. Image Process. 2022, 31, 4377–4392. [Google Scholar] [CrossRef]
  42. Shang, F.; Lan, Y.; Yang, J.; Li, E.; Kang, X. Robust Data Hiding for JPEG Images with Invertible Neural Network. Neural Netw. 2023, 163, 219–232. [Google Scholar] [CrossRef]
  43. Zhou, M.; Fu, X.; Huang, J.; Zhao, F.; Liu, A.; Wang, R. Effective Pan-Sharpening with Transformer and Invertible Neural Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  44. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  45. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 4510–4520. [Google Scholar]
  46. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 5967–5976. [Google Scholar]
  47. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2242–2251. [Google Scholar]
  48. Li, J.; Li, Y.; He, L.; Chen, J.; Plaza, A. Spatio-Temporal Fusion for Remote Sensing Data: An Overview and New Benchmark. Sci. China Inf. Sci. 2020, 63, 140301. [Google Scholar] [CrossRef]
Figure 1. Generator of DSEPGAN.
Figure 2. Pyramid time change extractor.
Figure 3. Pyramid space detail extractor.
Figure 4. Hierarchical long-range feature aggregation.
Figure 5. Discriminator architecture adopts PatchGAN.
Figure 6. The experimental datasets for STF. The first row shows the CIA data; from left to right: (a) MODIS, 25 November 2001; (b) Landsat, 25 November 2001; (c) MODIS, 12 January 2002; (d) Landsat, 12 January 2002. The second row shows the LGC data; from left to right: (e) MODIS, 26 November 2004; (f) Landsat, 26 November 2004; (g) MODIS, 12 December 2004; (h) Landsat, 12 December 2004. The third row shows the Tianjin data; from left to right: (i) MODIS, 16 April 2015; (j) Landsat, 16 April 2015; (k) MODIS, 18 May 2015; (l) Landsat, 18 May 2015.
Figure 7. Fusion results for the CIA dataset with different methods. The first row shows the fused images, and the second row presents the corresponding AAD maps between the fused results and the reference image. (a) Ground truth, (b) STARFM, (c) FSDAF, (d) Fit-FC, (e) EDCSTFN, (f) GAN-STFM, (g) MLFF-GAN, (h) STF-Trans, (i) CTSTFM, (j) DSEPGAN.
Figure 8. Subregions of fusion results for the CIA dataset with different methods. (a) Ground truth, (b) STARFM, (c) FSDAF, (d) Fit-FC, (e) EDCSTFN, (f) GAN-STFM, (g) MLFF-GAN, (h) STF-Trans, (i) CTSTFM, (j) DSEPGAN.
Figure 9. Scatter plots for the CIA dataset with different methods. (a) STARFM, (b) FSDAF, (c) Fit-FC, (d) EDCSTFN, (e) GAN-STFM, (f) MLFF-GAN, (g) STF-Trans, (h) CTSTFM, (i) DSEPGAN.
Figure 10. Fusion results for the LGC dataset with different methods. The first row shows the fused images, and the second row presents the corresponding AAD maps between the fused results and the reference image. (a) Ground truth, (b) STARFM, (c) FSDAF, (d) Fit-FC, (e) EDCSTFN, (f) GAN-STFM, (g) MLFF-GAN, (h) STF-Trans, (i) CTSTFM, (j) DSEPGAN.
Figure 11. Subregions of fusion results for the LGC dataset with different methods. (a) Ground truth, (b) STARFM, (c) FSDAF, (d) Fit-FC, (e) EDCSTFN, (f) GAN-STFM, (g) MLFF-GAN, (h) STF-Trans, (i) CTSTFM, (j) DSEPGAN.
Figure 12. Scatter plots for the LGC dataset with different methods. (a) STARFM, (b) FSDAF, (c) Fit-FC, (d) EDCSTFN, (e) GAN-STFM, (f) MLFF-GAN, (g) STF-Trans, (h) CTSTFM, (i) DSEPGAN.
Figure 13. Fusion results for the Tianjin dataset with different methods. The first row shows the fused images, and the second row presents the corresponding AAD maps between the fused results and the reference image. (a) Ground truth, (b) STARFM, (c) FSDAF, (d) Fit-FC, (e) EDCSTFN, (f) GAN-STFM, (g) MLFF-GAN, (h) STF-Trans, (i) CTSTFM, (j) DSEPGAN.
Figure 14. Subregions of fusion results for the Tianjin dataset with different methods. (a) Ground truth, (b) STARFM, (c) FSDAF, (d) Fit-FC, (e) EDCSTFN, (f) GAN-STFM, (g) MLFF-GAN, (h) STF-Trans, (i) CTSTFM, (j) DSEPGAN.
Figure 15. Scatter plots for the Tianjin dataset with different methods. (a) STARFM, (b) FSDAF, (c) Fit-FC, (d) EDCSTFN, (e) GAN-STFM, (f) MLFF-GAN, (g) STF-Trans, (h) CTSTFM, (i) DSEPGAN.
Table 1. DL-based methods comparison table.

Model | Network Architecture | Multi-Scale Mechanism | Long-Range Modeling | Loss Functions
EDCSTFN | CNN | None | None | MSE loss, feature loss, and structure loss
GAN-STFM | CNN + GAN | None | None | GAN loss, feature loss, spectrum loss, and structure loss
MLFF-GAN | CNN + GAN with a pyramid encoder | Pyramid downsampling | None | GAN loss, L1 loss, spectrum loss, and structure loss
STF-Trans | CNN + Transformer | None | Transformer attention | L1 loss, high-frequency loss, and total variation
CTSTFM | CNN + Transformer | Multikernel CNN | Transformer attention | L1 loss
DSEPGAN | CNN + INN + GAN with a dual-stream pyramid encoder | Pyramid downsampling + detail-preserving mechanism | Large-kernel convolution | GAN loss, L1 loss, spectrum loss, and structure loss
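For illustration, the "large-kernel convolution" mechanism listed for DSEPGAN in Table 1 can be realized with depthwise separable filters, which provide a large receptive field at modest parameter cost. The following is a minimal PyTorch sketch of such a block; the channel count and kernel size are placeholders, and it is a generic example rather than the exact fusion module of this paper.

```python
import torch
import torch.nn as nn


class LargeKernelDWBlock(nn.Module):
    """Illustrative large-kernel depthwise-separable convolution block.

    Generic sketch of the mechanism named in Table 1, not the exact
    DSEPGAN fusion module; channel and kernel sizes are placeholders.
    """

    def __init__(self, channels: int, kernel_size: int = 13):
        super().__init__()
        # Depthwise convolution with a large spatial kernel captures
        # long-range context at low parameter cost.
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False)
        # Pointwise (1x1) convolution mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.depthwise(x)
        out = self.pointwise(out)
        return x + self.act(self.norm(out))  # residual connection


if __name__ == "__main__":
    block = LargeKernelDWBlock(channels=64, kernel_size=13)
    feats = torch.randn(1, 64, 128, 128)
    print(block(feats).shape)  # torch.Size([1, 64, 128, 128])
```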
Table 2. Quantitative assessment results for the CIA dataset with different methods.

Metric | Band | STARFM | FSDAF | Fit-FC | EDCSTFN | GAN-STFM | MLFF-GAN | STF-Trans | CTSTFM | DSEPGAN
RMSE | 1 | 0.0166 | 0.0161 | 0.0157 | 0.0162 | 0.0126 | 0.0122 | 0.0117 | 0.0133 | 0.0115
RMSE | 2 | 0.0250 | 0.0234 | 0.0235 | 0.0255 | 0.0180 | 0.0180 | 0.0172 | 0.0173 | 0.0176
RMSE | 3 | 0.0399 | 0.0364 | 0.0376 | 0.0436 | 0.0293 | 0.0286 | 0.0275 | 0.0273 | 0.0291
RMSE | 4 | 0.0496 | 0.0509 | 0.0472 | 0.0445 | 0.0401 | 0.0395 | 0.0374 | 0.0371 | 0.0368
RMSE | 5 | 0.0460 | 0.0459 | 0.0468 | 0.0477 | 0.0396 | 0.0374 | 0.0378 | 0.0375 | 0.0353
RMSE | 6 | 0.0379 | 0.0384 | 0.0385 | 0.0398 | 0.0344 | 0.0334 | 0.0326 | 0.0316 | 0.0311
RMSE | Avg | 0.0358 | 0.0352 | 0.0349 | 0.0362 | 0.0290 | 0.0282 | 0.0274 | 0.0273 | 0.0269
SSIM | 1 | 0.8928 | 0.9007 | 0.8939 | 0.8850 | 0.9250 | 0.9286 | 0.9299 | 0.9271 | 0.9343
SSIM | 2 | 0.8464 | 0.8594 | 0.8536 | 0.8259 | 0.8966 | 0.8956 | 0.8988 | 0.9026 | 0.9067
SSIM | 3 | 0.7739 | 0.7910 | 0.7873 | 0.7165 | 0.8413 | 0.8463 | 0.8462 | 0.8543 | 0.8559
SSIM | 4 | 0.6811 | 0.6740 | 0.6785 | 0.7147 | 0.7608 | 0.7687 | 0.7674 | 0.7835 | 0.7901
SSIM | 5 | 0.7429 | 0.7498 | 0.7462 | 0.7377 | 0.8006 | 0.8056 | 0.804 | 0.8080 | 0.8175
SSIM | 6 | 0.7748 | 0.7800 | 0.7723 | 0.7749 | 0.8178 | 0.8193 | 0.8231 | 0.8265 | 0.8300
SSIM | Avg | 0.7853 | 0.7925 | 0.7886 | 0.7758 | 0.8404 | 0.8440 | 0.8449 | 0.8503 | 0.8557
UIQI | 1 | 0.8140 | 0.8293 | 0.8190 | 0.8152 | 0.8979 | 0.9113 | 0.9131 | 0.9107 | 0.9175
UIQI | 2 | 0.8152 | 0.8387 | 0.8275 | 0.8023 | 0.9107 | 0.9145 | 0.9209 | 0.9225 | 0.9200
UIQI | 3 | 0.8165 | 0.8498 | 0.8400 | 0.7815 | 0.9132 | 0.9178 | 0.9254 | 0.9264 | 0.9177
UIQI | 4 | 0.8275 | 0.8262 | 0.8370 | 0.8806 | 0.8976 | 0.9043 | 0.9124 | 0.9178 | 0.9173
UIQI | 5 | 0.9222 | 0.9247 | 0.9224 | 0.9174 | 0.9444 | 0.9491 | 0.9534 | 0.9528 | 0.9544
UIQI | 6 | 0.9206 | 0.9215 | 0.9193 | 0.9174 | 0.9368 | 0.9408 | 0.9465 | 0.9475 | 0.9468
UIQI | Avg | 0.8527 | 0.8650 | 0.8609 | 0.8524 | 0.9168 | 0.9230 | 0.9286 | 0.9296 | 0.9290
CC | 1 | 0.8320 | 0.8370 | 0.8401 | 0.8331 | 0.9022 | 0.9116 | 0.9167 | 0.9148 | 0.9195
CC | 2 | 0.8369 | 0.8501 | 0.8473 | 0.8234 | 0.9153 | 0.9160 | 0.9232 | 0.9247 | 0.9278
CC | 3 | 0.8454 | 0.8654 | 0.8551 | 0.8013 | 0.9165 | 0.9198 | 0.9266 | 0.9275 | 0.9284
CC | 4 | 0.8344 | 0.8281 | 0.8459 | 0.8813 | 0.8994 | 0.9049 | 0.9137 | 0.9187 | 0.9185
CC | 5 | 0.9222 | 0.9249 | 0.9238 | 0.9180 | 0.9451 | 0.9497 | 0.954 | 0.9532 | 0.9553
CC | 6 | 0.9210 | 0.9217 | 0.9198 | 0.9177 | 0.9376 | 0.9412 | 0.9472 | 0.9478 | 0.9477
CC | Avg | 0.8653 | 0.8712 | 0.8720 | 0.8625 | 0.9194 | 0.9239 | 0.9302 | 0.9311 | 0.9328
ERGAS | ALL | 1.3146 | 1.2666 | 1.2612 | 1.3488 | 1.0298 | 1.0047 | 0.9697 | 0.9931 | 0.9675
SAM | ALL | 11.1256 | 10.9104 | 10.8326 | 11.4756 | 8.8898 | 8.6679 | 8.2846 | 8.2494 | 8.1507
Bold text highlights the best-performing metrics.
Table 3. Quantitative assessment results for the LGC dataset with different methods.

Metric | Band | STARFM | FSDAF | Fit-FC | EDCSTFN | GAN-STFM | MLFF-GAN | STF-Trans | CTSTFM | DSEPGAN
RMSE | 1 | 0.0143 | 0.0149 | 0.0140 | 0.0151 | 0.0146 | 0.0161 | 0.0147 | 0.0167 | 0.0148
RMSE | 2 | 0.0200 | 0.0207 | 0.0201 | 0.0200 | 0.0207 | 0.0223 | 0.0197 | 0.0234 | 0.0212
RMSE | 3 | 0.0251 | 0.0258 | 0.0251 | 0.0257 | 0.0264 | 0.0269 | 0.0249 | 0.0320 | 0.0258
RMSE | 4 | 0.0376 | 0.0397 | 0.0385 | 0.0394 | 0.041 | 0.0400 | 0.0357 | 0.0532 | 0.0366
RMSE | 5 | 0.0568 | 0.0621 | 0.0565 | 0.0590 | 0.054 | 0.0533 | 0.0583 | 0.0660 | 0.0514
RMSE | 6 | 0.0455 | 0.0515 | 0.0446 | 0.0407 | 0.0399 | 0.0404 | 0.0432 | 0.0476 | 0.0374
RMSE | Avg | 0.0332 | 0.0358 | 0.0331 | 0.0333 | 0.0328 | 0.0332 | 0.0328 | 0.0398 | 0.0312
SSIM | 1 | 0.9132 | 0.9125 | 0.9233 | 0.9228 | 0.9185 | 0.9059 | 0.9167 | 0.9166 | 0.9171
SSIM | 2 | 0.8730 | 0.8709 | 0.8800 | 0.8897 | 0.8801 | 0.8709 | 0.8857 | 0.8843 | 0.8832
SSIM | 3 | 0.8350 | 0.8331 | 0.8438 | 0.8455 | 0.8413 | 0.8356 | 0.8480 | 0.8391 | 0.8492
SSIM | 4 | 0.7292 | 0.7294 | 0.7405 | 0.7083 | 0.7037 | 0.7239 | 0.7291 | 0.6910 | 0.7481
SSIM | 5 | 0.5697 | 0.5220 | 0.5532 | 0.5513 | 0.5696 | 0.5948 | 0.5663 | 0.5766 | 0.6168
SSIM | 6 | 0.6408 | 0.5754 | 0.6274 | 0.6267 | 0.6438 | 0.6533 | 0.6237 | 0.6473 | 0.6739
SSIM | Avg | 0.7601 | 0.7405 | 0.7614 | 0.7574 | 0.7595 | 0.7641 | 0.7616 | 0.7591 | 0.7814
UIQI | 1 | 0.7152 | 0.7062 | 0.7124 | 0.6132 | 0.6844 | 0.6839 | 0.7086 | 0.5813 | 0.7361
UIQI | 2 | 0.7019 | 0.6890 | 0.6943 | 0.6560 | 0.6807 | 0.7125 | 0.7207 | 0.5637 | 0.7305
UIQI | 3 | 0.7072 | 0.6965 | 0.7007 | 0.6573 | 0.696 | 0.7179 | 0.7327 | 0.5366 | 0.7349
UIQI | 4 | 0.7857 | 0.7794 | 0.7827 | 0.7431 | 0.766 | 0.7826 | 0.8112 | 0.6991 | 0.8191
UIQI | 5 | 0.7531 | 0.7198 | 0.7313 | 0.6645 | 0.7659 | 0.7940 | 0.7571 | 0.7436 | 0.8171
UIQI | 6 | 0.7205 | 0.6531 | 0.7021 | 0.7316 | 0.7798 | 0.7918 | 0.7670 | 0.7713 | 0.8183
UIQI | Avg | 0.7306 | 0.7073 | 0.7206 | 0.6776 | 0.7288 | 0.7471 | 0.7495 | 0.6493 | 0.7760
CC | 1 | 0.7158 | 0.7076 | 0.7160 | 0.6582 | 0.6878 | 0.6870 | 0.7151 | 0.6073 | 0.7396
CC | 2 | 0.7071 | 0.6916 | 0.7007 | 0.6958 | 0.6888 | 0.7169 | 0.7250 | 0.5902 | 0.7335
CC | 3 | 0.7130 | 0.7004 | 0.7086 | 0.6834 | 0.6994 | 0.7192 | 0.7352 | 0.5469 | 0.7353
CC | 4 | 0.8075 | 0.8011 | 0.8098 | 0.7832 | 0.769 | 0.7871 | 0.8265 | 0.7243 | 0.8245
CC | 5 | 0.7909 | 0.7666 | 0.7832 | 0.7695 | 0.7864 | 0.7993 | 0.8111 | 0.7750 | 0.8228
CC | 6 | 0.7873 | 0.7473 | 0.7786 | 0.7871 | 0.7883 | 0.7977 | 0.8070 | 0.7949 | 0.8208
CC | Avg | 0.7536 | 0.7358 | 0.7495 | 0.7295 | 0.7366 | 0.7512 | 0.7700 | 0.6731 | 0.7794
ERGAS | ALL | 2.0655 | 2.2230 | 2.0322 | 2.0245 | 1.9837 | 2.0300 | 2.0232 | 2.3816 | 1.9096
SAM | ALL | 16.2826 | 17.0293 | 16.2303 | 16.8170 | 16.7738 | 16.5577 | 15.9806 | 18.0406 | 15.7259
Bold text highlights the best-performing metrics.
Table 4. Quantitative assessment results for the Tianjin dataset with different methods.

Metric | Band | STARFM | FSDAF | Fit-FC | EDCSTFN | GAN-STFM | MLFF-GAN | STF-Trans | CTSTFM | DSEPGAN
RMSE | 1 | 0.0359 | 0.0071 | 0.0083 | 0.0119 | 0.0064 | 0.0060 | 0.0073 | 0.0080 | 0.0055
RMSE | 2 | 0.0175 | 0.0139 | 0.0082 | 0.0161 | 0.0079 | 0.0076 | 0.0093 | 0.0090 | 0.0068
RMSE | 3 | 0.0314 | 0.0074 | 0.0081 | 0.0148 | 0.0071 | 0.0080 | 0.0086 | 0.0093 | 0.0064
RMSE | 4 | 0.0215 | 0.0145 | 0.0155 | 0.0163 | 0.0159 | 0.0160 | 0.0175 | 0.0203 | 0.0149
RMSE | Avg | 0.0266 | 0.0107 | 0.0100 | 0.0148 | 0.0093 | 0.0094 | 0.0107 | 0.0117 | 0.0084
SSIM | 1 | 0.7382 | 0.9422 | 0.9319 | 0.8439 | 0.9605 | 0.9592 | 0.9469 | 0.9294 | 0.9631
SSIM | 2 | 0.8550 | 0.8850 | 0.9465 | 0.8424 | 0.9530 | 0.9487 | 0.9365 | 0.9381 | 0.9608
SSIM | 3 | 0.7943 | 0.9485 | 0.9436 | 0.8435 | 0.9531 | 0.9454 | 0.9332 | 0.9198 | 0.9625
SSIM | 4 | 0.8437 | 0.8927 | 0.8820 | 0.8708 | 0.8807 | 0.8853 | 0.8427 | 0.8188 | 0.8922
SSIM | Avg | 0.8078 | 0.9171 | 0.9260 | 0.8502 | 0.9368 | 0.9347 | 0.9148 | 0.9015 | 0.9447
UIQI | 1 | 0.0870 | 0.8394 | 0.6238 | 0.7275 | 0.7756 | 0.8418 | 0.6868 | 0.6393 | 0.8706
UIQI | 2 | 0.5866 | 0.7412 | 0.7293 | 0.6288 | 0.6974 | 0.7838 | 0.6234 | 0.6035 | 0.8286
UIQI | 3 | 0.1253 | 0.8300 | 0.7397 | 0.6529 | 0.7791 | 0.8037 | 0.6806 | 0.6136 | 0.8600
UIQI | 4 | 0.6177 | 0.8038 | 0.7628 | 0.6778 | 0.7563 | 0.7673 | 0.6588 | 0.5704 | 0.7986
UIQI | Avg | 0.3541 | 0.8036 | 0.7139 | 0.6718 | 0.7521 | 0.7992 | 0.6624 | 0.6067 | 0.8395
CC | 1 | 0.1660 | 0.8581 | 0.7842 | 0.8467 | 0.8510 | 0.8466 | 0.7513 | 0.7245 | 0.8775
CC | 2 | 0.6835 | 0.8131 | 0.7542 | 0.7782 | 0.8205 | 0.7893 | 0.7313 | 0.6960 | 0.8378
CC | 3 | 0.1841 | 0.8331 | 0.7889 | 0.7786 | 0.8373 | 0.8114 | 0.7308 | 0.6747 | 0.8626
CC | 4 | 0.6210 | 0.8054 | 0.7654 | 0.7129 | 0.7613 | 0.7681 | 0.675 | 0.5975 | 0.7989
CC | Avg | 0.4136 | 0.8274 | 0.7732 | 0.7791 | 0.8175 | 0.8038 | 0.7221 | 0.6732 | 0.8442
ERGAS | ALL | 7.9286 | 2.4543 | 2.1782 | 3.5716 | 1.8916 | 1.8925 | 2.1988 | 2.3437 | 1.3594
SAM | ALL | 34.8451 | 14.6957 | 16.5414 | 18.0913 | 15.0764 | 15.2694 | 17.8568 | 19.2186 | 13.8698
Bold text highlights the best-performing metrics.
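For reference, the per-band metrics reported in Tables 2–4 follow their standard definitions. The NumPy sketch below illustrates RMSE, ERGAS, and SAM under their usual formulations; the resolution ratio, array shapes, and test data are placeholders rather than values taken from this paper.

```python
import numpy as np


def rmse_per_band(pred: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Root-mean-square error for each band; arrays have shape (bands, H, W)."""
    return np.sqrt(((pred - ref) ** 2).mean(axis=(1, 2)))


def ergas(pred: np.ndarray, ref: np.ndarray, ratio: float = 1.0 / 16.0) -> float:
    """Relative dimensionless global error (ERGAS).

    `ratio` is the fine-to-coarse resolution ratio; the default here is only
    a placeholder (e.g., roughly 30 m / 480 m for Landsat- vs. MODIS-like inputs).
    """
    rmse_b = rmse_per_band(pred, ref)
    mean_b = ref.mean(axis=(1, 2))
    return float(100.0 * ratio * np.sqrt(np.mean((rmse_b / mean_b) ** 2)))


def sam(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean spectral angle mapper (degrees) over all pixels."""
    p = pred.reshape(pred.shape[0], -1)
    r = ref.reshape(ref.shape[0], -1)
    cos = (p * r).sum(axis=0) / (
        np.linalg.norm(p, axis=0) * np.linalg.norm(r, axis=0) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())


if __name__ == "__main__":
    ref = np.random.rand(6, 256, 256).astype(np.float32)      # reference fine image
    pred = ref + 0.01 * np.random.randn(6, 256, 256).astype(np.float32)  # fused image
    print(rmse_per_band(pred, ref).mean(), ergas(pred, ref), sam(pred, ref))
```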
Table 5. Comparison of model efficiency and computational cost.

Model | Parameters (M) | FLOPs | Batch-Time (s) | Test-Time (s)
STARFM | \ | \ | \ | 541.13
FSDAF | \ | \ | \ | 1262.19
Fit-FC | \ | \ | \ | 6544.50
EDCSTFN | 0.28 | 1.49 × 10^11 | 0.16 | 8.55
GAN-STFM (Generator) | 0.58 | 3.02 × 10^11 | 0.77 | 11.24
GAN-STFM (Discriminator) | 3.67 | 8.24 × 10^7 | — | —
MLFF-GAN (Generator) | 5.93 | 1.09 × 10^11 | 0.35 | 17.30
MLFF-GAN (Discriminator) | 2.78 | 3.02 × 10^10 | — | —
STF-Trans | 6.01 | 4.41 × 10^11 | 0.28 | 7.06
CTSTFM | 6.30 | 3.06 × 10^12 | 1.36 | 15.74
DSEPGAN (Generator) | 5.40 | 2.33 × 10^11 | 0.68 | 15.10
DSEPGAN (Discriminator) | 2.78 | 3.02 × 10^10 | — | —
Table 6. Quantitative results and model efficiency of different methods.

Model | CIA RMSE | CIA SSIM | CIA ERGAS | LGC RMSE | LGC SSIM | LGC ERGAS | Parameters | FLOPs (G) | Batch-Time (s)
DSEPGAN-Diff | 0.0302 | 0.8435 | 1.0669 | 0.0313 | 0.7775 | 1.9279 | 5,179,218 | 201.5 | 0.6403
DSPGAN | 0.0275 | 0.8519 | 0.9730 | 0.0315 | 0.7754 | 1.9169 | 4,629,138 | 189.5 | 0.4657
DSEPGAN-Conv | 0.0289 | 0.8519 | 1.0145 | 0.0328 | 0.7768 | 2.0348 | 4,536,786 | 189.7 | 0.4782
DSEPGAN-Trans | 0.0278 | 0.8475 | 0.9992 | 0.0334 | 0.7722 | 2.0900 | 10,848,306 | 186.7 | 0.5779
DSEPGAN w/o ConvFFN | 0.0281 | 0.8504 | 1.0010 | 0.0323 | 0.7806 | 1.9838 | 4,712,322 | 201.3 | 0.6146
DSEPGAN | 0.0269 | 0.8557 | 0.9675 | 0.0312 | 0.7814 | 1.9096 | 5,396,946 | 232.8 | 0.6825
Bold text highlights the best-performing metrics.
Table 7. Impact of different kernel sizes on model performance.

Kernel Size | RMSE | SSIM | ERGAS | Parameters | FLOPs (G) | Batch-Time (s)
(3, 3, 3, 3) | 0.0272 | 0.8528 | 0.9687 | 5,312,466 | 225.5 | 0.6306
(7, 7, 7, 7) | 0.0279 | 0.8542 | 0.9971 | 5,333,586 | 227.3 | 0.6463
(9, 9, 9, 9) | 0.0270 | 0.8546 | 0.9844 | 5,350,482 | 228.8 | 0.6528
(13, 13, 13, 13) | 0.0269 | 0.8557 | 0.9675 | 5,396,946 | 232.8 | 0.6762
(3, 7, 9, 13) | 0.0271 | 0.8525 | 0.9726 | 5,334,738 | 230.7 | 0.6663
(3, 7, 15, 31) | 0.0272 | 0.8543 | 0.9615 | 5,386,578 | 252.4 | 0.8043
Bold indicates the best, and underline indicates the second-best.
Table 8. Impact of the number of invertible basic units on model performance.

N | RMSE | SSIM | ERGAS | Parameters | FLOPs (G) | Batch-Time (s)
1 | 0.0285 | 0.8491 | 1.0037 | 4,685,394 | 198.9 | 0.4361
2 | 0.0278 | 0.8508 | 0.9879 | 5,041,170 | 215.9 | 0.5220
3 | 0.0269 | 0.8557 | 0.9675 | 5,396,946 | 232.8 | 0.6762
4 | 0.0269 | 0.8557 | 0.9642 | 5,752,722 | 249.7 | 0.7773
Bold indicates the best, and underline indicates the second-best.
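The invertible basic units varied in Table 8 can be understood through a standard affine-coupling formulation, in which one half of the feature channels modulates the other half and the transform can be inverted exactly, so no detail is lost. The PyTorch sketch below is a generic example of such a reversible unit; the sub-network, channel split, and sizes are assumptions for illustration, not the exact DSEPGAN design.

```python
import torch
import torch.nn as nn


class InvertibleUnit(nn.Module):
    """Illustrative invertible (affine-coupling) basic unit.

    Generic sketch of a reversible transformation of the kind ablated in
    Table 8; the sub-network and channel split are placeholders.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "channel count must be even for the split"
        half = channels // 2
        # Small sub-network predicting per-pixel scale and shift from one half.
        self.net = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(half, channels, 3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(torch.tanh(log_s)) + t   # affine coupling
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(log_s))  # exact inversion
        return torch.cat([y1, x2], dim=1)


if __name__ == "__main__":
    unit = InvertibleUnit(channels=64)
    x = torch.randn(1, 64, 64, 64)
    # Reconstruction is exact up to floating-point error.
    print(torch.allclose(unit.inverse(unit(x)), x, atol=1e-5))  # True
```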
Table 9. Impact of the number of stages on model performance.

Stage | RMSE | SSIM | ERGAS | Parameters | FLOPs (G) | Batch-Time (s)
3 | 0.0286 | 0.8475 | 1.0370 | 1,329,714 | 160.7 | 0.5989
4 | 0.0269 | 0.8557 | 0.9675 | 5,396,946 | 232.8 | 0.6762
5 | 0.0256 | 0.8632 | 0.9268 | 21,512,466 | 304.4 | 0.7321
Bold text highlights the best-performing metrics.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
