1. Introduction
Remote sensing has been extensively used in a wide range of Earth observation tasks, including continuous crop growth monitoring for agricultural assessment [1,2,3], analysis of the water environment [4,5], and long-term ecosystem evaluations such as forest cover change [6,7,8] and desertification monitoring [9,10,11]. Dynamic monitoring of the ground surface requires remote sensing data with high spatial and temporal resolutions. Current mainstream optical remote sensing data can be categorized into hyperspectral, multispectral, and panchromatic images. Hyperspectral images typically consist of hundreds of contiguous narrow spectral bands (e.g., 100–200 bands within the 400–2500 nm range), providing high spectral resolution but relatively low spatial and temporal resolutions. Panchromatic images consist of a single broad spectral band and produce grayscale imagery with the highest spatial resolution, but they lack rich spectral information. Multispectral images typically include 4–20 relatively broad bands, strike a balance with moderate spatial and temporal resolutions, and are widely used in various applications. However, existing multispectral satellite systems face a fundamental trade-off: relatively high-spatial-resolution satellites (e.g., Landsat) are limited by a narrow swath width and long revisit period, resulting in inadequate temporal continuity, whereas high-frequency observing systems (e.g., MODIS) are constrained by a low spatial resolution, which makes it difficult to capture detailed features of the ground surface. To address this limitation, multi-source remote sensing spatiotemporal fusion techniques have emerged as a promising solution. By integrating complementary information from multispectral imagery acquired by different sensors, spatiotemporal fusion generates high-quality images with both fine spatial and temporal resolutions [12].
Spatiotemporal fusion methods generate fused images by extracting spatial details and capturing temporal variations. The fundamental principle is to establish a mapping between temporal changes and spatial structures through the synergistic integration of multispectral high-spatial–low-temporal-resolution images (fine-resolution images) and multispectral low-spatial–high-temporal-resolution images (coarse-resolution images). Existing spatiotemporal fusion approaches are broadly categorized into two groups: traditional model-driven methods and data-driven deep learning methods [13]. Traditional model-driven approaches are further divided into three main types: weight function-based, unmixing-based, and sparse representation-based methods. Weight function-based methods model coarse-to-fine image relationships using spatiotemporal neighborhood similarity and weighting functions [14]. For example, the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) [15] assumes spectral purity of coarse pixels and transfers reflectance changes via weighting to predict the target fine image from a prior one. While flexible and physically interpretable, these methods often fail in areas of abrupt land surface change. Unmixing-based methods [16,17] apply linear spectral mixture theory, decomposing coarse pixels into endmembers (pure land cover spectra) and their abundances (fractional cover), and reconstruct fused images by integrating spatiotemporal information from the fine images. The Flexible Spatiotemporal Data Fusion Algorithm (FSDAF) [18] combines the weight function and unmixing concepts: it estimates spectral variations of homogeneous regions, interpolates spatial changes, and fuses spectral and spatial features to generate fine images. Unmixing methods offer fine pixel decomposition and handle local changes well but struggle with subtle land cover transitions. Sparse representation-based methods [19] decompose images into sparse dynamic and low-rank static components. Within a sparse coding framework, they jointly model the temporal evolution of the coarse images and the global structure of the fine image, suppressing noise while enhancing spatiotemporal consistency and preserving fine details. The Error-Bound-Regularized Sparse Coding Dictionary Learning (EBSCDL) model [20] employs error-bound regularization to constrain dictionary perturbations and block-sparse constraints to better model local structural correlations. However, these methods' strong reliance on local sparsity limits their ability to model cross-scale dependencies and complex nonlinear dynamics, consequently restricting their preservation of global consistency and representation of sudden changes.
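To make the weight function idea concrete, the following minimal Python sketch illustrates a STARFM-style prediction in a highly simplified form: the temporal change observed at coarse resolution is added to the fine reference image, and candidate values within a moving window are averaged with inverse-distance weights. This is a toy illustration under strong assumptions (a single band, inverse-distance weights only, no spectral or temporal similarity terms) and is not the full STARFM algorithm; all names and defaults are hypothetical.

```python
import numpy as np

def starfm_like_predict(fine_t1, coarse_t1, coarse_t2, win=5):
    """Toy weight-function fusion for a single band (illustrative only).

    fine_t1   : fine-resolution image at the reference date
    coarse_t1 : coarse image at the reference date, resampled to the fine grid
    coarse_t2 : coarse image at the prediction date, resampled to the fine grid
    """
    h, w = fine_t1.shape
    pad = win // 2
    # Temporal change observed at coarse resolution.
    delta = coarse_t2 - coarse_t1
    d_pad = np.pad(delta, pad, mode="reflect")
    f_pad = np.pad(fine_t1, pad, mode="reflect")
    # Inverse-distance weights within the moving window (normalized to sum to 1).
    yy, xx = np.mgrid[-pad:pad + 1, -pad:pad + 1]
    w_dist = 1.0 / (1.0 + np.hypot(yy, xx))
    w_dist /= w_dist.sum()
    pred = np.empty_like(fine_t1, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            # Candidate prediction: fine reference plus coarse temporal change.
            cand = f_pad[i:i + win, j:j + win] + d_pad[i:i + win, j:j + win]
            pred[i, j] = np.sum(w_dist * cand)
    return pred
```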
Compared with traditional model-driven methods, deep learning-based methods can capture the complex spatiotemporal relationships implicit in large-scale remote sensing data. As a result, deep learning has been extensively applied to spatiotemporal fusion tasks in recent years [21]. The Generative Adversarial Network (GAN) [22], as a revolutionary generative model, has been widely used in image generation, super-resolution, and other fields [23]. Consequently, GANs have also been introduced into remote sensing spatiotemporal fusion tasks [24,25,26]. The GAN-STFM method [27] removes the temporal constraints on reference image selection, significantly improving the flexibility of the fusion process; however, this increased flexibility may compromise the accuracy of the fused images. The PSTAF-GAN method [28] designs a flexible multi-scale feature extraction framework to capture hierarchical features and adopts a progressive fusion strategy to enhance fusion accuracy. The MLFF-GAN method [29], based on the U-Net architecture, adopts multi-level feature fusion to improve fusion accuracy in regions undergoing change. The HPLTS-GAN method [30] is designed to enhance performance in temporally insensitive tasks by minimizing reliance on temporal information while preserving prediction accuracy; this approach effectively improves the spatiotemporal consistency of the fused images and substantially enhances overall performance.
The convolutional neural network (CNN), known for its powerful feature extraction capabilities, has become one of the most prominent approaches in multi-source remote sensing image spatiotemporal fusion [31,32,33,34]. The Enhanced Deep Convolutional Spatiotemporal Fusion Network (EDCSTFN) [35] uses multi-receptive-field convolutional layers to extract multi-scale spatial features: deeper layers capture abstract semantic information, while shallower layers preserve high-frequency details, improving the modeling of complex land cover. However, despite its strong spatiotemporal performance, it struggles to capture subtle long-term temporal variations. The MLKNet method [36] introduces a multi-level knowledge modeling mechanism to fully leverage the complementarity of hierarchical features, such as shallow structural information and deep semantic representations, thereby enhancing the network's capability to model complex scenes. The CIG-STF method [37] effectively integrates change detection with spatiotemporal fusion, substantially improving fusion accuracy for abrupt land cover changes (such as floods and landslides) and thereby enhancing the model's practicality.
CNNs are effective at extracting local image features but struggle to model long-range dependencies. In contrast, the transformer leverages self-attention to model global dependencies, making it well suited for tasks involving long-range context understanding and cross-modal learning. These advantages have contributed to the widespread adoption of the Vision Transformer (ViT) [38] in computer vision. ViT employs self-attention to capture global relationships between image patches, enabling robust global feature representation, especially when trained on large-scale datasets. Several ViT-based approaches have been proposed for spatiotemporal fusion in remote sensing [39,40,41]. For example, STINet [42] fuses multi-scale spatiotemporal features to capture variations across land cover types, but it may introduce local texture distortions. STM-STFNet [43] integrates the Swin Transformer's global context modeling with multi-dimensional attention to jointly predict images in both the spatial and temporal domains; this design improves accuracy under complex surface changes, such as land cover transitions. SwinSTFM [44] combines pixel-level attention with spectral mixture theory to enhance fusion performance. However, like other transformer-based approaches, it suffers from high computational complexity.
In summary, while many deep learning-based spatiotemporal fusion methods have achieved promising performance, several limitations remain:
(1) In spatiotemporal fusion tasks, the significant resolution gap between coarse- and fine-resolution images poses a major challenge for reconstructing high-quality texture details.
(2) Existing deep learning-based spatiotemporal fusion methods often emphasize spatial details while neglecting spectral information, resulting in fused images with high spatial fidelity but significant spectral distortion.
(3) Most existing end-to-end deep learning-based spatiotemporal fusion methods rely on relatively complex neural network architectures, which often lead to high computational complexity; the massive data volumes of remote sensing images further intensify this burden.
Although existing deep learning-based methods have achieved impressive results, no comprehensive solution has been proposed to address all three issues simultaneously. To address the aforementioned limitations, this article proposes a deep learning-based model, the Sparse Fast Transformer fusion method based on Generative Adversarial Network (SFT-GAN). Compared with existing deep learning-based multi-source remote sensing spatiotemporal fusion methods, the proposed SFT-GAN offers the following contributions:
To address the first limitation, SFT-GAN adopts a multi-level pyramid architecture and designs a flexible channel attention fusion mechanism to adaptively fuse spatial detail features and temporal variation features, enhancing informative channels while suppressing irrelevant noise. In addition, a Detail Compensation Module (DCM) is introduced to fully leverage spatial prior information from the reference image. The DCM applies the Butterworth filter to decompose the image into high- and low-frequency components at multiple scales, enhancing the high-frequency details to improve texture representation.
To address the second limitation, a Spectrum Compensation Module (SCM) is designed to leverage spectral prior information from reference images. Specifically, SCM analyzes inter-band correlations in coarse-resolution images to extract intrinsic spectral patterns, which are used to guide the reconstruction of fine-resolution images, thereby enhancing the spectral fidelity of the fused image.
To address the third limitation, this article proposes the Sparse Transformer Module, which optimizes the transformer using a KL divergence-based sparsity strategy, significantly reducing the model's computational complexity and memory consumption. Under the same training conditions, the proposed method can therefore process larger-scale datasets, improving overall efficiency and practical applicability (an illustrative sketch of such a sparsity strategy is given after this list).
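To make the sparsity idea concrete, the following minimal PyTorch sketch shows one plausible KL divergence-based query selection scheme. It is an illustrative assumption rather than the released SFT-GAN code: the function name, the keep_ratio parameter, and the uniform-attention fallback for non-selected queries are hypothetical. Each query's attention distribution is scored by its KL divergence from the uniform distribution, and full attention is computed only for the most informative queries.

```python
import math
import torch

def kl_sparse_attention(q, k, v, keep_ratio=0.25):
    """Illustrative KL divergence-based sparse attention (not the released code).

    q, k, v    : tensors of shape (batch, heads, length, dim)
    keep_ratio : fraction of queries for which full attention is computed
    """
    b, h, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (b, h, n, n)
    probs = scores.softmax(dim=-1)
    # KL divergence of each query's attention distribution from the uniform one:
    # KL(p || U) = sum p*log(p) + log(n); large values mark "peaked" queries.
    kl = (probs * probs.clamp_min(1e-9).log()).sum(-1) + math.log(n)
    u = max(1, int(keep_ratio * n))
    top = kl.topk(u, dim=-1).indices                        # (b, h, u)
    # Non-selected queries fall back to the mean of the values (uniform attention).
    out = v.mean(dim=-2, keepdim=True).expand(b, h, n, d).clone()
    attn_top = probs.gather(2, top.unsqueeze(-1).expand(b, h, u, n))
    out.scatter_(2, top.unsqueeze(-1).expand(b, h, u, d), attn_top @ v)
    return out
    # Note: a practical implementation estimates the KL measure from a sampled
    # subset of keys so the full score matrix is never formed; the dense version
    # above is kept only for clarity.
```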
The remaining contents are organized as follows.
Section 2 presents the overall architecture of the proposed SFT-GAN model.
Section 3 validates the effectiveness of the proposed method through both comparative and ablation experiments.
Section 4 discusses the proposed method's performance and advantages.
Section 5 concludes the article and outlines potential directions for future work. Additionally, the code is released at https://github.com/MaZhaoX/SFT-GAN (accessed on 2 July 2025).
3. Experiments and Results
3.1. Study Areas and Datasets
The experiments use publicly available datasets from two locations: the Lower Gwydir Catchment (LGC) and the Coleambally Irrigation Area (CIA) [51]. The LGC study area is located in northern New South Wales (NSW) and contains 14 cloud-free Landsat–MODIS (L–M) image pairs acquired between April 2004 and April 2005. Both the Landsat and MODIS images are resampled to 2720 × 3200 pixels with a spatial resolution of 25 m, and each image contains six spectral bands. A flood occurred in this area in December 2004, making it a dynamic site for testing temporal robustness and enabling a more effective evaluation of predictive performance under dynamic land cover conditions. The CIA study area is located in southern NSW and consists of 17 cloud-free L–M image pairs from 2001 to 2002, resampled to 2040 × 1720 pixels for spatiotemporal fusion. The agricultural and forest areas surrounding the CIA region exhibit considerable temporal variability despite minimal changes in land cover types; consequently, although land cover remains relatively stable, the CIA dataset displays notable temporal variation.
As the LGC and CIA datasets primarily cover plains and farmland, the temporal gaps between adjacent image pairs are typically a few days and phenological changes are relatively mild. To further assess the generalization and applicability of the methods, additional experiments were conducted using the AHB and Tianjin datasets [52]. The AHB dataset covers a study area in Ar Horqin Banner, northeastern China, where agriculture and animal husbandry are the dominant industries and the landscape is characterized by numerous circular pastures and farmlands. It contains 27 cloud-free L–M image pairs acquired between May 2013 and December 2018, each with a resolution of 2480 × 2800 pixels. Due to vegetation growth, the area exhibits significant phenological variation over time. The Tianjin dataset covers an urban study area in Tianjin, a major city in northern China characterized by pronounced seasonal variation. It includes 27 cloud-free L–M image pairs collected from September 2013 to September 2019, each with a resolution of 2100 × 1970 pixels. As an urban dataset, it serves as a benchmark for evaluating the effectiveness of spatiotemporal fusion methods in capturing urban phenological dynamics. Compared with the LGC and CIA datasets, the AHB and Tianjin datasets have considerably longer temporal intervals between adjacent image pairs, typically spanning several months, along with more pronounced spectral variations in land surface features. These characteristics introduce greater challenges for spatiotemporal fusion, providing a more rigorous test of model performance.
3.2. Experimental Design and Evaluation
The overall experimental design is divided into three parts. First, SFT-GAN is compared with two traditional model-driven approaches (STARFM [15] and FSDAF [18]) and four deep learning-based methods (EDCSTFN [35], GAN-STFM [27], MLFF-GAN [29], and STM-STFNet [43]) to evaluate the proposed method's effectiveness. Furthermore, a classification experiment based on the fused images is conducted to evaluate the quality and practical utility of the images generated by each method. Second, the number of trainable parameters and the computational complexity of the network are analyzed and discussed. Finally, an ablation study is conducted to verify the contribution of each component within the SFT-GAN architecture.
The quality of the fused images is evaluated with six metrics: Root Mean Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [53], Spectral Angle Mapper (SAM) [54], Structural Similarity Index Measure (SSIM) [55], and Universal Image Quality Index (UIQI) [56]. RMSE quantifies fusion error, with lower values indicating better performance. PSNR focuses on pixel-level differences and is widely used for image quality assessment; higher values correspond to better image quality. ERGAS is a relative, dimensionless global error metric for assessing the quality of synthesized remote sensing images; lower values indicate better quality. SAM measures spectral distortion between generated and real images; lower values indicate greater spectral similarity. SSIM assesses perceptual quality, with higher values indicating better structural and visual consistency. UIQI evaluates the overall similarity between the generated and real images; higher values denote better agreement. In addition to quantitative metrics, visual inspection is conducted using standard false-color composites (NIR–Red–Green) to synthesize color images. Furthermore, absolute average residual maps are used to visualize pixel-wise differences between the generated and real images, enabling direct comparison across methods.
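For reference, the following Python sketch shows how four of these metrics can be computed for a fused image. It is illustrative only: the per-band averaging used for PSNR, the reflectance range, and the 25 m/500 m scale ratio in ERGAS are assumptions, and SSIM and UIQI are omitted (e.g., scikit-image provides an SSIM implementation).

```python
import numpy as np

def fusion_metrics(pred, ref, scale_ratio=25 / 500, max_val=1.0):
    """Quality metrics (RMSE, PSNR, SAM, ERGAS) for arrays shaped (bands, H, W).

    scale_ratio : ratio of fine to coarse pixel size (e.g., 25 m Landsat / 500 m MODIS)
    max_val     : maximum possible reflectance value, used by PSNR
    """
    pred = pred.astype(np.float64)
    ref = ref.astype(np.float64)
    rmse_b = np.sqrt(((pred - ref) ** 2).mean(axis=(1, 2)))   # per-band RMSE
    rmse = rmse_b.mean()
    psnr = 20 * np.log10(max_val / rmse)
    # SAM: mean angle (degrees) between predicted and reference spectra per pixel.
    dot = (pred * ref).sum(axis=0)
    denom = np.linalg.norm(pred, axis=0) * np.linalg.norm(ref, axis=0) + 1e-12
    sam = np.degrees(np.arccos(np.clip(dot / denom, -1.0, 1.0))).mean()
    # ERGAS: dimensionless global error, normalized by band means and the scale ratio.
    ergas = 100 * scale_ratio * np.sqrt(np.mean((rmse_b / ref.mean(axis=(1, 2))) ** 2))
    return {"RMSE": rmse, "PSNR": psnr, "SAM": sam, "ERGAS": ergas}
```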
To ensure a fair comparison, traditional model-driven methods use default parameter settings. For deep learning-based methods, input images are divided into patches of size 256 × 256 with a stride of 128. The learning rates and other hyperparameters for EDCSTFN, GAN-STFM, MLFF-GAN, and STM-STFNet follow the settings specified in their original implementations. For SFT-GAN, the initial learning rate is set to and decayed by 20% every 10 epochs.
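The patch extraction and learning-rate schedule described above can be reproduced with a few lines of PyTorch. The sketch below is illustrative: the stand-in model and the initial learning rate are placeholders (the exact value is not stated here), while the 256 × 256 patch size, stride of 128, and 20% decay every 10 epochs follow the text.

```python
import torch

def extract_patches(img, patch=256, stride=128):
    """Split a (bands, H, W) tensor into overlapping training patches."""
    return (img.unfold(1, patch, stride)       # (bands, nH, W, patch)
               .unfold(2, patch, stride)       # (bands, nH, nW, patch, patch)
               .permute(1, 2, 0, 3, 4)         # (nH, nW, bands, patch, patch)
               .reshape(-1, img.shape[0], patch, patch))

# Learning-rate schedule: decay by 20% every 10 epochs.
model = torch.nn.Conv2d(6, 6, 3, padding=1)                # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial LR is a placeholder
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)
```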
3.3. Experimental Result and Analysis
3.3.1. CIA Dataset Result
As shown in Table 1, the proposed method consistently achieves either the best or second-best performance across most evaluation metrics, with particularly strong results in SAM, ERGAS, and SSIM. These results demonstrate the method's strong capability in reconstructing both spectral and textural information in the fused images. As illustrated in Figure 7, the fusion results of STARFM are nearly unusable, whereas FSDAF preserves some texture detail but suffers from low prediction accuracy. In contrast, although the other deep learning-based methods achieve higher prediction accuracy, they lose varying degrees of texture detail. The images generated by SFT-GAN exhibit the lowest error relative to the reference images while preserving abundant texture detail. Compared with STM-STFNet, another transformer-based method, the proposed method better preserves local textures, further validating its effectiveness. Although MLFF-GAN generates visually appealing results, its quantitative performance is relatively poor; detailed analysis reveals that this is primarily due to pixel misalignment in the fused images, as clearly observed in the zoomed-in patches in Figure 8. These findings collectively demonstrate the accuracy and robustness of the proposed method in phenology-driven spatiotemporal fusion.
3.3.2. LGC Dataset Result
Figure 9 shows the fusion results on the LGC dataset, where the traditional methods exhibit noticeable spatial structure distortions. As shown in Table 2, the proposed method demonstrates superior overall performance, achieving competitive results across most evaluation metrics, with particularly notable improvements in SAM and SSIM, significantly outperforming the traditional methods. Compared with the CIA dataset, all methods achieve significantly better performance on the LGC dataset. In terms of spectral and structural fidelity, the proposed method achieves significantly better SAM and SSIM scores than MLFF-GAN and STM-STFNet, indicating its superior ability to preserve spectral characteristics and spatial details. Although STM-STFNet performs well overall, particularly in RMSE, this advantage is mainly due to its use of a larger number of reference images. As illustrated in Figure 10, the fusion results generated by STARFM and FSDAF exhibit severe spectral distortion. Although the deep learning-based spatiotemporal fusion methods alleviate this issue to some extent, varying degrees of spectral distortion remain. SFT-GAN not only preserves spatial details but also substantially reduces spectral distortion. In summary, the proposed method maintains robust performance under significant land cover changes, demonstrating excellent fusion capability and strong generalizability.
3.3.3. AHB Dataset Result
As shown by the evaluation metrics in Table 3, the fusion performance of all methods declined on the AHB dataset compared with the previous two datasets, with the traditional methods exhibiting the most significant degradation. Nevertheless, SFT-GAN consistently achieved the best performance across all evaluation metrics. As illustrated in Figure 11, the fusion results from EDCSTFN, MLFF-GAN, and STM-STFNet were significantly affected by noise, with MLFF-GAN and STM-STFNet exhibiting particularly severe spectral distortions. Moreover, none of the compared methods accurately captured the spatiotemporal dynamics of river features. In the magnified views presented in Figure 12, both EDCSTFN and GAN-STFM failed to preserve the structural integrity of the circular farmlands. The fusion results generated by the proposed method exhibited the lowest average pixel error and no noticeable abrupt error spikes. Although STM-STFNet achieved a below-average error, it still exhibited regions with substantial local errors. In contrast, the other methods not only produced higher average errors but also exhibited more severe and frequent error spikes. In summary, the proposed method maintains superior performance even under substantial spectral variations in surface features, further demonstrating its robustness and strong generalizability in complex spatiotemporal fusion scenarios.
3.3.4. Tianjin Dataset Result
As shown in Table 4, compared with the CIA and LGC datasets, all methods exhibited reduced fusion performance on the Tianjin dataset across the quantitative metrics, indicating that this dataset imposes greater demands on generalization and robustness. Nevertheless, SFT-GAN consistently achieved the best or second-best performance across most key metrics, with particularly notable results in SAM. As illustrated in Figure 13, GAN-STFM produced the poorest fusion results, and FSDAF exhibited severe spectral distortion. EDCSTFN failed to preserve fine details, while STM-STFNet suffered from both significant texture loss and spectral distortion. Local visualizations in Figure 14 further confirm that all methods suffered from varying degrees of spectral distortion. Although MLFF-GAN produced visually appealing results, noticeable noise degraded its performance, resulting in suboptimal quantitative scores. As shown in the absolute average residual maps in Figure 14, the proposed method achieved the lowest average error; although MLFF-GAN and EDCSTFN also performed relatively well, their results showed noticeable local errors. Overall, the proposed method demonstrated superior performance in handling urban phenological changes. It effectively mitigated the spectral distortion typically observed in traditional methods under complex urban conditions and alleviated the blurring of local details common in deep learning-based models. These results highlight the strong generalization capability and robustness of the proposed approach. They also underscore a key limitation of data-driven deep learning methods: their heavy reliance on training data. When applied to challenging datasets such as AHB or Tianjin, which are characterized by significant land cover changes and large temporal gaps between image pairs, these methods may suffer substantial performance degradation or even complete failure.
3.3.5. Computational Load
To evaluate computational load, we report the number of parameters, multiply-accumulate operations (MACs), GPU memory usage during training, and training time for each deep learning-based method. MACs indicate the number of multiply-accumulate operations needed to process a six-band image with a resolution of 256 × 256. For the GPU memory measurement, we used a batch size of 16 and a patch size of 256 × 256, evaluated on the CIA dataset with all other settings kept at their default values. Time refers to the duration of a single training epoch. The computational load evaluation results are summarized in Table 5, where Former-GAN denotes a variant of the proposed method with the Sparse Transformer Block replaced by a standard Vision Transformer Block.
Among the five deep learning methods, both STM-STFNet and the proposed SFT-GAN are based on the transformer architecture, leading to relatively large parameter counts. However, with the introduction of the Sparse Transformer Block, SFT-GAN significantly reduces computational complexity, achieving the lowest MACs among all methods—approximately 29% of those required by MLFF-GAN. In addition, due to the sparsity mechanism embedded in the Sparse Transformer Block, SFT-GAN achieves the lowest GPU memory usage during training, consuming only 6.72 GiB. This advantage enables the method to process larger-scale remote sensing imagery under identical hardware conditions, effectively reducing dependence on high-performance computing resources. Moreover, SFT-GAN shows superior training efficiency, reducing training time by approximately 80% compared to STM-STFNet. A comparison with Former-GAN further confirms that the Sparse Transformer Block effectively reduces both computational and memory complexities. In summary, SFT-GAN not only significantly reduces computational cost and training time, but also greatly enhances model usability and practicality through sparse optimization strategies. These advantages make it a promising solution for resource-constrained applications, such as onboard processing on unmanned aerial vehicles.
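As a point of reference, parameter counts and MACs of the kind reported in Table 5 can be obtained with standard tooling. The sketch below is one possible approach, not the measurement script used here; it assumes a generator that takes a single six-band 256 × 256 tensor, whereas spatiotemporal fusion models typically take several inputs (fine reference and coarse images), in which case the inputs tuple must be adjusted accordingly.

```python
import torch
from thop import profile  # pip install thop

def report_complexity(model):
    """Count trainable parameters and MACs for one six-band 256 x 256 input."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    dummy = torch.randn(1, 6, 256, 256)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    print(f"params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")
    # Peak training memory can be read with torch.cuda.max_memory_allocated()
    # after a representative forward/backward pass.
```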
3.3.6. Classification Results of Fusion Images
To further validate the practicality of the proposed method, a classification experiment was conducted to assess the quality and usability of the fused images. Specifically, a Support Vector Machine (SVM) classifier was used to classify the fused images generated from the CIA dataset. As the CIA dataset lacks predefined land cover categories, the images were manually categorized into six land cover types. Classification results from the fused images generated by SFT-GAN and the competing methods were compared with those from the true high-resolution images. The results are shown in Figure 15 and summarized in Table 6. The experimental results indicate that the fused images generated by SFT-GAN achieve the highest Overall Accuracy (OA) of 80.89% and a Kappa coefficient of 0.7259, demonstrating superior classification performance. These findings further validate the practical value of SFT-GAN and demonstrate that the proposed modules effectively improve both spectral consistency and spatial detail representation. Consequently, SFT-GAN provides more reliable data for downstream remote sensing tasks such as land cover classification and change detection.
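A minimal version of this classification experiment can be written with scikit-learn as follows. The helper is hypothetical (the actual training/test split, kernel settings, and the six manually defined classes are not specified here) and is intended only to show how OA and the Kappa coefficient are obtained from a fused image.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score

def classify_fused(fused, labels, train_mask):
    """Pixel-wise SVM classification of a fused image.

    fused      : array of shape (bands, H, W)
    labels     : integer class map of shape (H, W)
    train_mask : boolean map of shape (H, W), True for training pixels
    """
    x = fused.reshape(fused.shape[0], -1).T       # (pixels, bands)
    y = labels.ravel()
    m = train_mask.ravel()
    # In practice only a labeled subsample is used; the full image is shown for brevity.
    clf = SVC(kernel="rbf").fit(x[m], y[m])
    pred = clf.predict(x[~m])
    oa = accuracy_score(y[~m], pred)              # Overall Accuracy
    kappa = cohen_kappa_score(y[~m], pred)        # Kappa coefficient
    return oa, kappa
```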
3.4. Ablation Study
The ablation study consists of two parts: (1) evaluating the effectiveness of each proposed module in the SFT-GAN framework; (2) assessing the impact of different parameters in the Detail Compensation Module on fused image quality, including a comparison of the Butterworth, ideal, and Gaussian filters. The ablation experiments were conducted on the CIA dataset, focusing on the fused image corresponding to 11 January 2002. During training, the initial learning rate was set to and was reduced by 20% every 10 epochs. The batch size was set to 16, with a total of 500 training epochs.
To evaluate the contributions of the Sparse Transformer Module (STM), Detail Compensation Module (DCM), and Spectrum Compensation Module (SCM), four ablation experiments were conducted: (1) retaining SCM while removing DCM from SFT-GAN; (2) retaining DCM while removing SCM from SFT-GAN; (3) removing both DCM and SCM from SFT-GAN; (4) replacing the STM in SFT-GAN with a standard Vision Transformer.
The ablation study evaluated the effectiveness of multi-module collaboration by comparing the performance of different module combinations. The results are presented in Table 7. When STM, DCM, and SCM are all enabled, the model achieves the best performance across all evaluation metrics. Removing SCM alone significantly increases the SAM value to 3.7743, highlighting the importance of SCM in preserving spectral fidelity. In contrast, removing DCM decreases the SSIM to 0.8631, demonstrating its essential role in fine detail compensation. When only STM is retained, both RMSE and PSNR deteriorate, further confirming the necessity of DCM and SCM working in concert. Notably, replacing STM with a standard Vision Transformer leads to lower PSNR and SSIM compared with the complete model, indicating that STM reduces computational and memory complexities without sacrificing fusion accuracy. In conclusion, the joint use of STM, DCM, and SCM effectively balances spatial detail enhancement, spectral consistency, and structural similarity, offering a robust solution for multi-source remote sensing image spatiotemporal fusion.
Additionally, an ablation study was conducted to investigate the impact of different DCM parameters and low-pass filter types, replacing the Butterworth filter with the Ideal and Gaussian filters. Two parameter sets were tested for the Gaussian and Ideal filters, and four for the Butterworth filter. The quantitative evaluation results are presented in Table 8. For the Butterworth filter, the first parameter is the cutoff frequency and the second is the filter order n; the Gaussian and Ideal filters are single-parameter filters, controlled by the standard deviation and the cutoff frequency, respectively.
Overall, the Butterworth filter achieves the best performance under the parameter combination (parameter 1 = 50, 150, 250; parameter 2 = 2, 4, 4), particularly in terms of PSNR and SAM. Notably, parameter 2 has a significant impact on spectral fidelity. When parameter 1 is fixed, changing parameter 2 from (2, 4, 4) to (2, 2, 4) leads to a sharp increase in the SAM value to 4.0781, highlighting its strong influence on spectral consistency. In contrast, although the Gaussian filter yields lower RMSE values with parameter 1 = 1, 2, 3, its performance on other metrics is inferior to that of the Butterworth filter. The Ideal filter performs well in terms of SSIM but shows poor results on other metrics. In summary, the Butterworth filter provides the best trade-off between spatial detail preservation, spectral accuracy, and structural consistency, making it the optimal filter choice for the DCM.
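For completeness, the sketch below shows a standard Butterworth low-pass decomposition of the kind underlying this ablation, with the conventional transfer function given in the docstring. The interpretation of parameter 1 as the cutoff frequency and parameter 2 as the order n follows the description above, while the exact frequency normalization and units of the cutoff are assumptions.

```python
import numpy as np

def butterworth_decompose(band, d0=50.0, n=2):
    """Split one band into low- and high-frequency parts with a Butterworth low-pass filter.

    Transfer function: H(u, v) = 1 / (1 + (D(u, v) / d0)^(2 n)),
    where D(u, v) is the distance from the frequency-domain origin.
    """
    h, w = band.shape
    # Frequency-index coordinates matching the unshifted FFT layout.
    u = np.fft.fftfreq(h)[:, None] * h
    v = np.fft.fftfreq(w)[None, :] * w
    dist = np.hypot(u, v)
    lp = 1.0 / (1.0 + (dist / d0) ** (2 * n))     # Butterworth low-pass response
    spec = np.fft.fft2(band)
    low = np.real(np.fft.ifft2(spec * lp))        # low-frequency (smooth) component
    high = band - low                             # high-frequency detail component
    return low, high
```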
3.5. Stability Study
To evaluate the stability of the proposed model, a stability experiment was designed. Based on the CIA dataset, a total of 15 image pairs were selected, and 10 random experiments were conducted. In each experiment, 3 image pairs were randomly chosen as the test set, and the remaining 12 pairs were used for training. The results on the test set were averaged in each round to obtain the final result for that experiment. RMSE, SSIM, SAM, ERGAS, PSNR, and UIQI were used as evaluation metrics, and the standard deviation of each metric across the 10 experiments was calculated to assess performance stability. The results are summarized in Table 9.
Table 9 reports the mean and standard deviation of each performance metric across the 10 randomized experiments. UIQI and SSIM exhibit relatively low standard deviations, indicating that the proposed method is stable in reconstructing structural information. In contrast, the standard deviation of SAM is relatively high, suggesting that spectral reconstruction performance is somewhat sensitive to the composition of the training samples. Overall, although the model exhibits some variation across experiments, the performance fluctuations remain within a reasonable and acceptable range, indicating that the proposed method is stable.
4. Discussion
To overcome the limitations of existing deep learning-based multi-source remote sensing spatiotemporal fusion methods, this article proposes a novel approach based on a GAN and a sparse transformer. The generator consists of three main stages: feature extraction, feature fusion, and information compensation. In the feature extraction stage, a Sparse Transformer Module is applied to reduce computational complexity while preserving the model's feature extraction capability. During feature fusion, the Feature Reconstruction Module leverages a channel attention mechanism to flexibly integrate spatial detail and temporal variation features. In the information compensation stage, the Detail Compensation Module applies frequency-domain decomposition to recover high-frequency details, thereby enhancing spatial fidelity. Meanwhile, the Spectrum Compensation Module improves spectral fidelity by incorporating band correlation constraints.
Through this multi-stage design, the proposed method achieves an effective trade-off between spectral fidelity, spatial detail preservation, and computational efficiency. Notably, comparative experiments on four benchmark datasets demonstrate that the Sparse Fast Transformer fusion method based on Generative Adversarial Network (SFT-GAN) consistently outperforms existing state-of-the-art methods in both quantitative metrics and visual quality. Moreover, ablation studies validate the individual contributions of each component, particularly highlighting the importance of detail compensation and spectrum compensation in preserving fine-grained spatial and spectral information. Despite these promising results, certain limitations remain: the model may exhibit reduced robustness under extreme atmospheric conditions or abrupt land cover changes, as observed in the results on the Tianjin dataset.
5. Conclusions
This study addresses the persistent challenges in multi-source remote sensing spatiotemporal fusion, including insufficient spectral fidelity and high computational complexity. To this end, we propose the Sparse Fast Transformer fusion method based on Generative Adversarial Network (SFT-GAN), a novel fusion framework that integrates a sparse transformer-based generator with specialized compensation modules for detail and spectral restoration. Through a sparse optimization strategy, the model significantly reduces computational overhead, making it suitable for resource-constrained platforms such as UAVs. Experimental results on four diverse public datasets demonstrate that SFT-GAN achieves superior fusion accuracy and generalization capability across varying spatial and temporal scenarios.
In particular, the proposed Spectrum Compensation Module markedly enhances spectral fidelity, ensuring the applicability of the fused images in downstream tasks such as land use monitoring and ecological environment assessment. Overall, the method strikes an effective balance between accuracy and efficiency, representing a practical solution for real-world remote sensing applications.
However, the proposed method still has certain limitations. The performance of SFT-GAN may decline when land cover types undergo drastic changes, and future research will focus on further optimizing the network architecture to enhance its adaptability under complex land cover change conditions. Currently, most existing fusion methods adopt an early fusion strategy, namely feature-level fusion; future research will explore the potential of late fusion strategies [57] in the spatiotemporal fusion of remote sensing images, aiming to enhance fusion performance and improve model generalization. Additionally, we will investigate the integration of spectral physical priors into deep learning models to further enhance the spectral fidelity of fused images. Finally, because most standard benchmark datasets are based on Landsat–MODIS data, we plan to incorporate data from other types of sensors (e.g., Gaofen-1) into spatiotemporal fusion studies to further validate the generalization capability of the model; this will be one of the key directions of our future work.