1. Introduction
Land surface temperature (LST) is one of the most important indicators for monitoring the earth’s resources and studying the surface ecosystem, and it strongly influences hydrological, ecological, environmental and biogeochemical studies. Many satellite platforms provide essential LST products on a global scale, each with distinct trade-offs between spatial and temporal resolution. For example, the Moderate Resolution Imaging Spectroradiometer (MODIS) on Terra/Aqua provides near-daily global coverage at a resolution of approximately 1 km. Landsat satellites (e.g., the Landsat 8/9 Thermal Infrared Sensor, TIRS) provide higher resolution (~100 m native, often resampled to 30 m) but revisit only every 16 days. The Sentinel-3 Sea and Land Surface Temperature Radiometer (SLSTR) has spatiotemporal characteristics similar to those of MODIS. Geostationary satellites achieve very high temporal resolution (sub-hourly) but at coarser spatial scales (typically 2–5 km).
Given that contemporary satellites cannot simultaneously deliver land-surface temperature (LST) imagery with both high spatial and high temporal resolution, researchers typically fuse two complementary types of data: (i) high-spatial-resolution but low-temporal-resolution LST images (e.g., Landsat scenes with 30 m resolution acquired every 16 days), and (ii) low-spatial-resolution but high-temporal-resolution images (e.g., daily MODIS scenes at 1000 m resolution) [1,2,3,4]. Numerous traditional algorithms have been developed for this purpose [5,6,7,8]. Widely used approaches include weighting models, such as the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) proposed by Gao et al. [9], the Enhanced STARFM (ESTARFM) by Zhu et al. [10], and STNLFFM proposed by Cheng et al. [11]; as well as unmixing models like the Spatiotemporal Data Fusion Algorithm (STDFA) introduced by Wu et al. [12] and Zhang et al.’s Enhanced Spatiotemporal Data Fusion Model (ESTDFM) [13]. There are also hybrid models, among which the Flexible Spatiotemporal DAta Fusion (FSDAF) model proposed by Zhu et al. is widely used [14]. These classical algorithms are computationally efficient and robust, requiring minimal ancillary information (e.g., land-cover maps or physical model parameters). They achieve fusion using multisource remote-sensing imagery alone, thus lowering the barriers to application and ensuring wide applicability [15]. However, over complex heterogeneous surfaces (e.g., regions characterized by mixed pixels), conventional methods like STARFM often fail to accurately capture local abrupt changes, thereby reducing fusion accuracy [10,16]. For example, the Root Mean Square Error (RMSE) of weight-based models in forest–farmland ecotones may exceed 2 K [1].
With recent advancements in computer vision and deep-learning theory, many scholars have used deep-learning techniques to perform spatiotemporal fusion of surface temperature products [15,17,18,19]. These methods use deep neural networks to automatically learn features from massive amounts of historical data and establish nonlinear mapping relationships between high-temporal but low-spatial resolution and low-temporal but high-spatial resolution images [20]. Deep-learning methods have demonstrated great potential in the field of spatiotemporal fusion for remote sensing images due to their high accuracy and robustness. Among these, convolutional neural networks (CNNs) have been most extensively employed [21,22,23]. For example, Zhou et al. achieved an 87.21% recognition accuracy in carbon emission monitoring from remote sensing images by employing a multi-model fusion strategy (e.g., fully connected layer fusion) and Gamma correction data augmentation, demonstrating the effectiveness of model fusion in enhancing the robustness and precision of remote-sensing tasks [24]. Tan et al. proposed EDCSTFN, an enhanced deep convolutional model for spatiotemporal image fusion [25]. By integrating a compound loss function and an enhanced data strategy, EDCSTFN improves the prediction accuracy and image quality of spatiotemporal remote sensing image fusion. However, it still has some limitations, such as restricted capability in predicting significant ground changes and the need for further exploration of model transferability [26]. Wu et al. constructed CAFE, a cross-attention-based adaptive weighting fusion network for MODIS and Landsat spatiotemporal fusion [26]. By leveraging a cross-attention mechanism and an adaptive temporal difference weighting mechanism, CAFE can effectively capture both subtle and dramatic changes on the land surface, generating more accurate fusion results. However, its complex network structure demands substantial computational resources. These limitations, particularly concerning computational efficiency and the preservation of fine details under complex scenarios, are recognized as common challenges within the broader field of deep learning-based image fusion [27,28]. Liu et al. proposed StfNet, a two-stream convolutional neural network for spatiotemporal image fusion, which uses dual-stream networks and temporal information to improve image-fusion accuracy [1]. However, its drawbacks include reliance on high-resolution images taken at close intervals and high computational complexity [29]. Extensive experiments have demonstrated the effectiveness of CNNs in multisource remote-sensing image fusion, as their multi-layer architectures can automatically extract hierarchical features, thus better representing image details and structural information.
The application of deep learning to land surface temperature (LST) downscaling has garnered increasing attention [30]. Unlike traditional methods relying on manually constructed spectral indices and traditional statistical models, deep learning models can automatically learn deep, complex nonlinear features directly from raw remote sensing data, thereby more effectively characterizing the mapping relationship between surface temperature and multi-source auxiliary data. For instance, convolutional neural networks (CNNs) and their variants have been successfully applied to LST downscaling tasks. Through end-to-end training, these models simultaneously capture spatial texture details and overall temperature distribution trends, significantly enhancing downscaling accuracy and spatial detail-retention capabilities over heterogeneous surfaces [31]. Improvements in deep-learning approaches primarily draw inspiration from advanced architectures and mechanisms in computer vision. Examples include introducing multi-scale feature-fusion modules to enhance model sensitivity to different object boundaries, or employing attention mechanisms to dynamically weight important spatial and channel features. Furthermore, to address challenges posed by different sensors, temporal phases, and seasonal variations, physically guided correction modules are increasingly integrated with data-driven deep-learning models to enhance model generalization and physical consistency [31,32,33,34]. However, despite the immense potential demonstrated by deep learning, current methods still face common challenges such as insufficient retention of high-frequency details, weak model interpretability, and high dependence on training data volume and quality.
Despite these advancements, two primary issues persist: (1) For complex terrains, current deep-learning models continue to produce fused LST images with insufficiently sharp representation of surface features, textures, and fine details. (2) The mean absolute error (MAE) of fused LST images typically remains around 2.5 K [1].
The aim of this study is to address the critical challenges of high-precision land surface temperature (LST) fusion and spatial detail preservation in heterogeneous surface areas. To this end, a novel deep learning spatiotemporal fusion model, DLSTFM, is proposed with the following key contributions:
A Dual-Branch Fusion Structure: This study designs a specifically tailored architecture that decouples coarse spatiotemporal fusion (MODIS target–reference pairs) from fine spatiotemporal fusion (Landsat reference features). This hierarchical approach enables effective learning of both temporal dynamics and spatial details, significantly improving the quality of predicted images in heterogeneous regions compared to single-branch frameworks.
Spatial Adaptive Feature Modulation (SAFM): SAFM is designed for dynamic multi-scale spatial feature integration. Through hierarchical decomposition and resolution-adaptive operations (e.g., separable convolution for base features, adaptive downsampling for high-level features), SAFM enhances cross-scale representation. Ablation studies confirm its critical role in preserving high-frequency details, with its removal increasing RMSE by 4.294 K and causing visible blurring.
Temperature Adaptive Correction Module (TCM): TCM combines radiometric adjustment principles with deep learning. It employs adaptive calibration using Landsat and MODIS reference data to address inherent temperature biases arising from sensor differences and acquisition conditions. Removing TCM increased MAE by 3.87 K, demonstrating the synergy of physics and data-driven learning.
The dual-branch structure, SAFM, and TCM of DLSTFM work in concert. The dual-branch structure provides a clear information processing flow, SAFM ensures spatial detail enhancement and preservation, and TCM applies a physical correction to guarantee temperature value accuracy. This systematic design enables DLSTFM to surpass existing methods such as EDCSTFN and CAFE in both detail clarity and temperature precision when handling heterogeneous surfaces.
2. Materials and Methods
In this paper, to obtain a high-resolution LST image for the target date, a pair of Landsat-8 and MODIS images from a reference date is fused with a single MODIS image acquired on the target date. To this end, this study designs a surface temperature spatiotemporal fusion model called DLSTFM. The core architecture of DLSTFM is composed of an adaptive temperature-correction module and a multi-scale feature-modulation module. As shown in Figure 1, the overall network is split into two main branches, a dual-temporal fusion branch and a multi-source feature-fusion branch, which handle the coarse-prediction and fine-prediction tasks, respectively.
The multi-level fusion network includes three key components. The first is the data input layer, which receives MODIS target images, MODIS reference images, and Landsat reference images. The second is the Spatial Adaptive Feature Modulation (SAFM) module, which extracts and fuses features of the input images using a convolutional layer and an attention mechanism. The third is the loss function, a composite of EdgeLoss and PerceptualLoss (EP Loss) that is mainly used to emphasize the preservation and enhancement of image edge information.
DLSTFM is not a simple stacking of multiple modules, but rather the collaborative operation of the dual-branch structure, SAFM, and TCM. Specifically, the dual-branch structure provides the foundation for feature decoupling, SAFM enhances the representation of spatial details during feature extraction, and TCM further corrects temperature values at the output stage. This collaborative approach enables DLSTFM to maintain clear spatial details in heterogeneous regions while achieving accurate temperature values.
2.1. Spatially Adaptive Feature Module
As shown in Figure 2, the SAFM module is part of the Fusion Module for Multimodal Features (FMM), the core innovative component of this study, which aims to realize the adaptive fusion of multi-scale spatial features. The module effectively enhances the network’s ability to model cross-scale features through hierarchical feature decomposition and spatial weight learning. The SAFM contains three key designs:
2.1.1. Multi-Level Feature Decoding
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, it is first divided equally along the channel dimension into $n$ sub-features $\{X_i\}_{i=1}^{n}$ with $X_i \in \mathbb{R}^{(C/n) \times H \times W}$. This explicit decomposition establishes the basis for subsequent multi-scale processing.
2.1.2. Hierarchical Spatial Weighting
For each sub-feature $X_i$, an adaptive spatial modulation unit is designed. For the base-resolution feature ($i = 1$), local context modeling is performed by directly applying a separable convolution:
$$\hat{X}_1 = \mathrm{DWConv}(X_1)$$
For high-level features ($i > 1$), an adaptive downsampling mechanism is introduced:
$$\hat{X}_i = U\big(\mathrm{DWConv}(P_i(X_i))\big)$$
where $\mathrm{DWConv}(\cdot)$ is the separable convolution, $P_i(\cdot)$ is an adaptive max-pooling operation whose output size is scaled by $2^{-i}$, and $U(\cdot)$ denotes nearest-neighbor upsampling.
2.1.3. Feature-Aggregation Mechanisms
Cross-level features are fused by channel concatenation followed by a linear projection, and feature modulation is realized through a gating mechanism based on the Hadamard product (⊙). Experiments show that the SAFM module significantly improves the network’s discriminative representation of spatial features through parameter sharing (grouped convolution) and multi-scale co-optimization, while maintaining computational efficiency. Ablation experiments verify that the optimal performance balance is reached when the number of levels is n = 4.
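The three SAFM steps described above (channel split, per-level spatial weighting, aggregation with gating) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 3 × 3 depthwise kernel, the pooling factor of `2 ** i` for the zero-based level `i`, and the sigmoid gate are all assumptions, since the text does not specify them.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 convolution with zero padding.
    x: (C, H, W); kernels: (C, 3, 3), one filter per channel."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def max_pool(x, factor):
    """Max pooling by an integer factor (H and W must be divisible by it)."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).max(axis=(2, 4))

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def safm(x, dw_kernels, proj, n=4):
    """x: (C, H, W); dw_kernels: (C, 3, 3); proj: (C, C), acting as a 1x1 convolution."""
    groups = np.split(x, n, axis=0)               # equal split along channels
    kernel_groups = np.split(dw_kernels, n, axis=0)
    branches = []
    for i, (g, k) in enumerate(zip(groups, kernel_groups)):
        if i == 0:                                # base resolution: convolve directly
            branches.append(depthwise_conv3x3(g, k))
        else:                                     # pool, convolve, upsample back
            factor = 2 ** i                       # assumed pooling factor
            pooled = max_pool(g, factor)
            branches.append(upsample_nearest(depthwise_conv3x3(pooled, k), factor))
    fused = np.concatenate(branches, axis=0)      # channel concatenation
    fused = np.einsum("oc,chw->ohw", proj, fused) # linear projection
    gate = 1.0 / (1.0 + np.exp(-fused))           # sigmoid gate (assumed activation)
    return gate * x                               # Hadamard product with the input

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16, 16))
modulated = safm(features, rng.normal(size=(8, 3, 3)), rng.normal(size=(8, 8)))
```

The output keeps the input shape, so a block of this kind can be dropped between other layers without changing the surrounding tensor sizes.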
2.2. Temperature Adaptive Correction Module
When fusing imagery from different satellite sensors, discrepancies in acquisition times and sensor characteristics create temperature inconsistencies between Landsat-8 and MODIS LST images collected on the same date, which in turn degrade the accuracy of the fused product. To mitigate this issue, this paper incorporates a temperature adaptive correction module (TCM) into DLSTFM (Figure 3). This module refines the fused LST by referencing both Landsat and MODIS observations.
During the prediction phase, TCM employs only MODIS imagery from the target date, reference date Landsat imagery, and reference date MODIS imagery for temperature calibration. It does not utilize actual Landsat imagery from the target date, thereby eliminating any potential data-leakage issues.
2.3. Loss Function
In order to enhance the spatial detail fidelity and semantic consistency of the fusion results, this study proposes a composite loss function (Edge-Perceptual Loss, EP Loss) that fuses multi-scale edge constraints and depth-perceptual features [35]. This loss function adaptively balances different supervised signals through a dynamic weight adjustment mechanism, and its core components are as follows:
2.3.1. Multi-Scale Laplacian Edge Loss
The edge difference between the predicted image and the real image is computed at multiple scales using an improved Laplacian pyramid. Specifically, multi-scale Gaussian kernels are used to smooth the input image, and the residuals are extracted as edge features. The loss function is given as:
$$\mathcal{L}_{\mathrm{edge}} = \sum_{s \in S} \rho\big(\Delta_s(\hat{Y}) - \Delta_s(Y)\big)$$
where $\hat{Y}$ and $Y$ are the predicted and real images, $\Delta_s(\cdot)$ denotes the Laplace edge extraction operation with scale factor $s$, $S$ is the multiscale set, and $\rho(x) = \sqrt{x^2 + \epsilon^2}$ is the Charbonnier norm with smoothing term $\epsilon$.
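A multi-scale edge loss of this kind can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not the authors' code: the Gaussian widths in `sigmas`, the value of `eps`, the edge-replicated border handling, and averaging the Charbonnier penalty over pixels before summing over scales are all illustrative choices.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge-replicated borders. img: (H, W)."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def laplacian_edges(img, sigma):
    """Edge residual: the image minus its Gaussian-smoothed version."""
    return img - gaussian_blur(img, sigma)

def charbonnier(x, eps=1e-3):
    """Smooth approximation of |x|."""
    return np.sqrt(x ** 2 + eps ** 2)

def edge_loss(pred, target, sigmas=(1.0, 2.0, 4.0), eps=1e-3):
    """Sum over scales of the mean Charbonnier penalty on edge differences."""
    per_scale = [
        charbonnier(laplacian_edges(pred, s) - laplacian_edges(target, s), eps).mean()
        for s in sigmas
    ]
    return float(np.sum(per_scale))

rng = np.random.default_rng(0)
target = rng.normal(300.0, 5.0, size=(40, 40))     # toy LST patch in kelvin
pred = target + rng.normal(0.0, 1.0, size=(40, 40))
loss_value = edge_loss(pred, target)
```

Because the Charbonnier penalty never drops below `eps`, identical images give a small positive floor (`len(sigmas) * eps`) rather than exactly zero.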
2.3.2. Hierarchical Perceptual Loss
Based on the pre-trained VGG16 network, the shallow convolutional features of the images are extracted, and the consistency between the predicted image and the real image in feature space is enforced by a mean-square-error constraint:
$$\mathcal{L}_{\mathrm{perc}} = \sum_{l \in \Lambda} w_l \, \mathrm{MSE}\big(\phi_l(\hat{Y}), \phi_l(Y)\big)$$
where $\phi_l(\cdot)$ denotes the feature mapping of the $l$-th layer of VGG16, $\Lambda$ is the set of selected feature layers, and $w_l$ is the weight coefficient of each layer.
2.3.3. Adaptive Weighting Mechanism
In order to avoid setting fixed weights manually, EP Loss dynamically adjusts the weights according to the ratio of the two types of loss values in the current training iteration:
The final loss function is:
Through adaptive tuning, the model is able to adapt more flexibly to the learning needs of edge details and full structures in different training phases.
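The exact weighting formula is not reproduced in this text, so the following is only one plausible reading of "weights according to the ratio of the two loss values": each term is weighted by its share of the current total. The function names and the `eps` guard are assumptions.

```python
def adaptive_weights(edge_value, perceptual_value, eps=1e-8):
    """Weight each loss term by its share of the current total (assumed rule)."""
    total = edge_value + perceptual_value + eps
    return edge_value / total, perceptual_value / total

def ep_loss(edge_value, perceptual_value):
    """Combine the two loss values using the ratio-based weights."""
    w_edge, w_perc = adaptive_weights(edge_value, perceptual_value)
    return w_edge * edge_value + w_perc * perceptual_value

w_edge, w_perc = adaptive_weights(3.0, 1.0)
combined = ep_loss(3.0, 1.0)
```

In an automatic-differentiation framework the weights would normally be computed from detached loss values, so that gradients flow only through the loss terms themselves. The opposite convention (weighting each term by the other term's share, to equalize their magnitudes) is equally compatible with the description.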
2.4. Dataset Description and Relevant Settings for Training
2.4.1. Experimental Dataset
The primary study area of this work is the Griffith region in Australia, which features a temperate Mediterranean climate with distinct seasonal temperature variations. The land surface is highly heterogeneous, encompassing diverse cover types such as cropland, forest, water bodies, grassland, and urban built-up areas. These features are interwoven spatially, creating complex landscape patterns and numerous sharp temperature transition zones. This diversity in climate and surface characteristics provides an ideal testbed for validating spatiotemporal fusion models under complex, real-world conditions. This study used thermal-infrared land surface temperature data from Landsat 8 TIRS and MODIS Terra. The resulting dataset, designated LM-fusion, was constructed specifically for this research; each image measures 1238 pixels in length and 1121 pixels in width. All comparative methods employed in this study use the same LM-fusion dataset and are evaluated under identical training/testing splits. Prior to analysis, all images underwent standardized preprocessing, including radiometric correction, atmospheric correction, rotation, and cropping, to ensure the integrity and consistency of the data. As displayed in Table 1, the dataset consists of 27 pairs of Landsat-8 and MODIS land surface temperature images, for a total of 54 images, divided into a training set of 21 pairs and a testing set of 6 pairs. The dataset encompasses a variety of land-cover types, including lakes, farmlands, forests, grasslands, and urban buildings, and its land-cover characteristics are more complex than those of other datasets, providing a rich array of scenarios and challenges for model training and testing. Because these land-cover types are prevalent in urban areas, the dataset is a reliable representation of such settings. To fully leverage this diversity and enhance the model’s ability to learn local patterns, the training images were segmented into 40 × 40 pixel patches. This strategy significantly increases the effective number and diversity of training samples, which helps the model generalize better and mitigates the risk of overfitting.
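The patch segmentation can be illustrated with a short NumPy sketch. The stride is not stated in the text, so non-overlapping tiles (stride 40) with incomplete border tiles discarded are assumed here, and `extract_patches` is a hypothetical helper name.

```python
import numpy as np

def extract_patches(image, size=40, stride=40):
    """Cut a 2-D image into size x size patches; partial border tiles are dropped."""
    h, w = image.shape
    return np.stack([
        image[r:r + size, c:c + size]
        for r in range(0, h - size + 1, stride)
        for c in range(0, w - size + 1, stride)
    ])

scene = np.zeros((1238, 1121), dtype=np.float32)   # one LM-fusion-sized image
patches = extract_patches(scene)
print(patches.shape)
```

Under these assumptions each 1238 × 1121 image yields 30 × 28 = 840 patches, so the 21 training pairs would give 17,640 patch pairs; an overlapping stride would give more.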
2.4.2. Relevant Settings for Training
This study trains the land surface temperature fusion model with the Adam optimization algorithm. Inputs are image patches of 40 × 40 pixels, grouped into batches with a batch size of 32. The learning rate was tuned and finally set to $3.19 \times 10^{-4}$ to achieve rapid convergence. Training runs for a total of 50 iterations, with batch normalization applied in each iteration to accelerate training and enhance model performance.
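For reproducibility, the stated settings can be gathered in one place. The dictionary below is only a convenience sketch; the key names are hypothetical and not tied to any particular training framework.

```python
# Training settings as reported in the text; key names are illustrative.
TRAIN_CONFIG = {
    "optimizer": "adam",
    "patch_size": 40,          # pixels per side
    "batch_size": 32,
    "learning_rate": 3.19e-4,
    "iterations": 50,
}

# Number of LST pixels the model sees in a single batch.
pixels_per_batch = TRAIN_CONFIG["batch_size"] * TRAIN_CONFIG["patch_size"] ** 2
```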
3. Results
To validate the effectiveness of the DLSTFM, it was compared with four existing LST fusion algorithms: STARFM, STNLFFM, EDCSTFN, and CAFE. Among these, STARFM and STNLFFM are filter-based models, while EDCSTFN and CAFE are deep-learning-based models.
Three quantitative metrics—Structural Similarity Index (SSIM), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE)—were employed to evaluate the agreement between the fused outputs and the actual target LST images. Lower RMSE/MAE scores and higher SSIM scores indicate better fusion performance.
3.1. Qualitative Analysis
The experimental results demonstrate that the DLSTFM produces LST images with clear details and strong contrast. As shown in Figure 4, the model effectively identifies hotspots (areas of high temperature) and cold spots (areas of low temperature), with their spatial locations and temperature ranges closely matching those in the ground-truth images. Notably, the model accurately captures the sharp temperature boundaries between urban heat islands and surrounding farmland. The dual-branch structure (dual-temporal fusion branch and multi-source feature fusion branch) and multi-layer convolutional structure of DLSTFM enhance its ability to extract complex features, preserving fine texture details during fusion. Additionally, the temperature-adaptive correction module further improves the model’s detail retention and pixel-wise accuracy, resulting in sharper output images.
STARFM tends to generate images with lower temperatures than the ground truth, and the contrast between low- and high-temperature regions is discernible yet subtle. In heterogeneous regions, such as urban areas or fragmented farmland, the model struggles to capture small-scale temperature variations, resulting in predictions that lack clarity and precision. Moreover, the spectral similarity screening in STARFM may introduce errors in complex terrains [9], particularly in regions with significant temperature gradients, where uneven weight allocation reduces the contrast between high and low temperatures.
STNLFFM output images performed well in terms of overall image clarity and detail texture, similar to STARFM. Compared to deep learning-based models, STNLFFM offers higher computational efficiency, enabling faster fusion. However, the model’s predictions for high-temperature pixels demonstrate a degree of inaccuracy, manifesting as a slight darkening of these regions. This phenomenon may be attributed to an inadequate consideration of high-temperature characteristics during the weight calculation process.
The EDCSTFN model demonstrates comparatively diminished overall clarity, accompanied by a certain degree of loss of texture details and deficiencies in image contrast and sharpness. An analysis of the model’s code suggests that this may stem from detail loss during feature extraction and fusion, simplistic feature-fusion methods, and suboptimal upsampling techniques. The model’s straightforward feature-fusion approach, which involves the incorporation of encoded Landsat features into residual features, may not fully leverage the complementary information present among different input features. This potential defect may result in a decrease in the overall contrast of the output image and blurring of fine details and textures.
Moreover, the selection of upsampling method has been demonstrated to exert a substantial influence on the quality of the resulting image. Although bilinear interpolation, as implemented in EDCSTFN, is computationally efficient, it has been observed to underperform in the recovery of high-frequency details. This has been shown to result in a degree of blurriness in the output.
The CAFE model’s fusion images show significant high–low temperature contrast, with clear spatial differentiation between high-temperature centers and low-temperature areas. The macro-distribution patterns of the low-temperature and high-temperature areas are highly consistent with the target images, which verifies the model’s effectiveness in spatial pattern reconstruction. However, in high-temperature areas, a small number of pixels remain unfused, leading to missing values. This issue may arise from data interpolation strategies or flaws in the fusion rules. Similar to ESTARFM, which relies on linear spectral unmixing for mixed pixels [12], CAFE may fail to extrapolate information if high-temperature changes are inadequately represented in the reference images. Additionally, without dynamic weight adjustment (e.g., inverse distance weighting in EDCSTFN), extreme-temperature pixels may be “diluted” by neighboring low-temperature values, resulting in localized gaps.
The visual comparison above reflects the performance differences among models at the level of the whole image. To further validate DLSTFM’s advantage in detail retention at the pixel scale, this study conducted a temperature-profile analysis. Specifically, a horizontal profile line extending from the left boundary to the image center was extracted from the test image, and temperature was plotted against pixel position (Figure 5). The temperature profile predicted by DLSTFM (red solid line) exhibits the highest agreement with actual surface temperatures (black solid line). Across multiple regions of abrupt temperature change, DLSTFM accurately captures and reproduces the steep temperature gradients, with distinct and sharp transitions in the curve. In contrast, the traditional method STARFM (blue dashed line) produced a prediction curve broadly similar to the actual surface temperature, though its predicted values were slightly lower than the true values in high-temperature regions. The deep learning method EDCSTFN (purple dotted line) exhibited an overly smoothed prediction curve, coupled with significant systematic bias in high-temperature areas, where its predicted values deviated markedly from the actual values. Another deep-learning method, CAFE (green dotted line), performs better than EDCSTFN but still underestimates high-temperature peaks and overestimates low-temperature troughs. STNLFFM (orange dotted line) performed between STARFM and the deep-learning methods. This pixel-level profile analysis visually confirms that DLSTFM, leveraging its dual-branch architecture and Spatially Adaptive Feature Modulation (SAFM) module, more effectively learns and preserves high-frequency spatial details and accurate temperature values in complex heterogeneous surface areas. It overcomes the shortcomings of existing methods, such as blurred details or unstable predictions.
3.2. Quantitative Analysis
Evaluation indicators include:
3.2.1. Structure Similarity Index Measure (SSIM)
The Structure Similarity Index (SSIM) assesses the similarity between two images by comparing their luminance, contrast, and structure, thereby evaluating image quality [36,37]. SSIM values range from −1 to 1, with higher values indicating better image quality. A value of 1 indicates identical images, while −1 indicates completely dissimilar images.
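For reference, the commonly used form of SSIM between two image windows $x$ and $y$ is given below. The text does not state which variant, window size, or constants were used in the experiments, so this is the standard definition rather than a description of the exact evaluation setup.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```

Here $\mu_x$ and $\mu_y$ are the window means, $\sigma_x^2$ and $\sigma_y^2$ the variances, $\sigma_{xy}$ the covariance, and $C_1$, $C_2$ small constants that stabilize the division.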
This study selected imagery from six representative dates between January 2019 and February 2020 to serve as a validation set for model fusion experiments.
Figure 6 presents the quantitative results on the LM-fusion dataset. By comparing the model’s outputs with reference imagery, the model’s effectiveness can be verified. As shown in Table 2, DLSTFM had an average SSIM similar to those of STARFM and STNLFFM, with values of 0.86, 0.861, and 0.859, respectively. DLSTFM improved the average SSIM by 22.9% and 7.8% compared to the deep-learning models EDCSTFN and CAFE, respectively, demonstrating the effectiveness of its dual-branch fusion network structure.
3.2.2. Mean Absolute Error (MAE)
The Mean Absolute Error (MAE) is a common metric for assessing the performance of spatiotemporal fusion models, calculated as the average of the absolute differences between predicted and actual values [38]. Mathematically, MAE is defined as:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
where $y_i$ is the true value of the $i$-th sample, $\hat{y}_i$ is the corresponding predicted value, and $n$ is the total number of samples.
In this experiment, MAE was also a key indicator for evaluating the accuracy of land surface temperature imagery. The results, shown in Figure 6, indicate that DLSTFM exhibited significant accuracy advantages across all six test datasets, with an average MAE of 2.069 K. The average MAEs of the traditional algorithms STARFM and STNLFFM were 2.874 K and 2.879 K, respectively, suggesting similar accuracy in fused land surface temperature values. The deep-learning models had higher MAEs, with EDCSTFN at 6.026 K and CAFE at 6.124 K, indicating poorer adaptability to complex land surface temperature changes. DLSTFM consistently maintained the lowest error, demonstrating superior stability and accuracy. These results confirm the accuracy of DLSTFM in fusing Landsat and MODIS imagery and predicting land surface temperature values.
3.2.3. Root Mean Square Error (RMSE)
The Root Mean Square Error (RMSE) is a widely used metric to assess the discrepancy between a model’s predictions and actual values. It is calculated as the square root of the average of the squared prediction errors [39]. The formula is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
where $y_i$ is the true observed value of the $i$-th sample, $\hat{y}_i$ is the predicted value of the $i$-th sample, and $n$ is the number of samples.
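Both error metrics follow directly from their definitions. The sketch below computes them with NumPy on a toy 2 × 2 example; the array values are invented for illustration and are not taken from the LM-fusion results.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two arrays of the same shape."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_true, y_pred):
    """Root mean square error between two arrays of the same shape."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[300.0, 302.0], [298.0, 305.0]])  # toy LST values in kelvin
pred = np.array([[301.0, 300.0], [298.0, 308.0]])   # errors: +1, -2, 0, +3

print(mae(truth, pred))    # 1.5
print(rmse(truth, pred))   # sqrt(3.5), about 1.871
```

RMSE is never smaller than MAE and grows faster when a few pixels have large errors, which is why the two are usually reported together.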
As depicted in Figure 6, the average RMSE of the DLSTFM across the six test datasets was 2.637 K, significantly lower than those of the traditional methods STARFM and STNLFFM, and 61.4% and 60.1% lower than those of the deep-learning models EDCSTFN and CAFE, respectively. This result illustrates that the DLSTFM, through its dual-branch fusion network and temperature adaptive correction module, effectively addresses the insufficient feature capture of traditional algorithms in complex surface areas and overcomes the temperature prediction bias caused by detail loss in existing deep-learning models. Specifically, the average RMSE of DLSTFM on the validation set was only 2.023 K, representing improvements of 10.7% and 49.0% compared to STARFM (2.264 K) and CAFE (3.962 K), respectively, verifying the prediction ability of the Multi-scale Feature Modulation Module (FMM) on local temperature mutations. Additionally, the RMSE standard deviation of DLSTFM is 0.391 K, lower than that of the other models, indicating stronger stability and generalization across multiple scenarios. Overall, the significant RMSE advantage confirms the practicality of DLSTFM in improving the accuracy of spatiotemporal fusion of land surface temperature. The land surface temperature output by the DLSTFM is highly consistent with the reference values, providing an effective technical solution for the continuous reconstruction of land surface temperature in space and time.
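The percentage improvements quoted for the validation set are relative reductions with respect to the baseline error. A short sketch of that arithmetic, using the rounded RMSE values given in the text (`relative_reduction` is a hypothetical helper name):

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

vs_starfm = relative_reduction(2.264, 2.023)   # STARFM -> DLSTFM
vs_cafe = relative_reduction(3.962, 2.023)     # CAFE -> DLSTFM
print(round(vs_starfm, 1), round(vs_cafe, 1))
```

With the rounded inputs this gives roughly 10.6% and 48.9%, in line with the reported 10.7% and 49.0%; differences in the last digit are expected if the reported figures were computed from unrounded errors.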
4. Discussion
To validate the effectiveness of the DLSTFM network structure and feature modules, this study modified the network structure and deleted relevant modules and conducted related experiments on the LM-fusion dataset.
4.1. Ablation Experiment on Feature Modules
The ablation experiments demonstrate that incorporating the FMM into the dual-branch fusion structure significantly enhances the generalization capability on the LM-fusion dataset. This integration enhances the spatial detail in generated imagery and improves the accuracy of surface temperature estimates. The outcomes were evaluated using SSIM, MAE, and RMSE metrics. As illustrated in Table 3, removing the FMM and TCM substantially degraded all test metrics, thereby validating the effectiveness of both modules.
The FMM ablation results indicate that the module plays a critical role in maintaining spatial details and temperature accuracy. As presented in Table 3, removing the FMM reduced SSIM by 12.56%. This decline was accompanied by spatial blurring in the generated images, suggesting a notable impairment in the extraction of high-frequency features. Temperature prediction errors also increased significantly, with overall image temperatures lower than those of the target images and the root mean square error (RMSE) and mean absolute error (MAE) increasing by 4.294 K and 4.072 K, respectively. This finding underscores the pivotal role of the FMM in precisely reconstructing the surface temperature field and in capturing subtle temperature fluctuations, particularly in regions characterized by substantial temperature gradients. The visual quality of the fused images was also evaluated. The experimental results in Figure 7 indicate that in low-temperature regions and some high-temperature regions, the removal of the FMM resulted in significant brightness differences between the model-generated images and the target images over large areas. This finding further substantiates the role of the FMM in regulating pixel values and maintaining temperature consistency.
The ablation of the TCM revealed a different pattern: although the SSIM decreased by only 0.34%, with the change visible mainly as brightness differences, temperature errors still increased significantly, with RMSE and MAE rising by 3.671 K and 3.873 K, respectively. This suggests that removing the TCM primarily reduces the accuracy of the predicted temperature values rather than the structural fidelity of the images.
It is noteworthy that removing either module resulted in temperature errors exceeding 5 K, but the error increase was more pronounced when the FMM was removed. This finding indicates that, in the joint optimization of space and temperature, the FMM plays a more fundamental role in feature expression, while the TCM focuses on deep correction of temperature features.
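The 5 K figure can be cross-checked against the reported error increases for each single-module ablation. The sketch below assumes that the full-model baseline equals the LM-fusion errors stated in the conclusions (RMSE 2.637 K, MAE 2.1 K); that pairing is an assumption made here for illustration, and Table 3 remains the authoritative source for the ablated values.

```python
# Assumption: baseline errors of the full DLSTFM, taken from the values
# quoted in the conclusions section (RMSE 2.637 K, MAE 2.1 K).
BASELINE = {"RMSE": 2.637, "MAE": 2.1}

# Reported error increases (in K) when a single module is removed.
INCREASE = {
    "FMM removed": {"RMSE": 4.294, "MAE": 4.072},
    "TCM removed": {"RMSE": 3.671, "MAE": 3.873},
}

def implied_errors(baseline, increase):
    """Add each reported increase to the baseline, rounded to 3 decimals."""
    return {
        variant: {m: round(baseline[m] + delta[m], 3) for m in baseline}
        for variant, delta in increase.items()
    }

implied = implied_errors(BASELINE, INCREASE)
for variant, errors in implied.items():
    print(variant, errors)
```

Under this assumption, every implied RMSE and MAE lies above 5 K, and the FMM ablation gives the larger value for both metrics.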
In summary, all test indicators deteriorated after the two key modules, FMM and TCM, were removed, suggesting that both play a crucial role in maintaining the spatial details, temperature accuracy, and overall visual quality of the fused images.
4.2. Ablation Experiment on Network Structure
The network architecture of DLSTFM was subjected to an experimental study that focused on two structures: fast fusion and dual-branch fusion.
Fast Fusion Structure: This approach directly concatenates features from the reference MODIS, target MODIS, and reference Landsat imagery. The findings suggest that it yields lower accuracy than the original model. This is primarily because fast fusion uses a single branch that integrates the three inputs at an early stage along the channel dimension. While this expedites the fusion process and reduces computational demands, making it suitable for remote sensing images with simple surface characteristics, it is inadequate for handling complex spatiotemporal relationships. The loss of accuracy is particularly evident in images with intricate surface types.
Dual-Branch Fusion Structure: This approach first fuses the reference MODIS and target MODIS features and then integrates the result with the reference Landsat features. As shown in
Table 4, this strategy outperforms the fast fusion structure, improving SSIM by 13.76% and reducing MAE and RMSE by 4.306 K and 4.194 K, respectively. However, it is computationally slower than the fast fusion approach. In contrast to the direct fusion of all three image features in fast fusion, the phased integration of the dual-branch structure enables a more detailed and gradual processing of information from diverse data sources. It thereby enhances the model’s capacity to capture diverse temporal and spatial characteristics, leading to imagery with refined spatial details.
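The difference between the two wiring orders can be made concrete with a small shape-tracing sketch. It tracks only `(channels, height, width)` tuples and performs no real convolution; the block counts, the `width` of 64 channels, the separate Landsat branch, and the assumption that MODIS has been resampled to the Landsat grid are all illustrative choices, not the published DLSTFM configuration.

```python
def concat(*shapes):
    """Concatenate (C, H, W) feature maps along the channel axis."""
    h, w = shapes[0][1], shapes[0][2]
    assert all(tuple(s[1:]) == (h, w) for s in shapes), "spatial sizes must match"
    return (sum(s[0] for s in shapes), h, w)

def conv_block(shape, out_channels):
    """Placeholder for a size-preserving convolutional block (hypothetical)."""
    return (out_channels, shape[1], shape[2])

def fast_fusion(modis_ref, modis_tgt, landsat_ref, width=64):
    """Single branch: all three inputs are stacked before any processing."""
    x = concat(modis_ref, modis_tgt, landsat_ref)  # early channel-wise fusion
    x = conv_block(x, width)
    return conv_block(x, 1)                        # one-band LST prediction

def dual_branch_fusion(modis_ref, modis_tgt, landsat_ref, width=64):
    """Phased fusion: the MODIS pair first, then integration with Landsat."""
    coarse = conv_block(concat(modis_ref, modis_tgt), width)  # stage 1: MODIS pair
    fine = conv_block(landsat_ref, width)                     # Landsat features
    x = conv_block(concat(coarse, fine), width)               # stage 2: integrate
    return conv_block(x, 1)                                   # one-band LST prediction

# Assumed input: single-band LST patches on a common 256 x 256 grid.
patch = (1, 256, 256)
print("fast fusion output :", fast_fusion(patch, patch, patch))
print("dual-branch output :", dual_branch_fusion(patch, patch, patch))
```

Both variants produce an output of the same shape; what differs is where the sources meet. In the dual-branch layout the MODIS pair is processed together before it is combined with the Landsat features, which costs extra blocks and is consistent with the slower runtime noted above.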
As shown in
Figure 8, the output images of DLSTFM with the dual-branch fusion structure are closer to the target images in their details, particularly in the regions delineating urban and agricultural areas. Conversely, the output images of the fast fusion structure lose clarity in these details, compromising crucial information. This may be because the fast fusion structure assigns greater weight to MODIS images during feature fusion.
Additionally, the output images of DLSTFM demonstrate superiority over fast fusion with respect to contrast and brightness. The contrast and brightness of DLSTFM’s output results are more aligned with those of the target images, contributing to a more natural and realistic overall appearance of the image. Conversely, the contrast and brightness of the output from fast fusion are lower, resulting in a lower overall visual quality.
In summary, DLSTFM employs more effective feature extraction and fusion strategies in its network structure design, resulting in superior performance in detail retention, edge clarity, contrast, brightness, and noise suppression. These advantages contribute to the enhanced similarity between the output images of DLSTFM and the target images, thereby providing more valuable data support for subsequent surface temperature analysis and research.
4.3. Supplementary Experiments
To assess the generalization capability of the model, this study tested the trained model, without retraining, on a geographically distinct region, the Ardiethan area. This area is characterized by a mix of forests, agricultural plots, and sporadic water bodies, which differs significantly from the agricultural–urban mosaic of the original Griffith region. Unlike the relatively flat terrain of Griffith, the Ardiethan area exhibits more pronounced topographic variations, providing a more challenging and diverse landscape for testing model robustness.
Figure 9 presents the fusion results of the DLSTFM for the surface temperature image of the Ardiethan region on 3 January 2014. Visually, the model effectively reconstructs the spatial temperature distribution pattern of this area, with clear boundaries between high-temperature and low-temperature zones. It exhibits high spatial consistency with the actual Landsat imagery (
Figure 9), showing no significant blurring or distortion.
To further quantitatively assess the model’s fusion accuracy in this new region, this study calculated three metrics: RMSE, MAE, and SSIM. The results are presented in
Table 5. The DLSTFM achieved strong performance in this region, with an RMSE of 0.5022 K, an MAE of 0.4318 K, and an SSIM of 0.931.
Crucially, the model trained solely on the Griffith dataset was applied directly to the Ardiethan data without any retraining, fine-tuning, or parameter adjustment. This strict zero-shot transfer evaluation is designed to test the model’s inherent generalizability beyond its training distribution.
The results of the supplementary experiment indicate that the model achieves cross-location generalization through spatiotemporal joint learning (rather than purely temporal sequence learning). The quantitative metrics show that the model maintains high predictive performance at the new location, suggesting that it can be deployed without full retraining.
4.4. Computational Efficiency
To assess the practical utility of the proposed method, this study recorded the training and inference time of DLSTFM and compared it with existing approaches. Experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 4080 GPU. The total training time for 50 epochs on the LM-fusion dataset was approximately 1.65 h. During inference, the average time required to fuse a single image for each method is shown in
Table 6.
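The text does not describe how the per-image inference time was measured. One common approach, sketched below under that caveat, is to average wall-clock time over a set of samples after a few warm-up runs. The helper `average_inference_time` and the stand-in model `dummy_fuse` are hypothetical names introduced for illustration.

```python
import time

def average_inference_time(fuse_fn, samples, warmup=2):
    """Mean wall-clock seconds per fused image.

    fuse_fn : callable taking (modis_ref, modis_tgt, landsat_ref)
    samples : list of input triples
    warmup  : number of untimed runs to absorb one-off start-up costs
    """
    for sample in samples[:warmup]:
        fuse_fn(*sample)
    start = time.perf_counter()
    for sample in samples:
        fuse_fn(*sample)
    return (time.perf_counter() - start) / len(samples)

def dummy_fuse(modis_ref, modis_tgt, landsat_ref):
    """Stand-in for a fusion model: adds the coarse temporal change
    (target MODIS minus reference MODIS) to the reference Landsat image."""
    return [
        [fine + (tgt - ref) for fine, tgt, ref in zip(l_row, t_row, r_row)]
        for l_row, t_row, r_row in zip(landsat_ref, modis_tgt, modis_ref)
    ]

# Ten identical toy 64 x 64 samples (made-up values, for illustration only).
size = 64
ref = [[300.0] * size for _ in range(size)]
tgt = [[302.0] * size for _ in range(size)]
fine = [[301.0] * size for _ in range(size)]
samples = [(ref, tgt, fine)] * 10

seconds = average_inference_time(dummy_fuse, samples)
print(f"average time per image: {seconds:.6f} s")
```

When the model runs on a GPU with a framework that launches kernels asynchronously, the device must be synchronised before each clock reading; otherwise the measurement captures only the launch overhead.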
While traditional methods are faster, their fusion accuracy in heterogeneous regions is significantly lower than that of DLSTFM. Among deep-learning approaches, DLSTFM is notably faster than both EDCSTFN and CAFE while achieving superior fusion accuracy. This indicates that DLSTFM strikes a favorable balance between computational efficiency and fusion quality, making it suitable for medium- to large-scale LST fusion tasks that require high precision.
5. Conclusions
The proposed model uses a deep learning-based spatiotemporal fusion approach to predict surface temperature images, with the objective of achieving high spatiotemporal resolution by integrating multi-source remote sensing data, including MODIS and Landsat. The model is built on the LM-fusion dataset, which more accurately represents the complex topographical features characteristic of urban and suburban regions. The DLSTFM employs a dual-branch structure to accommodate both coarse and fine prediction tasks, thereby capturing both subtle and drastic surface changes more effectively. Furthermore, the SAFM and TCM enhance the spatial detail and temperature accuracy of the predicted images. The SAFM strengthens the network’s capacity to model cross-scale features through hierarchical feature decomposition and spatial weight learning. The TCM adjusts the temperature values of the output images based on reference images, thereby improving the accuracy of LST predictions.
This study systematically compares and analyzes the performance of the DLSTFM and four other commonly used models on the LM-fusion dataset. The findings indicate that DLSTFM achieves state-of-the-art performance on the LM-fusion dataset, attaining a mean absolute error (MAE) of 2.1 K and root mean square error (RMSE) of 2.637 K, outperforming both classical (STARFM, STNLFFM) and deep-learning-based (EDCSTFN, CAFE) models. This framework provides a robust solution for generating high-spatiotemporal-resolution LST products, advancing applications in environmental monitoring.
Furthermore, the model achieves a more precise retrieval of surface features, particularly in regions with pronounced temperature variations caused by heterogeneous land surface types. DLSTFM also generalizes well to a different geographical region, as evidenced by its performance on the Ardiethan test area without any retraining. The findings of this study have application potential in areas such as environmental monitoring and sustainable urban development. First, in urban heat island analysis, the high-resolution temperature data generated by DLSTFM can precisely identify temperature differences across surfaces such as buildings, green spaces, and water bodies. This provides a reliable data foundation for quantifying heat island intensity and evaluating the effectiveness of mitigation strategies. Second, in climate research, long-term high-resolution temperature sequences facilitate a deeper understanding of local climate-change patterns and their impacts on urban environments and public health.
Despite DLSTFM’s strong performance, this study has limitations that point to future research directions. First, model performance depends on the quality and quantity of the training data; fusion accuracy may be compromised in regions or periods lacking sufficient reference imagery (e.g., during persistent cloud cover). Second, while the dual-branch structure and the SAFM enhance performance, they also introduce computational overhead, which may pose efficiency challenges for large-scale applications requiring real-time or near-real-time processing. Third, although the Griffith and Ardiethan regions exhibit substantial land-cover heterogeneity, the current evaluation is limited to these two geographic regions. Future work will therefore validate DLSTFM across multiple climatic regions and land-cover regimes to further assess transferability. It will also incorporate multi-source data, exploring surface reflectance, vegetation indices, and digital elevation models as auxiliary inputs, in order to better model the relationship between temperature and surface physical properties and to improve fusion accuracy in complex terrain. Finally, future work will optimize the model’s neural architecture to improve efficiency and generalization, investigating lightweight modeling techniques and cross-region, cross-sensor transfer learning strategies. These efforts aim to reduce dependence on region-specific training data while enhancing the model’s universality and practicality.