Deep-Learning Spatial and Temporal Fusion Model for Land Surface Temperature Based on a Spatially Adaptive Feature and Temperature-Adaptive Correction Module

Jin, Chenhao; Li, Jiasheng; Shen, Yao

doi:10.3390/rs18020238


by Chenhao Jin ¹, Jiasheng Li ² and Yao Shen ¹,*

¹ School of Ecology, Hainan University, Haikou 570228, China
² School of Information and Communication Engineering, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(2), 238; https://doi.org/10.3390/rs18020238

Submission received: 23 November 2025 / Revised: 25 December 2025 / Accepted: 6 January 2026 / Published: 12 January 2026

(This article belongs to the Section Environmental Remote Sensing)


Highlights

What are the main findings?

This study proposes DLSTFM, a novel deep-learning model with a dual-branch structure, SAFM, and TCMs for high-fidelity Land Surface Temperature (LST) fusion.
DLSTFM achieves state-of-the-art performance (MAE = 2.1 K), significantly outperforming existing traditional and deep-learning-based fusion methods.

What are the implications of the main findings?

The model generates high-resolution daily LST data, critical for thermal environment monitoring such as wildfire impact assessment in Australia.
DLSTFM demonstrates strong stability and accuracy within complex land-cover conditions, advancing applications in climate and ecological studies.

Abstract

Land surface temperature (LST) is essential for studying land–atmosphere energy exchange, the impact of climate change, and its influence on crop yields and hydrology. Although satellite remote sensing provides large-scale LST data, existing spatiotemporal fusion methods face challenges. Traditional algorithms have difficulty with heterogeneous surfaces, and deep-learning models often produce blurred details and inaccurate temperatures, which limits their use in high-precision applications. This study addresses these issues by developing a Deep-Learning Spatial and Temporal Fusion Model (DLSTFM) for Landsat-8 and MODIS LST imagery in Griffith, Australia. DLSTFM employs a dual-branch structure: one branch is dedicated to dual-temporal fusion, and the other branch is dedicated to multi-source feature fusion. Key innovations include the Spatial Adaptive Feature Modulation (SAFM) module, which performs adaptive multi-scale feature fusion, and the Temperature Adaptive Correction Module (TCM), which makes pixel-wise adjustments using reference data. Experiments demonstrate that DLSTFM significantly outperforms traditional methods and existing deep-learning fusion methods. DLSTFM achieves clearer surface features and a mean absolute temperature error of approximately 2.1 K. The model also demonstrated excellent generalization performance in another test area (Ardiethan) without retraining, showcasing its substantial practical value for high-accuracy LST fusion.

Keywords:

deep learning; remote sensing; image fusion; land surface temperature; spatially adaptive feature module; temperature adaptive correction module

1. Introduction

LST is one of the most important indicators for monitoring the earth’s resources and studying the surface ecosystem, exerting a significant influence on hydrological, ecological, environmental and biogeochemical studies. A multitude of satellite platforms are responsible for the provision of essential LST products on a global scale. Each of these platforms is accompanied by distinct trade-offs in terms of spatial and temporal resolution. For example, the Moderate Resolution Imaging Spectroradiometer (MODIS) on Terra/Aqua provides near-daily global coverage with a resolution of approximately 1 km. Landsat satellites (e.g., Landsat 8/9 Thermal Infrared Sensor—TIRS) provide higher resolution (~100 m native, often resampled to 30 m) but only revisit every 16 days. The Sentinel-3 Sea and Land Surface Temperature Radiometer (SLSTR) has been found to exhibit analogous spatiotemporal characteristics to those of the MODIS. Geostationary satellites achieve very high temporal resolution (sub-hourly) but at coarser spatial scales (typically 2–5 km).

Given that contemporary satellites cannot simultaneously deliver land surface temperature (LST) imagery with both high spatial and high temporal resolution, researchers typically fuse two complementary types of data: (i) high-spatial-resolution but low-temporal-resolution LST images (e.g., Landsat scenes with 30 m resolution acquired every 16 days), and (ii) low-spatial-resolution but high-temporal-resolution images (e.g., daily MODIS scenes at 1000 m resolution) [1,2,3,4]. Numerous traditional algorithms have been developed for this purpose [5,6,7,8]. Widely used approaches include weighting models, such as the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) proposed by Gao et al. [9], the Enhanced STARFM (ESTARFM) by Zhu et al. [10], and STNLFFM proposed by Cheng et al. [11], as well as unmixing models such as the Spatiotemporal Data Fusion Algorithm (STDFA) introduced by Wu et al. [12] and the Enhanced Spatiotemporal Data Fusion Model (ESTDFM) of Zhang et al. [13]. Hybrid models also exist, among which the Flexible Spatiotemporal Data Fusion (FSDAF) model proposed by Zhu et al. is widely used [14]. These classical algorithms are computationally efficient and robust, and they require minimal ancillary information (e.g., land-cover maps or physical model parameters). They achieve fusion using multisource remote-sensing imagery alone, which lowers the barriers to application and ensures wide applicability [15]. However, over complex heterogeneous surfaces (e.g., regions characterized by mixed pixels), conventional methods such as STARFM often fail to capture local abrupt changes accurately, which reduces fusion accuracy [10,16]. For example, the Root Mean Square Error (RMSE) of weight-based models in forest–farmland ecotones may exceed 2 K [1].

With recent advancements in computer vision and deep-learning theory, many scholars have used deep-learning techniques to perform spatiotemporal fusion of surface temperature products [15,17,18,19]. These methods use deep neural networks to automatically learn features from massive amounts of historical data and to establish nonlinear mapping relationships between high-temporal but low-spatial resolution images and low-temporal but high-spatial resolution images [20]. Deep-learning methods have demonstrated great potential in the spatiotemporal fusion of remote sensing images due to their high accuracy and robustness. Among these, convolutional neural networks (CNNs) have been most extensively employed [21,22,23]. For example, Zhou et al. achieved an 87.21% recognition accuracy in carbon emission monitoring from remote sensing images by employing a multi-model fusion strategy (e.g., fully connected layer fusion) and Gamma correction data augmentation, demonstrating the effectiveness of model fusion in enhancing the robustness and precision of remote-sensing tasks [24]. Based on deep-learning techniques, Tan et al. proposed EDCSTFN, an enhanced deep convolutional model for spatiotemporal image fusion [25]. By integrating a compound loss function and an enhanced data strategy, EDCSTFN improves the prediction accuracy and image quality of spatiotemporal remote sensing image fusion. However, it still has limitations, such as a restricted capability to predict significant ground changes and the need for further exploration of model transferability [26]. Wu et al. constructed CAFE, a cross-attention-based adaptive weighting fusion network for MODIS and Landsat spatiotemporal fusion [26]. By leveraging a cross-attention mechanism and an adaptive temporal difference weighting mechanism, CAFE can effectively capture both subtle and dramatic changes on the land surface, generating more accurate fusion results. However, its complex network structure demands substantial computational resources. These limitations, particularly concerning computational efficiency and the preservation of fine details under complex scenarios, are recognized as common challenges within the broader field of deep learning-based image fusion [27,28]. Liu et al. proposed the Self-Supervised Transformer for Infrared and Visible Image Fusion (StfNet) model, which utilizes dual-stream networks and temporal information to improve image-fusion accuracy [1]. However, its drawbacks include reliance on high-resolution images taken at close intervals and high computational complexity [29]. Extensive experiments have demonstrated the effectiveness of CNNs in multisource remote-sensing image fusion, as their multi-layer architectures can automatically extract hierarchical features and thus better represent image details and structural information.

The specific application of deep learning in Land surface temperature (LST) downscaling has garnered increasing attention [30]. Unlike traditional methods relying on manually constructed spectral indices and traditional statistical models, deep learning models can automatically learn deep, complex nonlinear features directly from raw remote sensing data, thereby more effectively characterizing the mapping relationship between surface temperature and multi-source auxiliary data. For instance, convolutional neural networks (CNNs) and their variants have been successfully applied to LST downscaling tasks. Through end-to-end training, these models simultaneously capture spatial texture details and overall temperature distribution trends, significantly enhancing downscaling accuracy and spatial detail-retention capabilities over heterogeneous surfaces [31]. Improvements in deep-learning approaches primarily draw inspiration from advanced architectures and mechanisms in computer vision. Examples include introducing multi-scale feature-fusion modules to enhance model sensitivity to different object boundaries, or employing attention mechanisms to dynamically weight important spatial and channel features. Furthermore, to address challenges posed by different sensors, temporal phases, and seasonal variations, physically guided correction modules are increasingly integrated with data-driven deep-learning models to enhance model generalization and physical consistency [31,32,33,34]. However, despite the immense potential demonstrated by deep learning, current methods still face common challenges such as insufficient retention of high-frequency details, weak model interpretability, and high dependence on training data volume and quality.

Despite these advancements, two primary issues persist: (1) For complex terrains, current deep-learning models continue to produce fused LST images with insufficiently sharp representation of surface features, textures, and fine details. (2) The mean absolute error (MAE) of fused LST images typically remains around 2.5 K [1].

The aim of this study is to address the critical challenges of high-precision land surface temperature (LST) fusion and spatial detail preservation in heterogeneous surface areas. To this end, a novel deep learning spatiotemporal fusion model, DLSTFM, is proposed with the following key contributions:

A Dual-Branch Fusion Structure: This study designs a specifically tailored architecture that decouples coarse spatiotemporal fusion (MODIS target–reference pairs) from fine spatiotemporal fusion (Landsat reference features). This hierarchical approach enables effective learning of both temporal dynamics and spatial details, significantly improving the quality of predicted images in heterogeneous regions compared to single-branch frameworks.

Spatial Adaptive Feature Modulation (SAFM): SAFM is designed for dynamic multi-scale spatial feature integration. Through hierarchical decomposition and resolution-adaptive operations (e.g., separable convolution for base features, adaptive downsampling for high-level features), SAFM enhances cross-scale representation. Ablation studies confirm its critical role in preserving high-frequency details, with its removal increasing RMSE by 4.294 K and causing visible blurring.

Temperature Adaptive Correction Module (TCM): TCM combines radiometric adjustment principles with deep learning. It applies adaptive calibration using Landsat and MODIS reference data to address the inherent temperature biases caused by sensor differences and acquisition conditions. Removing TCM increased MAE by 3.87 K, demonstrating the synergy of physics-based correction and data-driven learning.

The dual-branch structure, SAFM, and TCM of DLSTFM work in concert: the dual-branch structure provides a clear information-processing flow, SAFM enhances and preserves spatial detail, and TCM applies a physical correction to keep temperature values accurate. This systematic design enables DLSTFM to surpass existing methods such as EDCSTFN and CAFE in both detail clarity and temperature precision when handling heterogeneous surfaces.

2. Materials and Methods

In this paper, to obtain a high-resolution LST image for the target date, a pair of Landsat-8 and MODIS images from a reference date is fused with a single MODIS image acquired on the target date. To this end, this study designs a surface temperature spatiotemporal fusion model called DLSTFM. The core architecture of DLSTFM is composed of an adaptive temperature-correction module and a multi-scale feature-modulation module. As shown in Figure 1, the overall network is split into two main branches—a dual-temporal fusion branch and a multi-source feature-fusion branch—which handle coarse-prediction and fine-prediction tasks, respectively.

The multi-level fusion network includes three key components. The first is the data input layer, which receives MODIS target images, MODIS reference images, and Landsat reference images. The second is the Spatial Adaptive Feature Modulation (SAFM) module, which extracts and fuses features of the input images using a convolutional layer and an attention mechanism. The third is the loss function, a composite of EdgeLoss and PerceptualLoss (EP Loss) that mainly emphasizes the preservation and enhancement of image edge information.

DLSTFM is not a simple stacking of multiple modules, but rather the collaborative operation of Dual-branch structure, SAFM, and TCM. Specifically, the dual-branch structure provides the foundation for feature decoupling, SAFM enhances the representation of spatial details during feature extraction, while TCM further corrects temperature values at the output stage. This collaborative approach enables DLSTFM to maintain clear spatial details in heterogeneous regions while achieving accurate temperature values.

2.1. Spatially Adaptive Feature Module

As shown in Figure 2, the SAFM module is part of Fusion Module for Multimodal Features (FMM), the core innovative component of this study, which aims to realize the adaptive fusion of multi-scale spatial features. The module effectively enhances the network’s ability to model cross-scale features through hierarchical feature decomposition and spatial weight learning. The SAFM contains three key designs:

2.1.1. Multi-Level Feature Decoding

Given an input feature map X ∈ ℝ^{C × H × W}, it is first divided equally along the channel dimension into n sub-features {X_{i}}, i = 1, …, n, with X_{i} ∈ ℝ^{(C/n) × H × W}. This explicit decomposition establishes the basis for the subsequent multi-scale processing.

2.1.2. Hierarchical Spatial Weighting

For each sub-feature X_{i}, an adaptive spatial modulation unit is designed. For the base-resolution feature (i = 1), local context is modeled by directly applying a 3 × 3 separable (depthwise) convolution, DWConv:

W_{1}(X_{1}) = \mathrm{DWConv}_{3 \times 3}(X_{1})

(1)

For the higher-level features (i > 1), an adaptive downsampling mechanism is introduced:

W_{i}(X_{i}) = U(\mathrm{DWConv}_{3 \times 3}(P_{i}(X_{i})))

(2)

where P_{i}(\cdot) is an adaptive max-pooling operation whose output size is scaled by 2^{-i}, and U(\cdot) denotes nearest-neighbor upsampling back to the input resolution.

2.1.3. Feature-Aggregation Mechanisms

Cross-level features are fused via channel concatenation (denoted ∥) followed by a linear projection:

X_{out} = \mathrm{GELU}(\mathrm{Conv}_{1 \times 1}(∥_{i=1}^{n} W_{i}(X_{i})) ⊙ X_{in})

(3)

where ⊙ denotes the Hadamard product, and the design realizes feature modulation through the gating mechanism. Experiments show that the SAFM module significantly improves the network’s discriminative representation of spatial features through parameter sharing (grouped convolution) and multi-scale co-optimization, while maintaining computational efficiency. Ablation experiments verify that the optimal performance balance is reached when the number of layers n = 4.
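The data flow of Eqs. (1)–(3) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`depthwise_conv3x3`, `max_pool`, `upsample_nearest`, `safm`), the random weights, and the 8 × 32 × 32 input are hypothetical. Integer-factor max pooling stands in for adaptive max pooling, so H and W are assumed to be divisible by 2^n, and GELU is evaluated with its tanh approximation.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Zero-padded 3x3 depthwise convolution. x: (C, H, W); kernels: (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered only with its own kernel (depthwise).
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def max_pool(x, factor):
    """Max pooling by an integer factor (assumes H and W are divisible by it)."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).max(axis=(2, 4))

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def safm(x, dw_kernels, proj, n=4):
    """Hypothetical SAFM forward pass following Eqs. (1)-(3).

    x: (C, H, W); dw_kernels: (C, 3, 3); proj: (C, C) weights of the 1x1 convolution.
    """
    branches = []
    splits = zip(np.split(x, n), np.split(dw_kernels, n))
    for i, (x_i, k_i) in enumerate(splits, start=1):
        if i == 1:
            branches.append(depthwise_conv3x3(x_i, k_i))  # Eq. (1)
        else:
            factor = 2 ** i  # output size scaled by 2^-i
            pooled = max_pool(x_i, factor)
            branches.append(upsample_nearest(depthwise_conv3x3(pooled, k_i), factor))  # Eq. (2)
    # Channel concatenation followed by a 1x1 convolution (a per-pixel linear map).
    fused = np.einsum("oc,chw->ohw", proj, np.concatenate(branches))
    return gelu(fused * x)  # gating by the input, Eq. (3) as printed

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32, 32))
out = safm(x, rng.normal(size=(8, 3, 3)), rng.normal(size=(8, 8)))
print(out.shape)  # (8, 32, 32)
```

Because every branch is upsampled back to H × W before concatenation, the output keeps the input shape, which is what allows the element-wise gating with X_{in}.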

2.2. Temperature Adaptive Correction Module

When fusing imagery from different satellite sensors, discrepancies in acquisition times and sensor characteristics create temperature inconsistencies between Landsat-8 and MODIS LST images collected on the same date, which in turn degrade the accuracy of the fused product. To mitigate this issue, this paper incorporates a temperature adaptive correction module (TCM) into DLSTFM (Figure 3). This module refines the fused LST by referencing both Landsat and MODIS observations.

During the prediction phase, TCM employs only MODIS imagery from the target date, reference date Landsat imagery, and reference date MODIS imagery for temperature calibration. It does not utilize actual Landsat imagery from the target date, thereby eliminating any potential data-leakage issues.

2.3. Loss Function

In order to enhance the spatial detail fidelity and semantic consistency of the fusion results, this study proposes a composite loss function (Edge-Perceptual Loss, EP Loss) that fuses multi-scale edge constraints and depth-perceptual features [35]. This loss function adaptively balances different supervised signals through a dynamic weight adjustment mechanism, and its core components are as follows:

2.3.1. Multi-Scale Laplacian Edge Loss

The edge difference between the predicted image and the real image at multiple scales is computed by an improved Laplacian pyramid. Specifically, a multi-scale Gaussian kernel is used to smooth and filter the input image and extract the residuals as edge features. The loss function expression is given as:

L_{edge} = \sum_{s \in S} ∥ Δ_{s} (x) - Δ_{s} (y) ∥_{Charbonnier}

(4)

where Δ_{s}(\cdot) denotes the Laplacian edge-extraction operation with scale factor s, S = {1, 0.5} is the multi-scale set, and ∥\cdot∥_{Charbonnier} is the Charbonnier norm with a smoothing term.
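A minimal NumPy sketch of Eq. (4) is given below under several assumptions that the text does not specify: the edge feature is taken as the residual between an image and its Gaussian-blurred version, the 0.5 scale is produced by 2 × 2 average pooling, the Charbonnier penalty sqrt(d² + ε²) is averaged over pixels, and ε = 10⁻³. All function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur with edge-replicating padding. img: (H, W)."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-offsets ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    rows = sum(k * padded[:, j:j + w] for j, k in enumerate(kernel))  # filter along x
    return sum(k * rows[i:i + h, :] for i, k in enumerate(kernel))    # filter along y

def laplacian_residual(img):
    """Edge feature: what remains after removing the smoothed image."""
    return img - gaussian_blur(img)

def downsample2(img):
    """Halve the resolution by 2x2 average pooling (odd borders are cropped)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def charbonnier(diff, eps=1e-3):
    """Smooth L1-like penalty sqrt(d^2 + eps^2), averaged over pixels."""
    return float(np.mean(np.sqrt(diff ** 2 + eps ** 2)))

def edge_loss(pred, target, scales=(1.0, 0.5)):
    """Sum of Charbonnier penalties on edge residuals at each scale in S."""
    total = 0.0
    for s in scales:
        p, t = (pred, target) if s == 1.0 else (downsample2(pred), downsample2(target))
        total += charbonnier(laplacian_residual(p) - laplacian_residual(t))
    return total

flat = np.full((16, 16), 300.0)   # uniform 300 K scene
step = flat.copy()
step[:, 8:] += 10.0               # scene with a sharp 10 K boundary
print(edge_loss(flat, step) > edge_loss(step, step))  # True
```

A prediction that smooths away the boundary is penalized, whereas a uniform temperature offset would leave the residuals, and hence this term, unchanged.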

2.3.2. Hierarchical Perceptual Loss

Based on the pre-trained VGG16 network, the shallow convolutional features of the images are extracted, and the consistency between the predicted image and the real image in feature space is constrained by the mean squared error:

L_{perc} = \sum_{l \in L} ω_{l} ∥ ϕ_{l} (x) - ϕ_{l} (y) ∥_{2}

(5)

where ϕ_{l}(\cdot) denotes the feature map of the l-th layer of VGG16, L is the set of selected feature layers, and ω_{l} is the weight coefficient of layer l.

2.3.3. Adaptive Weighting Mechanism

In order to avoid setting fixed weights manually, EP Loss dynamically adjusts the weights according to the ratio of the two types of loss values in the current training iteration:

ω_{edge} = \frac{L_{edge}}{L_{edge} + L_{perc}}, ω_{perc} = \frac{L_{perc}}{L_{edge} + L_{perc}}

(6)

The final loss function is:

L_{EP} = ω_{edge} \cdot L_{edge} + ω_{perc} \cdot L_{perc}

(7)

Through adaptive tuning, the model is able to adapt more flexibly to the learning needs of edge details and full structures in different training phases.
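As a small worked example of this weighting scheme, the sketch below normalizes each weight by the sum of the two loss values, so that the weights add up to 1, and then forms the weighted total. The function name `ep_loss` and the sample loss values are hypothetical. Whether the weights are treated as constants during backpropagation is not stated in the text and is not modeled here.

```python
import math

def ep_loss(l_edge, l_perc):
    """Return (total, w_edge, w_perc) for two non-negative loss values."""
    denom = l_edge + l_perc
    if denom == 0.0:
        # Both losses vanish: the weights are arbitrary and the total is zero.
        return 0.0, 0.5, 0.5
    w_edge = l_edge / denom
    w_perc = l_perc / denom
    return w_edge * l_edge + w_perc * l_perc, w_edge, w_perc

total, w_edge, w_perc = ep_loss(0.3, 0.1)
print(round(w_edge, 2), round(w_perc, 2), round(total, 3))  # 0.75 0.25 0.25
```

With these weights the total equals (L_{edge}² + L_{perc}²) / (L_{edge} + L_{perc}), so whichever term is currently larger dominates the objective.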

2.4. Dataset Description and Relevant Settings for Training

2.4.1. Experimental Dataset

The primary study area is the Griffith region of Australia, which has a temperate Mediterranean climate with distinct seasonal temperature variations. The land surface is highly heterogeneous, comprising cropland, forest, water bodies, grassland, and urban built-up areas. These cover types are spatially interwoven, creating complex landscape patterns and numerous sharp temperature transition zones, which makes the region a demanding testbed for validating spatiotemporal fusion models under real-world conditions. This study used thermal-infrared land surface temperature data from Landsat-8 TIRS and MODIS Terra. The resulting dataset, designated LM-fusion, was constructed specifically for this research; each image measures 1238 pixels in length and 1121 pixels in width. All comparative methods use the same LM-fusion dataset and are evaluated under identical training/testing splits. As shown in Table 1, the dataset consists of 27 pairs of Landsat-8 and MODIS land surface temperature images (54 images in total), divided into a training set of 21 pairs and a testing set of 6 pairs. Its land-cover characteristics are more complex than those of other datasets, which provides a rich array of scenarios and challenges for model training and testing. To exploit this diversity and strengthen the model’s ability to learn local patterns, the training images were segmented into 40 × 40 pixel patches. This strategy substantially increases the effective number and diversity of training samples, which helps the model generalize and mitigates the risk of overfitting. Furthermore, the prevalence of these land-cover types in urban areas makes the dataset representative. Prior to analysis, all images underwent radiometric correction, atmospheric correction, rotation, and cropping to ensure the integrity and consistency of the data.
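To give a sense of how many samples the patching step yields, the sketch below enumerates patch origins for one 1238 × 1121 image. The stride is not reported in the text; a non-overlapping stride of 40 pixels is assumed here, and the function name `patch_origins` is hypothetical. A smaller stride would produce more, overlapping patches.

```python
def patch_origins(height, width, patch=40, stride=40):
    """Top-left corners of all full patches that fit inside the image."""
    return [
        (row, col)
        for row in range(0, height - patch + 1, stride)
        for col in range(0, width - patch + 1, stride)
    ]

origins = patch_origins(1238, 1121)
print(len(origins))       # 840 patches per image (30 x 28)
print(origins[-1])        # (1160, 1080)
print(21 * len(origins))  # 17640 patch positions over the 21 training pairs
```

Under this assumption the 21 training pairs expand to 17,640 co-located patch positions, which illustrates why patching increases the effective number of training samples.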

2.4.2. Relevant Settings for Training

The land surface temperature fusion model is trained with the Adam optimizer. Training uses image patches of 40 × 40 pixels, grouped into batches of 32. The learning rate was tuned and ultimately set to 3.19 × 10⁻⁴ to achieve rapid convergence. Training runs for a total of 50 iterations, with batch normalization applied throughout to accelerate training and enhance model performance.

3. Results

To validate the effectiveness of the DLSTFM, it was compared with four existing LST fusion algorithms: STARFM, STNLFFM, EDCSTFN, and CAFE. Among these, STARFM and STNLFFM are filter-based models, while EDCSTFN and CAFE are deep-learning-based models.

Three quantitative metrics—Structural Similarity Index (SSIM), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE)—were employed to evaluate the agreement between the fused outputs and the actual target LST images. Lower RMSE/MAE scores and higher SSIM scores indicate better fusion performance.

3.1. Qualitative Analysis

The experimental results demonstrate that the DLSTFM produces LST images with clear details and strong contrast. As shown in Figure 4, the model effectively identifies hotspots (areas of high temperature) and cold spots (areas of low temperature), with their spatial locations and temperature ranges closely matching those in the ground-truth images. Notably, the model accurately captures the sharp temperature boundaries between urban heat islands and surrounding farmland. The dual-branch structure (dual-temporal fusion branch and multi-source feature fusion branch) and multi-layer convolutional structure of DLSTFM enhance its ability to extract complex features, preserving fine texture details during fusion. Additionally, the temperature-adaptive correction module further improves the model’s detail retention and pixel-wise accuracy, resulting in sharper output images.

STARFM tends to generate images with lower temperatures than the ground truth, and the contrast between low- and high-temperature regions is discernible but subtle. In heterogeneous regions, such as urban areas or fragmented farmland, the model struggles to capture small-scale temperature variations, so its predictions lack clarity and precision. Moreover, the spectral similarity screening in STARFM may introduce errors in complex terrain [9], particularly in regions with significant temperature gradients, where uneven weight allocation reduces the contrast between high and low temperatures.

STNLFFM output images performed well in terms of overall image clarity and detail texture, similar to STARFM. Compared to deep learning-based models, STNLFFM offers higher computational efficiency, enabling faster fusion. However, the model’s predictions for high-temperature pixels demonstrate a degree of inaccuracy, manifesting as a slight darkening of these regions. This phenomenon may be attributed to an inadequate consideration of high-temperature characteristics during the weight calculation process.

The EDCSTFN model demonstrates comparatively diminished overall clarity, accompanied by a certain degree of loss of texture details and deficiencies in image contrast and sharpness. An analysis of the model’s code suggests that this may stem from detail loss during feature extraction and fusion, simplistic feature-fusion methods, and suboptimal upsampling techniques. The model’s straightforward feature-fusion approach, which involves the incorporation of encoded Landsat features into residual features, may not fully leverage the complementary information present among different input features. This potential defect may result in a decrease in the overall contrast of the output image and blurring of fine details and textures.

Moreover, the choice of upsampling method substantially influences the quality of the resulting image. Although bilinear interpolation, as implemented in EDCSTFN, is computationally efficient, it recovers high-frequency details poorly, which introduces a degree of blurriness in the output.

The CAFE model’s fusion images show significant high–low temperature contrast, with clear spatial differentiation between high-temperature centers and low-temperature areas. The macro-distribution patterns of the low-temperature and high-temperature areas are highly consistent with the target images, which verifies the model’s effectiveness in spatial pattern reconstruction. However, in high-temperature areas, a small number of pixels remain unfused, leading to missing values. This issue may arise from data interpolation strategies or flaws in the fusion rules. Similar to ESTARFM, which relies on linear spectral unmixing for mixed pixels [12], CAFE may fail to extrapolate information if high-temperature changes are inadequately represented in the reference images. Additionally, without dynamic weight adjustment (e.g., inverse distance weighting in EDCSTFN), extreme-temperature pixels may be “diluted” by neighboring low-temperature values, resulting in localized gaps.

The visual comparison above reflects the performance differences among models through the overall image. To further validate DLSTFM’s advantage in detail retention at the pixel scale, this study conducted a temperature-profile analysis. Specifically, a horizontal profile line extending from the left boundary to the image center was extracted from the test image, plotting the temperature-variation curve with pixel position (as shown in Figure 5). Analysis indicates that the temperature profile predicted by DLSTFM (red solid line) exhibits the highest agreement with actual surface temperatures (black solid line). Across multiple regions of abrupt temperature changes, DLSTFM accurately captures and reproduces the steep temperature gradients, with distinct and sharp transitions in the curve. In contrast, the traditional method STARFM (blue dashed line) produced a prediction curve broadly similar to the actual surface temperature, though its predicted values were slightly lower than the true values in high-temperature regions. The deep learning method EDCSTFN (purple dotted line) exhibited an overly smoothed prediction curve, coupled with significant systematic bias in high-temperature areas, where its predicted values deviated markedly from the actual values. Another deep-learning method, CAFE (green dotted line), performs better than EDCSTFN but still underestimates high-temperature peaks and overestimates low-temperature troughs. STNLFFM (orange dotted line) performed between STARFM and the deep-learning methods. This pixel-level profile analysis visually confirms that DLSTFM, leveraging its dual-branch architecture and Spatially Adaptive Feature Modulation (SAFM) module, more effectively learns and preserves high-frequency spatial details and accurate temperature values in complex heterogeneous surface areas. It overcomes the shortcomings of existing methods, such as blurred details or unstable predictions.

3.2. Quantitative Analysis

Evaluation indicators include:

3.2.1. Structure Similarity Index Measure (SSIM)

The Structure Similarity Index (SSIM) assesses the similarity between two images by comparing their luminance, contrast, and structure, thereby evaluating image quality [36,37]. SSIM values range from −1 to 1, with higher values indicating better image quality. A value of 1 indicates identical images, while values approaching −1 indicate strongly anti-correlated structure.
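For illustration, the sketch below evaluates the standard SSIM expression from whole-image statistics. This is a simplification: SSIM is normally computed in local sliding windows and then averaged, and the text does not state which window settings were used. The function name `global_ssim`, the sample arrays, and the choice of `data_range` are hypothetical; K1 = 0.01 and K2 = 0.03 are the commonly used stabilizing constants.

```python
import numpy as np

def global_ssim(x, y, data_range, k1=0.01, k2=0.03):
    """SSIM from global means, variances, and covariance of two images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
truth = 300.0 + 5.0 * rng.normal(size=(64, 64))     # synthetic LST field in K
noisy = truth + 2.0 * rng.normal(size=truth.shape)  # prediction with 2 K noise

print(global_ssim(truth, truth, data_range=50.0))          # 1.0 for identical images
print(global_ssim(truth, noisy, data_range=50.0) < 1.0)    # True
```

Negative values occur only when the two images are anti-correlated, for example `global_ssim([-1, 1], [1, -1], data_range=2.0)`.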

This study selected imagery from six representative dates between January 2019 and February 2020 to serve as a validation set for the fusion experiments. Figure 6 presents the quantitative results on the LM-fusion dataset; comparing the model’s outputs with the reference imagery verifies the model’s effectiveness. As shown in Table 2, DLSTFM had an average SSIM similar to those of STARFM and STNLFFM, with values of 0.86, 0.861, and 0.859, respectively. DLSTFM improved the average SSIM by 22.9% and 7.8% compared to the deep-learning models EDCSTFN and CAFE, respectively, demonstrating the effectiveness of its dual-branch fusion network structure.

3.2.2. Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is a common metric for assessing the performance of spatiotemporal fusion models, calculated as the average of the absolute differences between predicted and actual values [38]. Mathematically, MAE is defined as:

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

(8)

where y_{i} is the true value of the i-th sample, \hat{y}_{i} is the corresponding predicted value, and n is the total number of samples.

In this experiment, MAE was also a key indicator for evaluating the accuracy of land surface temperature imagery. The results, shown in Figure 6, indicate that DLSTFM exhibited significant accuracy advantages across all six test datasets, with an average MAE of 2.069 K. From Figure 6, it can be seen that the average MAE of the traditional algorithms STARFM and STNLFFM were 2.874 K and 2.879 K, respectively, suggesting similar accuracy in fused land surface temperature values. Deep-learning models had higher MAE, with EDCSTFN at 6.026 K and CAFE at 6.124 K, indicating poorer adaptability to complex land surface temperature changes. DLSTFM consistently maintained the lowest error, demonstrating superior stability and accuracy. These results confirm the accuracy of DLSTFM in fusing Landsat and MODIS imagery and predicting land surface temperature values.
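The average MAE values quoted above can be turned into relative error reductions with simple arithmetic. The sketch below uses only the figures reported in this section; the helper name `relative_reduction` is hypothetical, and the percentages are derived here rather than taken from the text.

```python
# Average MAE values (K) as reported in this section.
reported_mae = {
    "DLSTFM": 2.069,
    "STARFM": 2.874,
    "STNLFFM": 2.879,
    "EDCSTFN": 6.026,
    "CAFE": 6.124,
}

def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

reductions = {
    name: round(relative_reduction(mae, reported_mae["DLSTFM"]), 1)
    for name, mae in reported_mae.items()
    if name != "DLSTFM"
}
print(reductions)
# {'STARFM': 28.0, 'STNLFFM': 28.1, 'EDCSTFN': 65.7, 'CAFE': 66.2}
```

On these figures, the MAE of DLSTFM is roughly 28% lower than that of the two traditional algorithms and roughly 66% lower than that of the two deep-learning baselines.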

3.2.3. Root Mean Square Error (RMSE)

The Root Mean Square Error (RMSE) is a widely used metric to assess the discrepancy between a model’s predictions and actual values. It is calculated as the square root of the average of the squared prediction errors [39]. The formula is:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}} (9)

where y_{i} is the true observed value of the i-th sample, {\hat{y}}_{i} is the predicted value of the i-th sample, and n is the number of samples.
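Equation (9) can be sketched in the same way. The function name `root_mean_square_error` and the toy values are assumptions made for illustration only.

```python
import math


def root_mean_square_error(y_true, y_pred):
    """Square root of the mean squared difference between the two inputs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


# Toy pixel temperatures in kelvin (illustrative values only).
reference = [300.0, 305.0, 310.0]
predicted = [302.0, 304.0, 307.0]
print(round(root_mean_square_error(reference, predicted), 3))  # 2.16
```

For these errors of 2 K, 1 K, and 3 K the RMSE is sqrt(14/3), about 2.16 K, slightly above the mean absolute error of 2 K. Because the errors are squared before averaging, RMSE is never smaller than MAE and weights large deviations more heavily.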

As depicted in Figure 6, the average RMSE of DLSTFM across the six test datasets was 2.637 K, lower than that of the traditional methods STARFM (3.408 K) and STNLFFM (3.445 K; Table 2), and 61.4% and 60.1% lower than that of the deep-learning models EDCSTFN and CAFE, respectively. This result illustrates that DLSTFM, through its dual-branch fusion network and temperature adaptive correction module, addresses the insufficient feature capture of traditional algorithms in complex surface areas and overcomes the temperature prediction bias caused by detail loss in existing deep-learning models. Specifically, the average RMSE of DLSTFM on the validation set was only 2.023 K, representing improvements of 10.7% and 49.0% compared to STARFM (2.264 K) and CAFE (3.962 K), respectively, verifying the prediction ability of the Multi-scale Feature Modulation Module (FMM) on abrupt local temperature changes. Additionally, the RMSE standard deviation of DLSTFM is 0.391 K, lower than that of the other models, indicating stronger stability and generalization across multiple scenarios. Overall, the RMSE advantage confirms the practicality of DLSTFM in improving the accuracy of spatiotemporal fusion of land surface temperature: its output is highly consistent with the reference values, providing an effective technical solution for the continuous reconstruction of land surface temperature in space and time.
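The relative differences quoted in this section for the deep-learning baselines can be reproduced from the averages in Table 2. The sketch below shows the arithmetic; the helper name `percent_change` and the dictionary layout are assumptions, while the numbers are copied from Table 2.

```python
def percent_change(value, baseline):
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Average scores from Table 2 (RMSE in K, SSIM unitless).
table2 = {
    "EDCSTFN": {"rmse": 6.838, "ssim": 0.700},
    "CAFE": {"rmse": 6.608, "ssim": 0.798},
    "DLSTFM": {"rmse": 2.637, "ssim": 0.860},
}

ours = table2["DLSTFM"]
for name in ("EDCSTFN", "CAFE"):
    base = table2[name]
    ssim_gain = percent_change(ours["ssim"], base["ssim"])
    rmse_drop = -percent_change(ours["rmse"], base["rmse"])
    print(f"{name}: SSIM +{ssim_gain:.1f}%, RMSE -{rmse_drop:.1f}%")
```

This yields SSIM gains of 22.9% and 7.8% and RMSE reductions of 61.4% and 60.1% relative to EDCSTFN and CAFE, respectively, in agreement with the figures stated in the text.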

4. Discussion

To validate the effectiveness of the DLSTFM network structure and feature modules, this study conducted ablation experiments on the LM-fusion dataset in which the network structure was modified and individual modules were removed.

4.1. Ablation Experiment on Feature Modules

The ablation experiments demonstrate that incorporating the FMM into the dual-branch fusion structure significantly enhances performance on the LM-fusion dataset, improving both the spatial detail of the generated imagery and the accuracy of the surface temperature estimates. The outcomes were evaluated using the SSIM, MAE, and RMSE metrics. As shown in Table 3, removing either the FMM or the TCM degraded the test metrics, thereby validating the effectiveness of both modules.

The FMM ablation indicates that this module plays a critical role in maintaining spatial detail and temperature accuracy. As shown in Table 3, removing the FMM reduced SSIM by 12.56% and introduced spatial blurring in the generated images, suggesting impaired extraction of high-frequency features. Temperature prediction errors also increased markedly: overall image temperatures were lower than in the target images, and the root mean square error (RMSE) and mean absolute error (MAE) increased by 4.294 K and 4.072 K, respectively. This underscores the role of the FMM in precisely reconstructing the surface temperature field and in capturing subtle temperature fluctuations, particularly in regions with substantial temperature gradients. The visual quality of the fused images was also evaluated. Figure 7 shows that, in low-temperature regions and some high-temperature regions, removing the FMM produced significant brightness differences between the generated and target images over large areas, further substantiating the function of the FMM in regulating pixel values and maintaining temperature consistency.

The ablation of the TCM revealed a different impact pattern: although SSIM decreased by only 0.34%, with the visual effect appearing primarily as brightness differences, temperature errors still increased significantly, with RMSE and MAE rising by 3.671 K and 3.873 K, respectively. This finding suggests that removing the TCM reduces the accuracy of the temperature values predicted by the model.

It is noteworthy that removing either module resulted in temperature errors exceeding 5 K, but the error increase was more pronounced when the FMM was removed. This finding indicates that, in the joint optimization of space and temperature, the FMM plays a more fundamental role in feature expression, while the TCM focuses on deep correction of temperature features.

In summary, removing either of the two key modules, FMM and TCM, degraded a range of test indicators, suggesting that both play a crucial role in maintaining the spatial details, temperature accuracy, and overall visual quality of the fused images.

4.2. Ablation Experiment on Network Structure

The network architecture of DLSTFM was subjected to an experimental study that focused on two structures: fast fusion and dual-branch fusion.

Fast Fusion Structure: This approach directly concatenates features from the reference MODIS, target MODIS, and reference Landsat imagery. It yields lower accuracy than the original model, primarily because its single-branch configuration integrates the three inputs at an early stage along the channel dimension. While this expedites the fusion process and reduces computational demands, making it suitable for remote sensing images with simple surface characteristics, it is inadequate for handling complex spatiotemporal relationships. This is particularly evident in images with intricate surface types, where the fused images show a noticeable reduction in accuracy.

Dual-Branch Fusion Structure: This approach first fuses the reference MODIS and target MODIS features and then integrates the result with the reference Landsat features. As shown in Table 4, this strategy outperforms the fast fusion structure, with SSIM higher by 13.76% and MAE and RMSE lower by 4.306 K and 4.194 K, respectively. However, it is computationally slower than the fast fusion approach. In contrast to the direct fusion of all three image features in fast fusion, the phased integration of dual-branch fusion enables more detailed and gradual processing of information from the different data sources. The dual-branch structure thereby enhances the model’s capacity to capture diverse temporal and spatial characteristics and produces imagery with refined spatial details.
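The difference in data flow between the two structures can be sketched with a toy example. This is only an illustration of the order of operations, not the DLSTFM network: the functions `concat`, `mix`, `fast_fusion`, and `dual_branch_fusion` are hypothetical, and `mix` (a per-pixel average over channels) merely stands in for the learned convolutional blocks.

```python
def concat(*feature_maps):
    """Channel-wise concatenation; each map is a list of channels (pixel lists)."""
    return [channel for fmap in feature_maps for channel in fmap]


def mix(channels):
    """Hypothetical stand-in for a learned block: per-pixel mean over channels."""
    n = len(channels)
    return [[sum(pixel) / n for pixel in zip(*channels)]]


def fast_fusion(modis_ref, modis_tgt, landsat_ref):
    # Single branch: all three inputs are stacked before any processing.
    return mix(concat(modis_ref, modis_tgt, landsat_ref))


def dual_branch_fusion(modis_ref, modis_tgt, landsat_ref):
    # Stage 1: fuse the two MODIS dates (temporal information).
    temporal = mix(concat(modis_ref, modis_tgt))
    # Stage 2: integrate the result with the fine-resolution Landsat features.
    return mix(concat(temporal, landsat_ref))


# One-channel toy maps with two pixels each (kelvin, illustrative values).
modis_ref = [[300.0, 302.0]]
modis_tgt = [[304.0, 306.0]]
landsat_ref = [[298.0, 310.0]]
print(fast_fusion(modis_ref, modis_tgt, landsat_ref))
print(dual_branch_fusion(modis_ref, modis_tgt, landsat_ref))  # [[300.0, 307.0]]
```

With this averaging stand-in, early concatenation gives each input a weight of 1/3, whereas the staged version gives the Landsat features a weight of 1/2 and each MODIS input 1/4. That weighting is a property of the toy operator, not a measured property of DLSTFM, whose weights are learned.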

The experimental results in Figure 8 show that the output images of DLSTFM with the dual-branch fusion structure are more similar to the target image in their details, particularly in the regions delineating urban and agricultural areas. Conversely, the output images of the fast fusion structure lose clarity in these details, compromising crucial information. This may be because the fast fusion structure assigns greater weight to the MODIS images during feature fusion.

Additionally, the output images of DLSTFM demonstrate superiority over fast fusion with respect to contrast and brightness. The contrast and brightness of DLSTFM’s output results are more aligned with those of the target images, contributing to a more natural and realistic overall appearance of the image. Conversely, the contrast and brightness of the output from fast fusion are lower, resulting in a lower overall visual quality.

In summary, DLSTFM employs more effective feature extraction and fusion strategies in its network structure design, resulting in superior performance in detail retention, edge clarity, contrast, brightness, and noise suppression. These advantages contribute to the enhanced similarity between the output images of DLSTFM and the target images, thereby providing more valuable data support for subsequent surface temperature analysis and research.

4.3. Supplementary Experiments

To demonstrate the generalization capability of the model, a supplementary experiment tested the trained model on a geographically different region, Ardiethan, without retraining. This area is characterized by a mix of forests, agricultural plots, and sporadic water bodies, which differs significantly from the agricultural–urban mosaic of the original Griffith region. Unlike the relatively flat terrain of Griffith, the Ardiethan area exhibits more pronounced topographic variations, providing a more challenging and diverse landscape for testing model robustness.

Figure 9 presents the fusion results of DLSTFM for the surface temperature image of the Ardiethan region on 3 January 2014. Visually, the model effectively reconstructs the spatial temperature distribution pattern of this area, with clear boundaries between high-temperature and low-temperature zones. The result exhibits high spatial consistency with the actual Landsat imagery (Figure 9b), showing no significant blurring or distortion.

To further quantitatively assess the model’s fusion accuracy in this new region, this study calculated three metrics: RMSE, MAE, and SSIM. The results are presented in Table 5. The DLSTFM achieved an outstanding performance in this region with an RMSE of 0.5022 K, an MAE of 0.4318 K, and a high SSIM of 0.931.

Crucially, the model trained solely on the Griffith dataset was applied directly to the Ardiethan data without any retraining, fine-tuning, or parameter adjustment. This strict zero-shot transfer evaluation is designed to test the model’s inherent generalizability beyond its training distribution.

The results of supplementary experiments demonstrate that the model achieves cross-location generalization capabilities through spatiotemporal joint learning (rather than purely temporal sequence learning). Quantitative metrics indicate that the model maintains high predictive performance at new locations, enabling deployment without requiring full retraining.

4.4. Computational Efficiency

To assess the practical utility of the proposed method, this study recorded the training and inference time of DLSTFM and compared it with existing approaches. Experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 4080 GPU. The total training time for 50 epochs on the LM-fusion dataset was approximately 1.65 h. During inference, the average time required to fuse a single image for each method is shown in Table 6.

While traditional methods are faster, their fusion accuracy in heterogeneous regions is significantly lower than that of DLSTFM. Among deep-learning approaches, DLSTFM is notably faster than both EDCSTFN and CAFE while achieving superior fusion accuracy. This indicates that DLSTFM strikes a favorable balance between computational efficiency and fusion quality, making it suitable for medium- to large-scale LST fusion tasks that require high precision.
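A per-image inference time of the kind reported in Table 6 could be measured with a small harness such as the one below. This is a minimal sketch under stated assumptions, not the timing procedure used in this study: `mean_inference_seconds` and the `fuse` callable are hypothetical, and for GPU models the device would additionally need to be synchronised before each clock reading.

```python
import time


def mean_inference_seconds(fuse, inputs, warmup=1):
    """Average wall-clock seconds per call of `fuse` over `inputs`.

    `fuse` is any callable that fuses one sample; the first `warmup`
    samples are run once beforehand and excluded from the timing.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    for sample in inputs[:warmup]:
        fuse(sample)  # warm-up calls (caches, lazy initialisation)
    start = time.perf_counter()
    for sample in inputs:
        fuse(sample)
    return (time.perf_counter() - start) / len(inputs)


# Dummy workload standing in for a fusion model (illustrative only).
samples = [list(range(1000)) for _ in range(5)]
print(mean_inference_seconds(lambda s: sum(v * v for v in s), samples))
```

Averaging over several images and discarding warm-up calls reduces the influence of one-off start-up costs on the reported figure.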

5. Conclusions

The proposed model uses a deep-learning-based spatiotemporal fusion approach to predict land surface temperature imagery at high spatiotemporal resolution by integrating multi-source remote sensing data, namely MODIS and Landsat. The model is trained and evaluated on the LM-fusion dataset, which more accurately represents the complex topographical features characteristic of urban and suburban regions. DLSTFM employs a dual-branch structure to accommodate both coarse and fine prediction tasks, thereby capturing subtle and drastic surface changes more effectively. Furthermore, the SAFM and TCM enhance the spatial detail and temperature accuracy of the predicted images. SAFM strengthens the network’s capacity to model cross-scale features through hierarchical feature decomposition and spatial weight learning. TCM adapts the temperature values of the output images based on reference images, thereby improving the accuracy of LST predictions.

This study systematically compares and analyzes the performance of the DLSTFM and four other commonly used models on the LM-fusion dataset. The findings indicate that DLSTFM achieves state-of-the-art performance on the LM-fusion dataset, attaining a mean absolute error (MAE) of 2.1 K and root mean square error (RMSE) of 2.637 K, outperforming both classical (STARFM, STNLFFM) and deep-learning-based (EDCSTFN, CAFE) models. This framework provides a robust solution for generating high-spatiotemporal-resolution LST products, advancing applications in environmental monitoring.

Furthermore, the model attains more precise inversion of surface features, particularly in regions exhibiting pronounced temperature variations due to heterogeneous land surface types. DLSTFM demonstrates strong generalization capability across different geographical regions, as evidenced by its excellent performance on the Ardiethan test area without any retraining, indicating its effectiveness across different scenarios. In the future, the findings of this study are expected to demonstrate significant application potential in areas such as environmental monitoring and urban sustainable development. Specifically: First, in urban heat island effect analysis, the high-resolution temperature data generated by DLSTFM can precisely identify temperature differences across various surfaces like buildings, green spaces, and water bodies. This provides a reliable data foundation for quantifying heat island intensity and evaluating mitigation strategy effectiveness. Second, in climate research, long-term high-resolution temperature data sequences facilitate deeper understanding of local climate-change patterns and their impacts on urban environments and public health.

Despite DLSTFM’s strong performance, this study has limitations that point to future research directions. First, model performance depends on training data quality and quantity; fusion accuracy may be compromised in regions or periods lacking sufficient reference imagery (e.g., during persistent cloud cover). Second, while the model’s dual-branch structure and SAFM module enhance performance, they also introduce computational overhead, potentially posing efficiency challenges for large-scale applications requiring real-time or near-real-time processing. Moving forward, we will incorporate multi-source data, exploring the use of surface reflectance, vegetation indices, and digital elevation models as auxiliary inputs. This aims to better model the relationship between temperature and surface physical properties, further improving fusion accuracy in areas of complex terrain. Additionally, although the Griffith and Ardiethan regions exhibit substantial land-cover heterogeneity, the current evaluation is limited to these two geographic regions. Future work will therefore validate DLSTFM across multiple climatic regions and land-cover regimes to further assess transferability. Finally, future work will optimize the model’s neural architecture to improve efficiency and generalization, investigating lightweight modeling techniques and exploring cross-region, cross-sensor transfer learning strategies. These efforts aim to reduce dependence on region-specific training data while enhancing the model’s universality and practicality.

Author Contributions

Methodology, Y.S.; Software, C.J. and J.L.; Validation, C.J.; Formal analysis, C.J. and Y.S.; Resources, C.J., J.L. and Y.S.; Data curation, C.J. and J.L.; Writing—original draft, C.J., J.L. and Y.S.; Writing—review & editing, C.J. and Y.S.; Supervision, Y.S.; Project administration, C.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 42361053, and the start-up fund of Hainan University under Grant KYQD(ZR)22081.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Li, R.; Wang, M.; Zhang, Z.; Hu, T.; Liu, X. A review of spatiotemporal fusion methods for remotely sensed land surface temperature. Natl. Remote Sens. Bull. 2022, 26, 2433–2450.
2. Yu, Y.; Renzullo, L.J.; McVicar, T.R.; Malone, B.P.; Tian, S. Generating daily 100 m resolution land surface temperature estimates continentally using an unbiased spatiotemporal fusion approach. Remote Sens. Environ. 2023, 297, 113784.
3. Pu, R.; Bonafoni, S. Thermal infrared remote sensing data downscaling investigations: An overview on current status and perspectives. Remote Sens. Appl. Soc. Environ. 2023, 29, 100921.
4. Weng, Q.; Fu, P.; Gao, F. Generating daily land surface temperature at Landsat resolution by fusing Landsat and MODIS data. Remote Sens. Environ. 2014, 145, 55–67.
5. Xiao, J.; Aggarwal, A.K.; Duc, N.H.; Arya, A.; Rage, U.K.; Avtar, R. A review of remote sensing image spatiotemporal fusion: Challenges, applications and recent trends. Remote Sens. Appl. Soc. Environ. 2023, 32, 101005.
6. Chen, G.; Lu, H.; Zou, W.; Li, L.; Emam, M.; Chen, X.; Jing, W.; Wang, J.; Li, C. Spatiotemporal fusion for spectral remote sensing: A statistical analysis and review. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 259–273.
7. Liu, J.; Ma, Y.; Wu, Y.; Chen, F. Review of methods and applications of high spatiotemporal fusion of remote sensing data. J. Remote Sens. 2016, 20, 1038–1049.
8. Zhu, X.; Cai, F.; Tian, J.; Williams, T.K.-A. Spatiotemporal fusion of multisource remote sensing data: Literature survey, taxonomy, principles, applications, and future directions. Remote Sens. 2018, 10, 527.
9. Feng, G.; Masek, J.; Schwaller, M.; Hall, F. On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance. IEEE Trans. Geosci. Remote Sens. 2006, 44, 2207–2218.
10. Zhu, X.; Chen, J.; Gao, F.; Chen, X.; Masek, J.G. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sens. Environ. 2010, 114, 2610–2623.
11. Cheng, Q.; Liu, H.; Shen, H.; Wu, P.; Zhang, L. A spatial and temporal nonlocal filter-based data fusion method. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4476–4488.
12. Wu, M.; Niu, Z.; Wang, C.; Wu, C.; Wang, L. Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion Model. J. Appl. Remote Sens. 2012, 6, 063507.
13. Zhang, W.; Li, A.; Jin, H.; Bian, J.; Zhang, Z.; Lei, G.; Qin, Z.; Huang, C. An Enhanced Spatial and Temporal Data Fusion Model for Fusing Landsat and MODIS Surface Reflectance to Generate High Temporal Landsat-Like Data. Remote Sens. 2013, 5, 5346–5368.
14. Zhu, X.; Helmer, E.H.; Gao, F.; Liu, D.; Chen, J.; Lefsky, M.A. A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sens. Environ. 2016, 172, 165–177.
15. Yu, M.; Huang, Q.; Li, Z. Deep learning for spatiotemporal forecasting in Earth system science: A review. Int. J. Digit. Earth 2024, 17, 2391952.
16. Hao, Z.; Guo, J.; Shen, L.; Luo, Y.; Hu, H.; Wang, G.; Yu, D.; Wen, Y.; Tao, D. Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities. arXiv 2025, arXiv:2505.01043.
17. Chai, J.; Zeng, H.; Li, A.; Ngai, E.W.T. Deep learning in computer vision: A critical review of emerging techniques and application scenarios. Mach. Learn. Appl. 2021, 6, 100134.
18. Malik, N.; Singh, P.V. Deep Learning in Computer Vision: Methods, Interpretation, Causation, and Fairness. In Operations Research & Management Science in the Age of Analytics; INFORMS: Hanover, PA, USA, 2022; pp. 73–100.
19. Wang, A.; Wu, H.; Iwahori, Y. Advances in Computer Vision and Deep Learning and Its Applications. Electronics 2025, 14, 1551.
20. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
21. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Weiwei, T.; Yang, Q.; Wang, J.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716.
22. Vivone, G.; Deng, L.J.; Deng, S.; Hong, D.; Jiang, M.; Li, C.; Li, W.; Shen, H.; Wu, X.; Xiao, J.L.; et al. Deep Learning in Remote Sensing Image Fusion: Methods, protocols, data, and future perspectives. IEEE Geosci. Remote Sens. Mag. 2025, 13, 269–310.
23. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58.
24. Zhou, S.; Zhang, X.; Chu, S.; Zhang, T.; Wang, J. Research on remote sensing image carbon emission monitoring based on deep learning. Signal Process. 2023, 207, 108943.
25. Tan, Z.; Di, L.; Zhang, M.; Guo, L.; Gao, M. An Enhanced Deep Convolutional Model for Spatiotemporal Image Fusion. Remote Sens. 2019, 11, 2898.
26. Lian, Z.; Zhan, Y.; Zhang, W.; Wang, Z.; Liu, W.; Huang, X. Recent Advances in Deep Learning-Based Spatiotemporal Fusion Methods for Remote Sensing Images. Sensors 2025, 25, 1093.
27. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336.
28. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518.
29. Sun, E.; Cui, Y.; Liu, P.; Yan, J. A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities. IEEE Trans. Emerg. Top. Comput. Intell. 2025, 126, 1513–1526.
30. Li, H.; Zhang, J.; Wang, Y.; Fan, X.; Huang, D. Comparison of multi-factor spatial downscaling models for high-resolution LST estimation in mountainous and hilly open-pit mines. Infrared Phys. Technol. 2023, 136, 105085.
31. Li, S.; Wan, H.; Yu, Q.; Wang, X. Downscaling of ERA5 reanalysis land surface temperature based on attention mechanism and Google Earth Engine. Sci. Rep. 2025, 15, 675.
32. Ao, Z.; Sun, Y.; Pan, X.; Xin, Q. Deep learning-based spatiotemporal data fusion using a patch-to-pixel mapping strategy and model comparisons. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5407718.
33. Mienye, I.D.; Swart, T.G. A Comprehensive Review of Deep Learning: Architectures, Recent Advances, and Applications. Information 2024, 15, 755.
34. Wu, J.; Xia, L.; Chan, T.O.; Awange, J.; Zhong, B. Downscaling land surface temperature: A framework based on geographically and temporally neural network weighted autoregressive model with spatio-temporal fused scaling factors. ISPRS J. Photogramm. Remote Sens. 2022, 187, 259–272.
35. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711.
36. Zhou, W.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
37. Renieblas, G.P.; Nogués, A.T.; González, A.M.; Gómez-Leon, N.; Del Castillo, E.G. Structural similarity index family for image quality assessment in radiological images. J. Med. Imaging 2017, 4, 035501.
38. Kallmann, H.E. Transversal Filters. Proc. IRE 1940, 28, 302–310.
39. Sara, U.; Akter, M.; Uddin, M.S. Image quality assessment through FSIM, SSIM, MSE and PSNR—A comparative study. J. Comput. Commun. 2019, 7, 8–18.

Figure 1. Flowchart of the DLSTFM.

Figure 2. DLSTFM network structure diagram. Colors differentiate module types, and arrows indicate the direction of information flow.

Figure 3. Temperature-Adaptive Correction-Module Structure Diagram.

Figure 4. Qualitative comparison of DLSTFM with four other methods on the LM-fusion dataset. From left to right: STARFM, STNLFFM, EDCSTFN, CAFE, DLSTFM, and the target image. The red boxes mark the areas that require attention.

Figure 5. (a) The position of pixel-level transect on the image and (b) Pixel-level transect profiles of each model.

Figure 6. Quantitative comparison of DLSTFM and four other methods on the LM-fusion dataset. Note: the hollow points represent the mean values, while the solid points represent the outliers.

Figure 7. Results of the feature-module ablation experiment: predictions by (a) the full DLSTFM, (b) DLSTFM with the FMM ablated, and (c) DLSTFM with the TCM ablated, compared with (d) the actual Landsat image observed on 12 January 2019.

Figure 8. Results of the network-structure ablation experiment: predictions by (a) DLSTFM and (b) DLSTFM with the fast fusion structure, compared with (c) the actual Landsat image observed on 12 January 2019.

Figure 9. Results of the spatiotemporal fusion of surface temperature in the newly studied area: (a) prediction by DLSTFM and (b) the actual Landsat image observed on 3 January 2014.

Table 1. Image information on training and validation dataset.

Training		Validation
Landsat (DD/MM/YYYY)	MODIS (DD/MM/YYYY)	Landsat (DD/MM/YYYY)	MODIS (DD/MM/YYYY)
20/02/2019	20/02/2019	12/01/2019	12/01/2019
19/04/2019	19/04/2019	03/03/2019	03/03/2019
31/08/2019	31/08/2019	20/09/2019	20/09/2019
02/10/2019	02/10/2019	27/10/2019	27/10/2019
30/11/2019	30/11/2019	14/12/2019	14/12/2019
26/03/2020	26/03/2020	20/02/2020	20/02/2020
20/05/2020	20/05/2020
26/08/2020	26/08/2020
25/11/2020	25/11/2020
08/01/2021	08/01/2021
09/02/2021	09/02/2021
06/03/2021	06/03/2021
28/03/2021	28/03/2021
30/04/2021	30/04/2021
14/09/2021	14/09/2021
01/10/2021	01/10/2021
20/11/2021	20/11/2021
19/12/2021	19/12/2021
02/02/2022	02/02/2022
25/03/2022	25/03/2022
15/04/2022	15/04/2022

Table 2. Quantitative experimental results of five models.

	STARFM	STNLFFM	EDCSTFN	CAFE	DLSTFM
RMSE (K, ↓)	3.408	3.445	6.838	6.608	2.637
MAE (K, ↓)	2.874	2.879	6.026	6.124	2.069
SSIM (↑)	0.861	0.859	0.7	0.798	0.86

Note: ‘↑’ indicates that higher values are better and ‘↓’ indicates that lower values are better; the same applies to the following tables.

Table 3. Quantitative results of the module ablation experiments.

	Dual-Branch Fusion of DLSTFM	FMM Removed	TCM Removed
RMSE (K, ↓)	2.637	6.546	5.912
MAE (K, ↓)	2.069	5.833	5.637
SSIM (↑)	0.86	0.752	0.857

Table 4. Quantitative results of the network structure ablation experiments.

	Dual-Branch Fusion of DLSTFM	Fast Fusion
RMSE (K, ↓)	2.637	6.831
MAE (K, ↓)	2.069	6.375
SSIM (↑)	0.86	0.756

Table 5. Quantitative results of the supplementary experiments.

	DLSTFM
RMSE (K)	0.5022
MAE (K)	0.4318
SSIM	0.931

Table 6. Comparison of fusion times across models.

	STARFM	STNLFFM	EDCSTFN	CAFE	DLSTFM
Fusion duration (s)	0.8	1.1	7.5	11.8	5.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jin, C.; Li, J.; Shen, Y. Deep-Learning Spatial and Temporal Fusion Model for Land Surface Temperature Based on a Spatially Adaptive Feature and Temperature-Adaptive Correction Module. Remote Sens. 2026, 18, 238. https://doi.org/10.3390/rs18020238

