Review

Deep Learning for Spatio-Temporal Fusion in Land Surface Temperature Estimation: A Comprehensive Survey, Experimental Analysis, and Future Trends

1 INSA CVL, Université d’Orléans, PRISME UR 4229, 18022 Bourges, Centre Val de Loire, France
2 Université d’Orléans, INSA CVL, PRISME UR 4229, 45067 Orléans, Centre Val de Loire, France
3 Université d’Orléans, CEDETE, UR 1210, 45067 Orléans, Centre Val de Loire, France
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(2), 289; https://doi.org/10.3390/rs18020289
Submission received: 22 December 2025 / Revised: 10 January 2026 / Accepted: 13 January 2026 / Published: 15 January 2026

Highlights

What are the main findings?
  • Reviews and analyzes deep learning-based spatio-temporal fusion methods for land surface temperature estimation, highlighting their strengths, limitations, and adaptation needs.
  • Provides a comprehensive experimental evaluation on a newly released open-access dataset, revealing pronounced performance variability among state-of-the-art spatio-temporal fusion models and highlighting critical inconsistencies in existing fusion strategies when applied to thermal signals.
What are the implications of the main findings?
  • Supports the development of spatio-temporal fusion methods that explicitly account for land surface temperature’s spatio-temporal characteristics and physical constraints.
  • Offers a structured reference through taxonomy, benchmark dataset, and experimental analysis to guide future research and improve model generalizability for land surface temperature estimation.

Abstract

Land Surface Temperature (LST) plays a key role in climate monitoring, urban heat assessment, and land–atmosphere interactions. However, current thermal infrared satellite sensors cannot simultaneously achieve high spatial and temporal resolution. Spatio-temporal fusion (STF) techniques address this limitation by combining complementary satellite data, one with high spatial but low temporal resolution, and another with high temporal but low spatial resolution. Existing STF techniques, from classical models to modern deep learning (DL) architectures, were primarily developed for surface reflectance (SR). Their application to thermal data remains limited and often overlooks LST-specific spatial and temporal variability. This study provides a focused review of DL-based STF methods for LST. We present a formal mathematical definition of the thermal fusion task, propose a refined taxonomy of relevant DL methods, and analyze the modifications required when adapting SR-oriented models to LST. To support reproducibility and benchmarking, we introduce a new dataset comprising 51 Terra MODIS-Landsat LST pairs from 2013 to 2024, and evaluate representative models to explore their behavior on thermal data.

Graphical Abstract

1. Introduction

Over the past decade, urbanization has intensified globally, with cities accommodating an increasing proportion of the world’s population. By 2020, more than half of the global population lived in urban areas, and European urbanization had already reached 74% by 2017 [1]. This rapid growth intensifies environmental challenges, including the Urban Heat Island (UHI) effect, elevated air and water pollution, and reduced green spaces. Collectively, these impacts increase energy demand, degrade air quality, and threaten public health through higher rates of heat-related mortality, particularly from cardiovascular diseases [2,3,4].
Land Surface Temperature (LST) is a key variable for understanding and managing these environmental processes. Physically, it represents the thermal radiation emitted by the Earth’s surface, reflecting how incoming solar energy interacts with bare ground or vegetated canopies [5]. It is central to revealing the temporal and spatial dynamics of the surface’s equilibrium state [6,7]. This information supports a wide range of applications, including climate monitoring [8,9], urban planning [10,11], and natural resource management [12,13]. Satellites provide the primary means of measuring LST at regional to global scales, offering consistent coverage and frequent revisit times [14]. Consequently, many thermal infrared (TIR) sensors have been deployed, leading to diverse approaches for estimating LST at different spatial and temporal resolutions [15,16,17,18,19,20,21]. Despite these advances, technical and budgetary constraints impose a fundamental trade-off between spatial and temporal resolutions [22,23]. Spatial resolution refers to the level of detail represented within a single pixel of a satellite image [24], while temporal resolution describes the frequency of observations over time for a given region [25]. Achieving both high spatial and high temporal resolutions remains difficult, as finer spatial detail typically reduces revisit frequency, whereas frequent acquisitions rely on coarser spatial coverage [26].
Two main strategies exist for producing LST data with both high spatial and temporal resolutions [27]: spatial downscaling and spatio-temporal fusion (STF) [28]. Spatial downscaling, also referred to as thermal sharpening or disaggregation, enhances spatial resolution by assuming a stable relationship between LST and auxiliary variables across scales [29,30,31,32]. For example, the traditional thermal sharpening approach [29] estimates fine-resolution LST using predictors such as normalized difference vegetation index (NDVI) and fractional vegetation cover, and later studies introduced additional variables including the normalized difference building index (NDBI) [33] and surface albedo [34]. However, this dependency limits robustness, as the relationship between LST and auxiliary features often changes over time [35]. STF, also known as image or data fusion, offers an alternative strategy by combining two remote sensing (RS) satellite sources that share similar spectral characteristics but differ in spatial and temporal resolutions [36]. In most cases, one sensor provides high spatial and low temporal resolutions (HSLT), while the other offers low spatial and high temporal resolutions (LSHT) [37]. Figure 1 illustrates the annual number of publications on STF for LST estimation from 2015 to 2025. The field has experienced rapid growth, rising from 227 papers in 2015 to 1720 papers in 2025. This surge reflects increasing interest in high-spatial and high-temporal (HSHT) LST research, driven by both advances in satellite sensors and the urgent need to address climate-related challenges.
Current STF approaches are commonly classified into four categories: weighted-based, unmixing-based, hybrid, and learning-based methods. Weighted-based methods produce fine-resolution satellite images by leveraging neighborhood information [38,39,40,41,42,43]. For instance, STARFM [38] adjusts pixel weights using separate coefficients for homogeneous and heterogeneous regions, and ESTARFM [39] further refines weights by considering mixed and pure pixels. These approaches are fast and stable, particularly in homogeneous areas, but their reliance on homogeneity assumptions limits detailed reconstruction. Unmixing-based approaches treat coarse-resolution temporal variations as mixtures of finer-scale components [44,45,46,47], yet they often produce blocky artifacts due to limited temporal modeling. Hybrid methods integrate weighted-based and unmixing-based principles to exploit their respective strengths [48,49,50,51]. Learning-based methods rely on training models with existing datasets to learn mappings between HSLT and LSHT images. They are broadly divided into sparse representation-based and deep learning (DL)-based approaches. Sparse representation methods (or dictionary learning) [52,53,54,55] employ Bayesian learning to model coarse-to-fine satellite image relationships. Examples include covariance functions [56], low-pass filtering [57], pixel unmixing [58], joint Gaussian distributions [59], and Kalman filters [60]. These methods, however, depend strongly on satellite image range and involve complex computations [61]. DL-based approaches overcome these limitations by capturing non-linear relationships between input and output satellite images. Architectures explored for STF include Convolutional Neural Networks (CNNs) [37,62,63,64,65,66,67,68,69,70,71], Autoencoders (AEs) [61,72], Generative Adversarial Networks (GANs) [73,74,75,76,77,78,79,80,81,82], Vision Transformers (ViT) [83,84,85,86,87,88,89,90], and Recurrent Neural Networks (RNNs) [70,91,92].
Although a substantial body of STF research exists, most methods were originally developed and validated using SR datasets, and transferring them directly to LST is nontrivial due to fundamentally different spatio-temporal dynamics. SR varies mainly with phenology and seasonal illumination changes, resulting in relatively smooth spatial transitions and limited short-term fluctuations. In contrast, LST exhibits rapid temporal variability, often at hourly or sub-hourly scales, driven by atmospheric conditions, surface energy balance, wind, cloud dynamics, and material-dependent thermal properties [93,94]. Spatially, LST can also change abruptly over short distances, such as between impervious surfaces and vegetation, whereas SR typically transitions more gradually with land cover patterns [93]. A number of surveys have examined STF in RS, but their coverage, technical depth, and relevance to LST vary considerably. Table 1 compares existing surveys across seven criteria. The first is publication year, with more recent works prioritized for their coverage of the latest advancements. The second is the target domain, where, for LST-focused surveys, we assess whether they address how SR-based STF methods could be adapted for LST. We further evaluate whether surveys include DL methods, discuss their limitations, and highlight open challenges. Finally, we consider whether the surveys provide experimental evaluation and whether they introduce or benchmark a dataset to support future research. To date, only three surveys [94,95,96] explicitly address STF for LST. However, none provide a detailed examination of learning-based models, particularly DL, and none perform experimental comparison or challenge the assumption that SR-based STF techniques generalize directly to LST.
In this work, we review recent advances in DL-based STF for LST estimation. To summarize, our main contributions are the following:
  • We provide a comprehensive overview of DL-based STF methods for LST, highlighting their architectures, objectives, and adaptations for LST’s spatio-temporal dynamics.
  • We introduce an open-source MODIS-Landsat LST pair dataset (STF-LST), comprising 51 images spanning 2013–2024, which serves as the first benchmark in the field.
  • We conduct experimental analysis of state-of-the-art DL methods by offering quantitative and qualitative insights into their performance, limitations, and practical applicability for LST estimation.
This review is organized as follows. Section 2 introduces satellite-derived LST, highlighting its physical meaning, the trade-offs between spatial and temporal resolution, and its fundamental differences from SR. Section 3 formulates the STF problem for LST estimation, including its mathematical definition, commonly used loss functions, and evaluation metrics. In Section 4, we propose a novel taxonomy of DL-based STF methods. Section 5 presents an extensive experimental analysis, where representative STF approaches are evaluated on paired MODIS-Landsat LST data. Section 6 discusses the limitations of current methods and outlines promising future research directions. Finally, Section 7 summarizes the main findings of this survey.

2. Satellite-Derived LST

In this section, we will define LST, explain how it is derived from satellite observations, discuss the challenges associated with the trade-off between spatial and temporal resolution in satellite data, and highlight the differences between LST and SR.

2.1. LST Concept and Retrieval

LST is defined as the thermodynamic temperature at the surface of objects. In RS, this surface layer is continuous and projected across all visible components within the sensor’s instantaneous field of view (IFOV), as shown in Figure 2. LST is also referred to as radiometric temperature [108], since retrieving it requires eliminating atmospheric effects and correcting for emissivity. It is given by Equation (1) [14,93,109].
$$T_s(\theta_v, \varphi_v) = B_\lambda^{-1}\left(\frac{A}{B}\right), \qquad A = R_\lambda(\theta_v, \varphi_v) - R_{at,\lambda}^{\uparrow}(\theta_v, \varphi_v) - \tau_\lambda(\theta_v, \varphi_v)\left[1 - \varepsilon_\lambda(\theta_v, \varphi_v)\right] R_{at,\lambda}^{\downarrow}, \qquad B = \tau_\lambda(\theta_v, \varphi_v)\, \varepsilon_\lambda(\theta_v, \varphi_v)$$
where $T_s$ is the radiometric temperature, $\theta_v$ and $\varphi_v$ represent the viewing zenith and azimuth angles, and $\lambda$ is the channel-effective wavelength. $B_\lambda^{-1}$ is the inverse function of Planck’s law, which converts the measured radiance into temperature. $R_\lambda$, $R_{at,\lambda}^{\uparrow}$, and $R_{at,\lambda}^{\downarrow}$ are the at-sensor observed radiance, upward atmospheric radiance, and downward atmospheric radiance, respectively, $\tau_\lambda$ is the channel atmospheric transmittance, and $\varepsilon_\lambda$ is the channel land surface emissivity (LSE).
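To make the correction concrete, the following Python sketch applies Equation (1) for a single channel and then inverts Planck’s law; the band wavelength and all numeric inputs are illustrative values, not those of a specific sensor product.

```python
import numpy as np

# Planck-law constants for spectral radiance in W m^-2 sr^-1 um^-1, wavelength in um
C1 = 1.19104e8   # 2*h*c^2  (W um^4 m^-2 sr^-1)
C2 = 1.43877e4   # h*c/k_B  (um K)

def planck_inverse(radiance, wavelength):
    """Inverse Planck function B_lambda^-1: surface-leaving radiance -> temperature (K)."""
    return C2 / (wavelength * np.log(C1 / (wavelength ** 5 * radiance) + 1.0))

def radiometric_temperature(r_obs, r_up, r_down, tau, emissivity, wavelength=10.9):
    """Equation (1): A = R - R_up - tau*(1 - eps)*R_down, B = tau*eps, T_s = B^-1(A/B)."""
    a = r_obs - r_up - tau * (1.0 - emissivity) * r_down
    b = tau * emissivity
    return planck_inverse(a / b, wavelength)

# Illustrative radiances only (W m^-2 sr^-1 um^-1); yields roughly 300 K.
print(radiometric_temperature(r_obs=9.2, r_up=1.2, r_down=2.0, tau=0.85, emissivity=0.97))
```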
The radiometric temperature has four properties:
1. It is independent of spatial scale, meaning it can be applied across various scales [111].
2. The depth of penetration depends on the wavelength used. In the TIR range ($\lambda \approx 10\ \mu\mathrm{m}$), the depth varies from 1 to 100 $\mu\mathrm{m}$, leading to the term skin temperature [108]. In the microwave range ($\lambda \approx 1\ \mathrm{cm}$), the depth ranges from 0.1 to 10 cm, hence the term subsurface temperature [112].
3. It is affected by the viewing angle, making it directional [108].
4. It represents an average temperature derived from all homogeneous and isothermal elements within the IFOV [108].
Retrieving LST from these radiances presents a challenging and ill-posed problem due to the following reasons:
  • LST retrieval is underdetermined because, for every radiance measurement across N TIR channels, there are N unknown LSEs and an unknown LST.
  • The radiances measured in different TIR channels are highly correlated, making the system of equations unstable and sensitive to small errors in the data.
To overcome these challenges, additional assumptions and constraints are required to either increase the number of equations or reduce the number of unknowns, while decorrelating the data. Consequently, several algorithms have been developed to retrieve LST, including the single-channel [113], split-window [114], and mono-window [115] algorithms.

2.2. Trade-Offs in Spatial and Temporal Resolution for LST Retrieval

The accuracy of LST retrieval is significantly influenced by two key factors: spatial resolution and temporal resolution.
  • Spatial Resolution: defines the size of a pixel in the satellite image, which determines the smallest detectable feature. This is crucial for accurate LST retrieval, as fine-scale spatial resolution captures smaller, more localized temperature variations [25].
  • Temporal Resolution: refers to the frequency at which a satellite revisits the same area. A higher temporal resolution is critical for monitoring dynamic temperature changes over time [26].
Table 2 summarizes the most commonly used satellites for LST retrieval, emphasizing the trade-off between spatial and temporal resolution. For instance, Landsat 8 and 9 provide high spatial resolution (30 m) but revisit the same area only every 16 days. In contrast, MODIS Aqua and Terra offer daily observations (1 day temporal resolution) but at coarser spatial resolution (1 km). STF techniques present a promising approach to overcome this trade-off, and will be discussed in the following sections.

2.3. Differences Between SR and LST Dynamics

SR and LST are both derived from satellite observations, but they display fundamentally different spatial and temporal behaviors because they are governed by distinct physical processes. This shapes how each variable changes across space and time.

2.3.1. Spatial Variations

SR exhibits relatively smooth, structured, and stable spatial patterns [116]. For instance, forests, urban areas, and water bodies generally maintain consistent boundaries over time. SR values also change gradually across space, with neighboring pixels usually showing similar values. In contrast, LST can vary abruptly over short distances [14]. For example, within a single urban area, surfaces with similar reflectance can exhibit large LST differences due to variations in thermal inertia.

2.3.2. Temporal Variations

SR changes gradually over time, driven by seasonal or phenological processes such as vegetation growth, senescence, crop cycles, and land cover changes. As a result, SR exhibits strong temporal correlation and relatively predictable trends [117]. LST, however, is subject to rapid temporal fluctuations due to diurnal cycles, precipitation events, wind, and other atmospheric conditions [14]. LST can also vary significantly within hours, making it much less temporally stable than SR.

3. STF Problem Formulation for LST

This section presents a mathematical formulation of the STF problem for LST estimation. We introduce the relevant notations, describe how DL models are incorporated into the fusion process, categorize the loss functions commonly used during training, and outline the evaluation metrics employed to assess model performance.

3.1. Mathematical Definition

The STF problem can be formulated as a multi-objective optimization task, aiming to simultaneously improve both the spatial and temporal resolutions of LST data. Formally, this can be expressed as in Equation (2).
$$\max \left\{ F(R_s),\ F(R_t) \right\},$$
where $F(R_s)$ and $F(R_t)$ are objective functions corresponding to the enhancement of spatial and temporal resolution, respectively. Specifically, $F(R_s)$ aims to maximize spatial fidelity by enhancing the resolution and preserving fine-scale details in the high-resolution satellite image, while $F(R_t)$ seeks to maximize temporal fidelity by minimizing the temporal gap between consecutive acquisitions. An additional, often overlooked, objective is gap filling in the high-resolution satellite image, making the STF problem a tri-objective optimization task: balancing spatial and temporal resolution enhancement with accurate reconstruction of missing or masked pixels. STF methods leverage known patterns in satellite image pixel values over time (temporal variations) and across spatial scales (spatial variations) to estimate HSHT satellite images. For LST, temporal variations refer to changes in pixel values observed at the same location over time, while spatial variations refer to differences in pixel values between LSHT and HSLT images.
Let $X_1$ and $X_2$ denote data from two satellites: $X_1$ provides LSHT LST data, while $X_2$ provides HSLT LST data. Let $t_1$, $t_2$, $t_3$ be three distinct time steps, and let $s$ denote the geographic region of interest (ROI). Then, $X_1(s, t_i)$ and $X_2(s, t_i)$ for $i \in \{1, 2, 3\}$ represent the LST data from the first or second satellite at time $t_i$ for location $s$. Given two pairs of satellite LST images, $P_1 = \{X_1(s, t_1), X_2(s, t_1)\}$ and $P_3 = \{X_1(s, t_3), X_2(s, t_3)\}$, at times $t_1$ and $t_3$, along with the LSHT LST image $X_1(s, t_2)$ at time $t_2$, the goal of STF is to predict the corresponding HSLT LST image $X_2(s, t_2)$, as illustrated in Figure 3. Therefore, the predicted HSHT LST image at time $t_2$, denoted as $\hat{X}_2(s, t_2)$, can be expressed as in Equation (3).
$$\hat{X}_2(s, t_2) = f\left(P_1, P_3, X_1(s, t_2)\right),$$
where the objective is for $\hat{X}_2(s, t_2)$ to approximate the true HSLT image $X_2(s, t_2)$ as closely as possible. Table 3 summarizes the notations used in this formulation.
However, the STF problem cannot be effectively solved using linear methods [46,118,119]. Non-linear approaches, such as DL, are better suited to capture the complex dependencies between LSHT and HSLT images. DL is a subset of machine learning that employs multi-layer neural networks to learn complex, non-linear relationships from data. To account for these non-linear dependencies, the function f is parameterized by a set of weights W, which are optimized during training, as shown in Equation (4).
$$\hat{X}_2(s, t_2) = f\left(P_1, P_3, X_1(s, t_2);\ W\right).$$
In practice, relying on two pairs of images may not always be feasible, as it requires waiting for a future pair, P 3 , captured under favorable conditions (e.g., minimal cloud coverage) to predict the high-spatial LST at an earlier time. To address this limitation, two alternative strategies are commonly employed: (i) using only one pair of satellite images, P 1 , or (ii) leveraging a time series of previous pairs.
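To fix ideas, the sketch below expresses Equation (4) as a concrete tensor interface; the (B, C, H, W) layout and the assumption that coarse images are resampled to the fine grid beforehand are conventions of this example rather than requirements of any particular method.

```python
import torch

def stf_predict(model, pair_t1, pair_t3, coarse_t2):
    """Equation (4) as an interface: X2_hat(s, t2) = f(P1, P3, X1(s, t2); W),
    with the weights W held inside `model`. Each pair is (fine, coarse), and all
    tensors are (B, C, H, W) already resampled to the fine grid (an assumption)."""
    fine_t1, coarse_t1 = pair_t1
    fine_t3, coarse_t3 = pair_t3
    stacked = torch.cat([fine_t1, coarse_t1, fine_t3, coarse_t3, coarse_t2], dim=1)
    return model(stacked)            # predicted fine-resolution LST at t2

# The single-pair variant discussed above simply drops fine_t3 and coarse_t3.
```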

3.2. Loss Functions

The weights W in Equation (4) are optimized by minimizing a loss function, which measures the discrepancy between the predicted and true high-resolution satellite image. Although many losses originate from general computer vision tasks, their role in STF is tied to preserving temperature gradients, spatial structure, and radiometric consistency.
In practice, the objective is expressed as a weighted combination of multiple complementary loss terms, as shown in Equation (5), where each $\lambda_i$ represents the relative importance of its corresponding loss term.
$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{content} + \lambda_2 \mathcal{L}_{vision} + \lambda_3 \mathcal{L}_{feature} + \lambda_4 \mathcal{L}_{spectral} + \lambda_5 \mathcal{L}_{GAN}, \quad \text{where} \quad \sum_{i=1}^{5} \lambda_i = 1.$$
Here, $\mathcal{L}_{content}$, $\mathcal{L}_{vision}$, $\mathcal{L}_{feature}$, $\mathcal{L}_{spectral}$, and $\mathcal{L}_{GAN}$ denote the content, vision, feature, spectral, and adversarial loss terms, respectively, which are detailed in the following subsections. During training, the loss is minimized using gradient-based optimization algorithms, most commonly Stochastic Gradient Descent (SGD) and Adam [120].
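A minimal PyTorch sketch of such a weighted objective is given below. It instantiates only the content, vision, and spectral terms (Equations (6), (7), and (9) in the following subsections), uses a simplified single-window SSIM to keep the example short, and the loss weights are illustrative.

```python
import torch
import torch.nn.functional as F

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image, a deliberate simplification."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def total_loss(pred, target, weights=(0.7, 0.2, 0.1)):
    """Weighted combination in the spirit of Equation (5); weights are illustrative."""
    l_content = F.mse_loss(pred, target)                        # Equation (6)
    l_vision = 1.0 - ssim_global(pred, target)                  # Equation (7)
    l_spectral = 1.0 - F.cosine_similarity(                     # Equation (9)
        pred.flatten(1), target.flatten(1)).mean()
    w1, w2, w3 = weights
    return w1 * l_content + w2 * l_vision + w3 * l_spectral
```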

3.2.1. Content Loss

It preserves radiometric consistency by ensuring that pixel values remain physically plausible and aligned with real LST values. The most common form is the Mean Squared Error (MSE), expressed in Equation (6), while variants such as Mean Absolute Error (MAE), KL divergence [121], Huber loss [122], or index-driven terms (e.g., NDVI-based [123] and NDBI-based [124] error constraints) have also been used to encode domain knowledge.
$$\mathcal{L}_{content} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{X}_2(s_i, t_2) - X_2(s_i, t_2) \right)^2$$

3.2.2. Vision Loss

It penalizes visually unrealistic reconstructions and helps preserve spatial structure, reducing the risk of overly smoothed predictions that blur LST boundaries. It typically relies on perceptual similarity metrics such as Structural Similarity Index (SSIM) and MS-SSIM [125,126], given by Equation (7), which evaluate local luminance, contrast, and structural consistency rather than pixel-wise differences alone. It may also incorporate edge-aware formulations, such as Sobel-based loss [127], to further enforce gradient continuity and retain sharp transitions.
$$\mathcal{L}_{vision} = 1 - \mathrm{SSIM}\left(\hat{X}_2(s, t_2),\ X_2(s, t_2)\right)$$

3.2.3. Feature Loss

It captures high-level perceptual features by comparing intermediate representations extracted from pre-trained networks (e.g., encoders) [128]. For LST, this loss helps preserve meaningful spatial patterns (e.g., thermal gradients between vegetation, water, and built-up areas) that may not be fully enforced through pixel-wise or vision losses alone. The feature loss is computed as the difference between the feature representations of the predicted image $F_{\hat{X}_2}$ and the reference image $F_{X_2}$, as expressed in Equation (8), where $L$ denotes the number of extracted feature elements.
$$\mathcal{L}_{feature} = \frac{1}{L} \left\| F_{\hat{X}_2} - F_{X_2} \right\|_2^2$$

3.2.4. Spectral Loss

It enforces consistency in the learned feature space by measuring the cosine similarity between the predicted and reference images [75]. In LST-based STF, this loss helps preserve thermal response patterns even when pixel-wise intensities vary (e.g., hotter built-up surfaces compared to cooler vegetation). The formulation is given in Equation (9).
$$\mathcal{L}_{spectral} = 1 - \frac{F_{\hat{X}_2} \cdot F_{X_2}}{\left\| F_{\hat{X}_2} \right\| \left\| F_{X_2} \right\|}$$

3.2.5. Adversarial Loss

It is used when the STF model relies on generative adversarial learning (e.g., GAN-based STF). The loss encourages the predicted high-resolution LST image to follow the distribution of real observations, making the output statistically indistinguishable from true high-resolution LST images [129].

3.3. Evaluation Metrics

STF models are typically evaluated using a diverse set of quantitative and qualitative metrics. These metrics can be grouped into three categories: Error Assessment, Quality Assessment, and Efficiency.
  • Error Assessment Metrics: These metrics measure the numerical discrepancy between the fused LST and the high-resolution reference. Common examples include RMSE, MAE, relative MAE (rMAE), and the coefficient of determination ($R^2$). ERGAS is also frequently used to assess global normalized error, with lower values indicating higher fidelity. A minimal implementation sketch of several of these metrics follows this list.
  • Quality Assessment Metrics: Rather than measuring numerical differences, this category assesses perceptual or structural similarity. SSIM [125], Peak Signal-to-Noise Ratio (PSNR) [130], correlation coefficient (CC) [131], Spectral Angle Mapper (SAM) [132], Learned Perceptual Image Patch Similarity (LPIPS) [133], and Universal Image Quality Index (UIQI) [134] are commonly adopted to quantify texture preservation, sharpness, and spectral consistency in reconstructed LST fields.
  • Efficiency Metrics: Beyond accuracy, computational efficiency is increasingly emphasized, especially for large-scale or near-real-time applications. Metrics include inference time (per fused scene), memory footprint, and scalability with spatial or temporal input size. For example, DL-based STF models are reported to achieve inference speeds orders of magnitude faster than classical algorithms such as ESTARFM [74].
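As a concrete reference for the first two categories, the following NumPy sketch implements RMSE, MAE, $R^2$, and PSNR; the PSNR peak is taken here as the dynamic range of the reference image, which is one convention among several.

```python
import numpy as np

def error_metrics(pred, ref):
    """RMSE, MAE, and R^2 between a fused LST field and its reference (minimal sketch)."""
    pred, ref = np.ravel(pred), np.ravel(ref)
    rmse = np.sqrt(np.mean((pred - ref) ** 2))
    mae = np.mean(np.abs(pred - ref))
    r2 = 1.0 - np.sum((ref - pred) ** 2) / np.sum((ref - ref.mean()) ** 2)
    return rmse, mae, r2

def psnr(pred, ref):
    """PSNR in dB; the peak is taken as the dynamic range of the reference image."""
    mse = np.mean((np.ravel(pred) - np.ravel(ref)) ** 2)
    peak = ref.max() - ref.min()
    return 10.0 * np.log10(peak ** 2 / mse)
```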

4. Taxonomy of DL-Based STF Methods

In this section, we present a taxonomy of DL-based STF methods constructed from 34 representative works selected for recency, relevance, and methodological diversity. The classification is organized around four criteria: Architecture, Learning Paradigm, Training Strategy, and Use of Pre-trained Models (Figure 4). We adopt a broader perspective by including STF methods developed for related RS tasks, such as SR, NDVI, and others. We then map LST-specific methods onto this framework and discuss how each category can be adapted to address the unique spatial and temporal characteristics of LST data. This approach not only highlights shared methodological patterns but also clarifies their relevance and potential adaptation for LST estimation.

4.1. Architectures

STF methods are generally categorized based on their model architecture, a trend that also holds for STF-based LST estimation. Common architectures include CNNs, AEs, GANs, ViT, and RNNs.

4.1.1. Convolutional Neural Networks

CNNs are widely used architectures for processing grid-like data such as satellite images [135,136]. They automatically learn spatial features from input data, making them effective for capturing complex patterns, edges, and textures. A typical CNN consists of layers such as convolutional, pooling, normalization, activation, and fully connected layers [137]:
  • Convolutional Layers: Extract spatial features using local filters.
  • Pooling Layers: Reduce spatial dimensions and improve feature robustness.
  • Normalization Layers: Stabilize and accelerate training.
  • Activation Layers: Introduce non-linearity.
  • Fully Connected Layers: Integrate learned features for prediction.
CNN-based STF methods leverage CNNs to automatically model the complex, non-linear relationships between LSHT and HSLT satellite image pairs, which are then used to predict high-resolution target satellite images. Although CNNs are primarily designed for spatial modeling, temporal information can be incorporated by processing multiple temporal images through separate CNN streams and fusing their outputs to capture spatio-temporal variations. Figure 5 illustrates a typical CNN-based STF architecture using a single pair $P_1$. The network consists of four main blocks: (1) spatial feature extraction, where the high-resolution satellite image is processed through convolutional layers to obtain compressed spatial features; (2) temporal variation extraction, where two low-resolution satellite images are concatenated along channels and passed through convolutional layers to capture temporal features; (3) feature fusion, where spatial and temporal representations are combined in a latent space; and (4) reconstruction, where deconvolutional layers and fully connected layers generate the predicted high-resolution satellite image. When multiple pairs are used, each pair is processed similarly, and their weighted outputs are integrated to produce the final prediction.

Table 4 presents an overview of CNN-based DL methods for STF. The first CNN-based approach was introduced by [37], where a CNN is trained on $n$ MODIS-Landsat pairs and applied using the $P_1$ and $P_3$ satellite pairs to predict the HSHT satellite image at the target date. The model employs an MSE loss and was evaluated on the CIA and LGC datasets. A similar strategy was adopted in [64], while Wang et al. [66] extended the formulation by incorporating two prior MODIS-Landsat pairs and using the Euclidean distance as the loss function. Tan et al. [62] reformulated the problem according to the notation in Section 3.1. Their method uses only one pair $P_1$ together with $X_1(s, t_2)$ to estimate $\hat{X}_2(s, t_2)$. Spatial features from Landsat at $t_1$ and temporal variations between the two MODIS dates are extracted through separate convolutional blocks, fused in a latent space, and reconstructed via deconvolution and fully connected layers. MSE is used as the loss, and the dataset is manually curated. Li et al. [67] proposed a sensor-bias-driven model that explicitly accounts for spectral and spatial inconsistencies between MODIS and Landsat: two CNNs are used, one enhancing coarse reflectance while the second models cross-sensor bias. Qin et al. [68] introduced a multi-constrained loss that simultaneously improves fusion quality and enables gap filling, where the model learns to handle cloud-induced missing pixels through a binary mask mechanism. Yu et al. [71] presented an unsupervised CNN framework for Landsat 8 and Sentinel-2 STF. Meng et al. [70] combined a multiscale Siamese CNN with a convolutional RNN to jointly model spatial-spectral features and temporal dynamics.
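The following PyTorch sketch illustrates this generic four-block, single-pair design (Figure 5) under simplifying assumptions: a single thermal band, coarse inputs pre-resampled to the fine grid, and plain convolutions in place of the deconvolution and fully connected stages. It is a schematic of the design pattern, not a reimplementation of any cited model.

```python
import torch
import torch.nn as nn

class SinglePairCNNFusion(nn.Module):
    """Schematic of the four-block CNN STF design in Figure 5 (not a published model):
    spatial stream, temporal stream, latent fusion, and reconstruction."""

    def __init__(self, channels=1, width=32):
        super().__init__()
        self.spatial = nn.Sequential(            # (1) features from the fine image at t1
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.temporal = nn.Sequential(           # (2) change signal from coarse t1 and t2
            nn.Conv2d(2 * channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.reconstruct = nn.Sequential(        # (3) fusion + (4) reconstruction
            nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, fine_t1, coarse_t1, coarse_t2):
        # Coarse inputs are assumed already resampled to the fine grid.
        h_s = self.spatial(fine_t1)
        h_t = self.temporal(torch.cat([coarse_t1, coarse_t2], dim=1))
        return self.reconstruct(torch.cat([h_s, h_t], dim=1))

# Shape check: SinglePairCNNFusion()(x, x, x) with x = torch.randn(2, 1, 96, 96)
# returns a (2, 1, 96, 96) prediction of the fine-resolution image at t2.
```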
Although these CNN-based STF approaches demonstrate strong performance, none of them were originally designed or validated for LST. Applying them directly to LST often leads to model instability or divergence, primarily because of its distinct spatio-temporal dynamics. To address these challenges, a few CNN-based STF methods have been specifically developed for LST estimation. Yin et al. [65] introduced STTFN, a dual-CNN framework designed to accommodate the spatio-temporal variability and higher non-linearity of LST. The model uses two independently trained networks (forwardCNN and backwardCNN) to estimate $\hat{X}_2(s, t_3)$ and $\hat{X}_2(s, t_1)$ from $P_1$ and $P_3$, respectively. At the prediction stage, both networks process $P_2$, and their outputs are combined using a temporal weighting function to produce the final LST estimate at $t_2$. The Huber loss is adopted to reduce the influence of extreme temperature values. More recently, Sun et al. [69] proposed a cascade STF architecture that explicitly leverages multiple temporal pairs to better capture the temporal smoothness of LST. The framework consists of a supervised module generating initial spatially consistent predictions, followed by a self-supervised refinement block that constrains temporal behavior at prediction dates. In addition to the content loss, the method introduces cycle-consistency and temporal-consistency losses, which are particularly important for LST due to its stronger temporal variation compared to SR.
From Table 4, several trends emerge for CNN-based STF methods. Most approaches rely on MODIS-Landsat satellite pairs. Many methods, including [37,64], use two temporal satellite pairs, whereas others, such as [62], operate with a single satellite pair. Content-based losses (e.g., MSE) are the predominant optimization criteria. A large fraction of studies depend on handcrafted datasets, and only a few, such as [62], provide an open-source implementation, revealing a reproducibility gap. Importantly, only two CNN-based methods [65,69] have been validated on LST data.

4.1.2. Autoencoder

AEs are unsupervised neural networks designed to learn compact and informative representations of input data [138,139]. They consist of an encoder, which maps the input $X = \{x_i\}_{i=1}^{n}$ to a latent representation $H = \{h_i\}_{i=1}^{n}$, and a decoder, which reconstructs the input from this latent space. The network is trained to minimize the reconstruction error between the original and reconstructed input, as defined in Equation (10).
$$\min_\theta J_{AE}(\theta) = \min_\theta \sum_{i=1}^{n} l\left(x_i, x_i'\right) = \min_\theta \sum_{i=1}^{n} l\left(x_i, g_\theta\left(f_\theta(x_i)\right)\right)$$
Here, $x_i$ denotes the $i$-th input sample, $x_i'$ is its reconstruction, and $n$ is the total number of samples in the training dataset. The encoder function $f_\theta$ transforms $x_i$ into a latent representation $h_i$, while the decoder function $g_\theta$ reconstructs the input from this latent representation. The parameters $\theta$ include all weights and biases of both encoder and decoder networks. Stacked AEs (SAEs) extend this architecture by using multiple encoding and decoding layers to capture more complex hierarchical features [140]. Variants such as Sparse [141], Contractive [142], Denoising [143], Variational [144], Convolutional [145], and Recurrent AEs [146] have been developed to improve robustness, noise handling, or temporal and spatial modeling.
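A minimal convolutional instance of this scheme is sketched below; training would minimize Equation (10) with an MSE reconstruction loss, and all layer widths are arbitrary.

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Encoder f_theta and decoder g_theta from Equation (10); widths are arbitrary.
    Training minimizes the reconstruction error, e.g. F.mse_loss(model(x), x)."""

    def __init__(self, channels=1, width=32):
        super().__init__()
        self.encoder = nn.Sequential(            # f_theta: x -> latent h
            nn.Conv2d(channels, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(            # g_theta: h -> reconstruction x'
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width, channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```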
AE-based STF methods extend traditional AEs to model both spatial and temporal relationships between LSHT and HSLT image pairs. Although AEs are primarily unsupervised, STF applications adapt them for supervised tasks using an encoder–fusion–decoder design, as illustrated in Figure 6. In this architecture, a spatial encoder processes the high-resolution satellite image to produce a latent representation $H_1$, while a temporal encoder compresses two low-resolution satellite images into $H_2$. The latent representations $H_1$ and $H_2$ are then fused through a tailored mechanism to capture spatio-temporal dependencies. The merged latent representation is decoded to reconstruct the predicted high-resolution satellite image $\hat{X}_2$, which is compared to the true target $X_2$ to compute the loss. Table 5 provides a summary of AE-based DL approaches for STF. Tan et al. [72] introduced EDCSTFN, which employs a basic AE architecture and a compound loss combining content, feature, and vision components.
For LST estimation, Chen et al. [61] proposed a Conditional Variational AE (CVAE). The encoder utilizes a multi-kernel convolutional transformer to extract global features, while the decoder reconstructs the fused LST image with a simple convolutional structure. A compound loss integrating variational inference and noise mitigation improves robustness against outliers and ensures high-quality reconstruction. Note that GAN-based AE methods are more appropriately categorized under GAN models and are therefore excluded from this AE section for clarity.
Table 5 presents a comparative overview of AE-based STF methods. Most approaches rely on two temporal satellite image pairs [61,72]. Open-source implementations are few, with only [72] providing accessible code. Validation on LST data remains limited, with [61] being a notable exception.

4.1.3. Generative Adversarial Networks

GANs are a class of machine learning models designed for unsupervised learning, where the objective is to produce synthetic data that closely resembles real-world samples [129]. A GAN consists of two neural networks: a generator G, which produces artificial data, and a discriminator D, which evaluates whether the data is real or fake [147,148]. The two networks engage in a competitive, adversarial game, in which the generator aims to fool the discriminator, while the discriminator seeks to accurately distinguish real from generated data [149]. This iterative min-max process enables the generator to produce outputs often indistinguishable from real data. GANs have been widely applied to image generation [150], super-resolution [151], denoising [152], and image-to-image translation [153]. The basic architecture of a standard GAN is illustrated in Figure 7.
The discriminator’s objective is to maximize the likelihood of assigning high scores to real data and low scores to generated data, as formulated in Equation (11). Here, $x \sim \mu(x)$ denotes real samples drawn from the true data distribution, and $z \sim \gamma(z)$ represents latent noise vectors sampled from a prior distribution used as input for the generator.
$$\mathcal{L}(D) = \max_D \left[ \log D(x) + \log\left(1 - D(G(z))\right) \right].$$
Conversely, the generator is trained to minimize the discriminator’s ability to distinguish real from generated data, as shown in Equation (12).
$$\mathcal{L}(G) = \min_G \left[ \log D(x) + \log\left(1 - D(G(z))\right) \right]$$
Combining these two objectives yields the standard adversarial min-max optimization problem expressed in Equation (13).
$$\mathcal{L} = \min_G \max_D \left[ \log D(x) + \log\left(1 - D(G(z))\right) \right]$$
For the full dataset, the objective is expressed using expectations over real data x μ ( x ) and latent variables z γ ( z ) , as presented in Equation (14).
$$\min_G \max_D V(D, G) = \min_G \max_D \left\{ \mathbb{E}_{x \sim \mu}\left[\log D(x)\right] + \mathbb{E}_{z \sim \gamma}\left[\log\left(1 - D(G(z))\right)\right] \right\}$$
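In code, one adversarial update implementing this min-max game can be sketched as follows, using the standard binary cross-entropy parameterization; the non-saturating generator objective (maximizing $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$) is adopted, as is common in practice, and the discriminator is assumed to end with a sigmoid.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z):
    """One discriminator update and one generator update (minimal sketch).

    D is assumed to output probabilities in (0, 1), i.e., to end with a sigmoid.
    """
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    d_real = D(real)
    d_fake = D(G(z).detach())                  # detach: no generator gradient here
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()

    # Generator, non-saturating form: maximize log D(G(z))
    d_gen = D(G(z))
    loss_g = F.binary_cross_entropy(d_gen, torch.ones_like(d_gen))
    opt_G.zero_grad()
    loss_g.backward()
    opt_G.step()
    return loss_d.item(), loss_g.item()
```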
GANs have evolved into multiple variants for diverse applications. Fully Connected GANs use simple MLPs for both networks [129], while Convolutional GAN introduce convolutional layers suitable for image generation [154,155]. Conditional GANs (cGANs) allow controlled generation by conditioning on auxiliary information [156,157]. GANs with inference models, such as ALI [158] and BiGAN [159], map observed data to latent representations. Adversarial AEs (AAEs) combine AE reconstruction with adversarial learning to improve generative modeling [160,161].
GAN-based STF models consist of two primary components, as illustrated in Figure 8: the generator and the discriminator. The generator is responsible for producing high-resolution fused satellite images using either a single temporal satellite pair or two ($P_1$ and $P_3$). The discriminator evaluates these outputs by distinguishing real high-resolution satellite images from generated ones. cGANs are the most widely adopted variant for STF, as conditioning the generation process on low-resolution inputs helps maintain spatial and temporal consistency. The generator typically follows a three-stage design: feature extraction, feature fusion, and image reconstruction. An encoder–decoder architecture is commonly used to enhance spatial resolution during the extraction and reconstruction stages, while the fusion module integrates multi-spatial and multi-temporal features to produce temporally coherent predictions. The discriminator receives as input both the coarse-resolution satellite image at the target date and either the real or the generated fine-resolution satellite image. Its objective is to output a probability indicating whether the pair is real (ground truth) or fused (generated), usually achieved with a sigmoid activation in the final layer. During training, when the discriminator is provided with real pairs (true fine-resolution and coarse-resolution LST images), it is encouraged to classify them as real (label 1). Conversely, when presented with fused pairs (generated fine-resolution and coarse-resolution LST images), it learns to classify them as fake (label 0).

Table 6 presents a comparative summary of cGAN-based DL techniques for STF. In [73], a CycleGAN architecture was used to model temporal transitions between HSLT satellite images at $k-1$ and $k+1$, with final refinements performed using wavelet transforms. A two-stage GAN framework was later proposed in [74], where the generator relies on residual blocks for high-frequency feature extraction and a feature-level fusion strategy, validated on three MODIS-Landsat datasets (CIA, LGC [162], and Shenzhen). Ma et al. [75] modeled spatial, sensor, and temporal inconsistencies using cascaded dual regression networks, four-layer CNNs, and a GAN module, respectively. Several studies have also focused on simplifying the inputs required for GAN-based STF. For example, ref. [76] introduced a cGAN-based model that performs fusion using only two images: a coarse-resolution observation from the prediction date and a fine-resolution reference satellite image from any prior date. Additional improvements include the integration of advanced segmentation and linear injection strategies [77], multilevel feature fusion with attention modules and adaptive instance normalization (MLFF-GAN) [78], as well as robust attention mechanisms designed to handle noisy inputs [79]. More recent advances include adaptive multiscale pyramidal architectures with deformable convolutions [80] and diffusion-based generative models that iteratively refine noisy inputs using dual-stream U-Net architectures [81]. These approaches have been evaluated on a diverse set of benchmark datasets, including CIA, LGC [162], Tianjin [97], and the E-Smile dataset [163].
For LST estimation, Chen et al. [82] proposed a cGAN-based STF architecture specifically designed for LST generation. The approach constructs an unsupervised generative network, which iteratively generates fine spatio-temporal resolution LST data from reference LST images containing missing values.
Table 6 provides a summary of GAN-based STF methods. Most approaches rely on two temporal pairs for predictions, though some, such as [75,77,82], use a single pair. Loss functions typically combine adversarial losses with additional terms, such as content, spectral, or perceptual losses, to improve reconstruction quality and spatial fidelity. Open-source implementations are relatively common, with [76,78,80] providing accessible code, which supports reproducibility and further development. Importantly, only one GAN-based STF method [82] has been validated on LST data.

4.1.4. Vision Transformers

Transformers were originally introduced for natural language processing to model long-range dependencies through self-attention mechanisms [164], achieving significant success in tasks such as machine translation and text generation [165]. Motivated by these results, Dosovitskiy et al. [166] proposed the ViT, which extended Transformer architectures to computer vision tasks by processing images as sequences of fixed-size patches instead of relying on convolutional operations. Figure 9 illustrates the basic architecture of ViT. In ViT, an input image is divided into non-overlapping patches, each of which is flattened and linearly projected into a patch embedding. These embeddings are combined with positional encodings to preserve spatial information and then passed through a standard Transformer encoder composed of multi-head self-attention and feed-forward networks.
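The patchify-and-project front end can be written compactly with a strided convolution, as in the sketch below; the patch size, embedding dimension, and use of learned (rather than fixed sinusoidal) positional embeddings are assumptions of this example.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT front end: split the image into fixed-size patches, project each patch
    linearly, and add positional embeddings (learned here, an assumption)."""

    def __init__(self, img_size=256, patch=16, channels=1, dim=256):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided convolution performs patch extraction and linear projection at once.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        return tokens + self.pos                            # ready for self-attention

# The token sequence can then be fed to a standard transformer encoder, e.g.
# nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=8), num_layers=4)
```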
Since the introduction of ViT, numerous architectural variants have been proposed to enhance training efficiency, local feature modeling, and multi-scale representation. DeiT [167] improved data efficiency using knowledge distillation. TNT [168] introduced nested tokens to better capture local structure, while PVT [169] incorporated pyramid-style hierarchical features for dense prediction tasks. Swin Transformer [170] addressed the lack of locality in ViT by employing shifted window self-attention. Additional improvements include spatially separable attention in Twins [171], region-aware token interaction in ViL [172], and outlook attention in VOLO [173].
ViT-based STF models leverage self-attention mechanisms to capture long-range spatial and temporal dependencies more effectively than convolutional architectures. As shown in Figure 10, each input satellite image is first decomposed into a sequence of fixed-size patches, to which positional embeddings are added in order to preserve spatial structure. A spatial transformer encoder processes the high-resolution observation, while a temporal transformer encoder compresses the two coarse-resolution satellite images into latent patch representations. The resulting spatial and temporal sequences are then fused and fed into a transformer decoder, which reconstructs the output patch sequence. After reshaping, the model produces the final high-resolution prediction at the target date.

Table 7 provides an overview of ViT-based DL approaches for STF. In [83], a multi-stream architecture combining ViT with CNNs was introduced to jointly exploit global temporal dependencies and local feature extraction. The method was evaluated on the LGC, CIA [162], and AHB [97] datasets. In [85], the authors proposed an STF framework based on the Swin Transformer and integrated linear spectral mixing theory. Li et al. [86] improved ViT-based STF by introducing an enhanced transformer encoder and dilated convolutions to enlarge the receptive field. MSFusion [84] employs a texture ViT to better model spatial structural details, while Jiang and Shao [87] developed an AE with a multi-kernel convolutional ViT encoder to extract global features across scales. More recently, Benzenati et al. [88] introduced STF-Trans, a transformer-based fusion method that requires only a single high-resolution observation at any arbitrary date alongside coarse-resolution temporal inputs, using an AE ViT design to effectively capture long-range dependencies. In addition, Ma et al. [89] proposed SFT-GAN, which integrates a sparse fast ViT within a GAN-based framework to enhance multi-scale feature extraction and spectral consistency.
For LST-specific applications, Hu et al. [90] proposed THSTNet, a ViT-based STF model which employs a two-stage Swin ViT architecture with spatiotemporal mapping and a texture converter module to improve the reconstruction of fine-resolution LST.
Table 7 summarizes recent ViT-based methods for STF. Most approaches emerged after 2021, following the introduction of ViT in 2020. Loss functions largely prioritize content preservation, while some methods incorporate additional vision-guided constraints to enhance spatial or spectral fidelity. Several works, such as [85,89], provide open-source implementations that support reproducibility and further development. Notably, only one ViT-based STF model has been specifically designed and validated for LST data [90].

4.1.5. Recurrent Neural Networks

RNNs are a class of supervised machine learning models designed to process sequential or time-series data [138]. They incorporate feedback connections that allow information from previous time steps to influence the current output, enabling the network to retain temporal dependencies [174,175]. Variants such as Long Short-Term Memory networks (LSTMs) [176], Bidirectional LSTMs [177], Stacked LSTMs [178], and Gated Recurrent Units (GRUs) [179] have been developed to overcome limitations including vanishing gradients, capture long-term dependencies, and reduce computational complexity.
In RNN-based STF, the aim is to leverage temporal dependencies between coarse- and fine-resolution satellite images. Given a sequence of $n$ temporal satellite pairs $P_i$, the RNN is trained to map coarse-resolution inputs to fine-resolution predictions at each time step $t_i$, denoted as $\hat{X}_2(s, t_i)$, for $i \in [1, n]$. After training, the model generates fine-resolution estimates for new time steps, which are then fused with existing coarse-resolution data to reconstruct high-quality satellite image outputs. Table 8 gives an overview of RNN-based DL frameworks applied to STF. Yang et al. [91] proposed a hybrid model combining a Super-Resolution CNN with an LSTM. The CNN enhances spatial resolution, while the LSTM captures temporal patterns, particularly for rapid phenological changes. Zhan et al. [92] extended this approach by integrating a UNet for spatial mapping between MODIS and Sentinel-2 images, combined with an LSTM to exploit temporal dynamics for generating high-resolution NDVI data during critical crop growth periods.
Table 8 summarizes RNN-based STF methods. Overall, RNNs are less commonly used in STF compared to other architectures, primarily because STF benefits from models that jointly capture both spatial and temporal dependencies, whereas RNNs are optimized for sequential data. Notably, RNN-based approaches have not yet been explored for LST-specific STF, which highlights a potential avenue for future research contributions.

4.2. Learning Paradigm

Various learning paradigms have been applied to STF, and they can be grouped into four main categories: supervised, unsupervised, self-supervised, and collaborative learning. These categories differ in how they leverage available high-resolution data.
1. Supervised learning. This paradigm relies on paired training samples, where fine-resolution observations $X_2(s, t_2)$ are available as targets. Most existing STF models fall into this category. While supervised learning has proven effective, its dependence on cloud-free, fine-resolution LST data limits scalability in real LST applications.
2. Unsupervised learning. Here, the model is trained without fine-resolution labels, meaning $X_2(s, t_2)$ is unknown. Only one recent study has explored an unsupervised STF formulation [71]. This direction is especially promising for LST.
3. Self-supervised learning. Positioned between supervised and unsupervised paradigms, self-supervised learning creates proxy tasks or pseudo-labels directly from the data [180]. To date, only one STF method has adopted this strategy [69]. Extending self-supervised schemes to LST STF remains largely unexplored and could help reduce reliance on scarce fine-resolution LST images.
4. Collaborative learning. This strategy treats STF as a cooperative process, where different learners interact to improve fusion quality [181]. Only one study has explicitly framed STF in this way [70]. Such paradigms could be beneficial for LST, as they may better exploit complementary cues between coarse and fine thermal observations.

4.3. Training Strategy

Training strategies refer to the techniques employed to enhance the performance and stability of neural networks during optimization. In STF, these strategies play a key role in capturing spatial-temporal dependencies, especially for LST, where models must handle strong thermal spatio-temporal variability. Existing STF methods generally adopt four main training strategies: residual learning, attention mechanisms, normalization, and dropout.
1. Residual learning. Residual learning [182] introduces skip connections that allow the network to learn a residual function instead of the full mapping, which stabilizes the optimization of DL architectures [183]. The residual formulation is defined in Equation (15).
$$F(x) := H(x) - x$$
where $H(x)$ is the desired underlying function and $F(x)$ is the residual mapping, so the output of the residual block becomes $F(x) + x = H(x)$. As shown in Table 9, residual learning is the most common strategy across STF methods due to its ability to preserve essential spatial and temporal structures in LST data (a minimal residual block sketch is given after Table 9).
2. Attention mechanisms. Attention mechanisms enable a model to focus on the most informative components of the input. In STF, four forms are used: channel, spatial, temporal, and feature attention. Channel attention assigns importance to individual spectral bands, although its usefulness is limited for LST-focused STF because LST data generally contain only one thermal band. Spatial attention highlights salient spatial regions, helping the model detect areas with strong temperature variability or sharp thermal gradients. Temporal attention emphasizes key time steps, allowing the network to capture rapid LST fluctuations and short-term dynamics. Feature attention evaluates the relevance of entire feature maps. As summarized in Table 9, all ViT-based STF methods incorporate spatial attention, consistent with the fundamental role of attention in ViT architectures.
3. Normalization. Normalization refers to a set of transformations applied to stabilize and accelerate model training by enforcing desired statistical properties such as centering, scaling, or decorrelation [184]. In STF, five main normalization strategies are encountered. Batch Normalization (BN) [185] mitigates internal covariate shift by standardizing activations within each mini-batch. Given an activation $a$, BN computes its normalized form as shown in Equation (16).
$$\hat{a}^{(i)} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ and $\sigma^2$ denote the mini-batch mean and variance, and $\epsilon > 0$ ensures numerical stability. Group Normalization (GN) [186] standardizes activations within predefined groups. Instance Normalization (IN) [187] normalizes each sample independently and is used to reduce contrast-related variations. Spectral Normalization (SN) [188] is mainly applied in GAN-based STF methods; it stabilizes discriminator training by constraining the Lipschitz constant of weight matrices. Finally, Switchable Normalization (SwN) [189] combines three types of statistics: channel-wise, layer-wise, and mini-batch-wise. As shown in Table 9, STF methods vary widely in their choice of normalization strategy, reflecting different architectural needs and constraints, especially when dealing with LST data.
4. Dropout. Dropout randomly deactivates neurons during training to reduce overfitting [190]. Although only a few STF approaches use dropout (Table 9), it remains a promising direction for LST STF, where limited high-resolution observations can make models prone to overfitting.
Table 9. Overview of training strategies employed in state-of-the-art STF methods. The table categorizes strategies into residual learning, attention mechanisms, normalization techniques, and dropout, and lists the corresponding methods that utilize each strategy. For attention and normalization, subtypes are also indicated to highlight specific implementations across different STF approaches.
| Training Strategy | Subtype | List of Methods |
| --- | --- | --- |
| Residual Learning | — | [37,61,64,65,66,67,69,72,74,75,76,78,79,80,81,83,86,87,90] |
| Attention Mechanism | Channel Attention | [68,75,87,89] |
| Attention Mechanism | Spatial Attention | [81,83,84,85,86,87,88,90] |
| Attention Mechanism | Temporal Attention | [77,78,79] |
| Attention Mechanism | Feature Attention | [78] |
| Normalization | Batch Normalization | [37,61,65,68,74,75,76,79,80,92] |
| Normalization | Group Normalization | [81,87] |
| Normalization | Instance Normalization | [78] |
| Normalization | Spectral Normalization | [75,76,79,82] |
| Normalization | Switchable Normalization | [75,76,84] |
| Dropout | — | [80,82,83,86,91,92] |
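To make the residual strategy concrete, the sketch below implements a residual block in PyTorch combining Equations (15) and (16); layer widths and kernel sizes are arbitrary choices for illustration, not taken from any cited method.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of Equation (15): the block learns F(x) = H(x) - x and outputs F(x) + x.
    Batch normalization (Equation (16)) follows each convolution."""

    def __init__(self, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width),   # standardizes with mini-batch mean/variance
            nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width),
        )

    def forward(self, x):
        return x + self.body(x)     # F(x) + x = H(x)
```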

4.4. Incorporation of Pre-Trained Models

Incorporating pre-trained models is a widely used strategy in DL to enhance model generalization, improve convergence, and mitigate overfitting [191,192]. This is particularly relevant in STF, including LST-specific applications, where paired high- and low-resolution datasets are limited. Pre-training can be applied at two main levels (Table 10):
1. Feature Extraction: In this strategy, pre-trained models are used solely to extract informative features from the input data without further fine-tuning [193]. Within STF, feature extraction is often employed to compute spectral or perceptual losses (see Section 3.2.3 and Section 3.2.4).
2. Transfer Learning: Transfer learning aims to improve performance on a target task by adapting knowledge from a related source domain [194]. Formally, given a source dataset $D_S$ with task $T_S$ and a target dataset $D_T$ with task $T_T$, transfer learning seeks to enhance the target predictive function $f_T(\cdot)$ by utilizing knowledge from $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$. In STF for LST, Chen et al. [61] pre-trained an autoencoder on simulated LST data generated by downscaling MODIS measurements to 4 km resolution via pixel aggregation, and transferred the learned parameters to initialize their fusion framework. Additionally, Huang et al. [81] proposed a fine-tuning strategy to adapt pre-trained models to new regions, but did not employ pre-training for initial model training, and is thus not included under this category.
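To illustrate the transfer-learning strategy, the following PyTorch sketch adapts weights learned on a source task to an LST target model; the architecture, checkpoint file name, and frozen-layer choice are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical target model; the checkpoint name below is illustrative.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

# Load source-task weights (e.g., learned on SR fusion or simulated coarse LST);
# strict=False keeps only the layers whose names and shapes match.
source_state = torch.load('source_task_weights.pt')
model.load_state_dict(source_state, strict=False)

# Optionally freeze the earliest (most generic) layers and fine-tune the rest on LST.
for p in model[0].parameters():
    p.requires_grad = False
```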

5. Experimental Analysis and Results

This section presents the experimental analysis and results. We begin by defining the ROI and providing a comprehensive description of our LST dataset (STF-LST). Next, we compare representative state-of-the-art STF approaches both quantitatively and qualitatively, and we highlight the challenges posed by the domain shift from SR to LST data. The dataset and accompanying resources are publicly available at https://github.com/Sofianebouaziz1/STF-LST (accessed on 12 January 2026).

5.1. Region of Interest

The ROI is located within Orléans Métropole in the Centre-Val de Loire region of France. As shown in Figure 11(a1), it extends from longitudes 1.7505°E to 2.1263°E and latitudes 47.7605°N to 48.0133°N, covering approximately 334 km². The Loire River, France’s longest watercourse, crosses the ROI and significantly modulates its local microclimate, introducing marked spatial variations in LST (Figure 11(a2)). The ROI exhibits a high degree of land cover heterogeneity, including dense urban cores, water bodies, forested surfaces, industrial areas, and agricultural croplands (Figure 11(b1–b5)). This mixture produces strong contrasts in radiative properties, emissivity, and thermal inertia. Such conditions are particularly challenging for LST STF, as they generate sharp temperature gradients, nonlinear temporal dynamics, and heterogeneous spatial patterns that must be reconstructed at fine resolution. Orléans Métropole, with nearly 280,000 inhabitants across 22 municipalities, presents a representative mid-sized European urban environment where UHI, river-induced cooling, and peri-urban agricultural influences coexist. This diversity makes the ROI an appropriate benchmark for evaluating the robustness and generalization ability of STF models for LST data.

5.2. Satellite Data

This study relies on two complementary satellite products accessed through the Google Earth Engine (GEE) platform [195]: (i) MODIS/Terra LST and Emissivity Daily Global 1 km (MOD11A1, Collection 6.1) for coarse-resolution observations, and (ii) Landsat 8 USGS Level-2 Collection 2 Tier 1 for fine-resolution LST. These datasets were selected for two main reasons. First, both satellites acquire data during mid-morning overpasses, which ensures that MODIS and Landsat capture surface and atmospheric conditions that are highly comparable in illumination, temperature, and emissivity. Second, when combined, they provide complementary strengths: MODIS offers daily revisit frequency at 1 km resolution, whereas Landsat 8 provides detailed 30 m spatial information but with a 16-day revisit cycle. Fusing the two therefore offers a path to fine-resolution daily LST. MODIS LST values were extracted from the LST_Day_1km band derived using the split-window algorithm, which yields an RMSE below 2 °C across most land cover types [196]. Landsat 8 LST was derived from the ST_B10 thermal band using the single-channel algorithm, with reported accuracy around 1.5 °C [113]. Both products in GEE include standard atmospheric corrections, emissivity adjustments, and quality assurance masks, which reduces preprocessing inconsistencies [197,198]. However, no explicit cross-sensor radiometric calibration or normalization between MODIS and Landsat LST was applied prior to training. In STF, the model may implicitly learn systematic inter-sensor differences from overlapping reference data at previous or future timesteps. Nevertheless, all STF methods were trained and evaluated using the same preprocessed data, so such inconsistencies are expected to affect all methods similarly and do not compromise the validity of the comparative analysis.
A total of 51 MODIS-Landsat paired observations were collected between 14 April 2013 and 05 October 2024. Only scenes with acceptable cloud levels were retained: <20% cloud cover for Landsat 8 and <10% for MODIS. Each pair corresponds to a date on which the MODIS observation and the Landsat 8 acquisition overlapped spatially and temporally over the defined ROI. All 51 pairs used in the experiments are listed in Table 11. The dataset was divided into training, validation, and test subsets and structured into temporal triplets of the form ( t i , t i + 1 , t i + 2 ) . Within the training set, successive triplets were allowed to overlap to increase the density and variability of training instances: for example, the sample ( t 1 , t 2 , t 3 ) is followed by ( t 2 , t 3 , t 4 ) . Such overlapping sampling is widely used in temporal learning settings, as it exposes the model to a richer set of temporal transitions while preserving the physical coherence of the series (a minimal sketch of the construction is given below). However, to avoid temporal leakage and ensure a fair evaluation, no overlap was allowed between the training, validation, and test subsets. The final distribution consists of 34 triplet samples for training (14 April 2013 to 07 August 2020), 4 for validation (27 November 2020 to 09 May 2022), and 8 for testing (13 August 2022 to 05 October 2024). Each sample consists of a previous, target, and future overlapping pair between MODIS and Landsat.
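The overlapping triplet construction can be summarized as follows; the dates list is a toy stand-in for the acquisition dates of Table 11.

```python
# Minimal sketch of the triplet construction described above: overlapping
# triplets within the training split, disjoint dates across splits.
def make_triplets(dates):
    """Return overlapping (t_i, t_{i+1}, t_{i+2}) triplets."""
    return [(dates[i], dates[i + 1], dates[i + 2]) for i in range(len(dates) - 2)]

dates = [f"t{i}" for i in range(1, 8)]   # toy stand-in for acquisition dates
print(make_triplets(dates[:5]))          # [('t1','t2','t3'), ('t2','t3','t4'), ('t3','t4','t5')]
```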
LST gaps were corrected using interpolation techniques. Landsat 8 LST gaps were reconstructed using temporal interpolation over a 32-day window by exploiting two valid observations before and after each missing acquisition to recover smoothly varying temperature patterns. Remaining invalid pixels were filled using an adaptive spatial strategy based on a focal-mean filter, which expanded iteratively until at least one valid neighbor was available. MODIS LST gaps were reconstructed by applying only spatial gap filling using the same adaptive focal-mean approach. After gap correction, all MODIS LST images were resampled to 30 m using bicubic interpolation to ensure pixel-wise alignment with Landsat 8 and to harmonize the spatial dimensions required for STF.
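The adaptive spatial filling step can be approximated by the following simplified sketch, in which each invalid pixel receives the mean of the valid neighbors within a window that expands until at least one is found. This is a NumPy approximation of the strategy described above, not the exact GEE implementation.

```python
# A hedged sketch of adaptive focal-mean gap filling: each invalid pixel takes
# the mean of valid neighbors in a window that grows until one is found.
import numpy as np

def focal_mean_fill(lst: np.ndarray, max_radius: int = 15) -> np.ndarray:
    filled = lst.copy()
    ys, xs = np.where(np.isnan(filled))
    for y, x in zip(ys, xs):
        for r in range(1, max_radius + 1):   # expand until a valid neighbor appears
            win = lst[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            valid = win[~np.isnan(win)]
            if valid.size > 0:
                filled[y, x] = valid.mean()
                break
    return filled

img = np.random.rand(10, 10) * 30 + 280      # toy LST scene in kelvin
img[4:6, 4:6] = np.nan                       # simulated cloud gap
print(np.isnan(focal_mean_fill(img)).sum())  # 0
```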
All LST images of size 950 × 950 pixels were divided into fixed-size patches. A patch size of 95 × 95 with a stride of 20 pixels was selected to balance spatial context against computational cost, ensuring that each patch retained sufficient thermal and textural variability without mixing heterogeneous land-cover patterns excessively. This procedure generated 62,866 training patches and 7396 validation patches from the available scenes.
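The patch extraction procedure is reproduced below in simplified form; with a 95 × 95 window and a 20-pixel stride over a 950 × 950 scene, each image yields 43 × 43 = 1849 patches, consistent with the reported totals (34 × 1849 = 62,866 training patches and 4 × 1849 = 7396 validation patches).

```python
# Sketch of the patch extraction described above: 95x95 windows with a
# 20-pixel stride over a 950x950 scene.
import numpy as np

def extract_patches(img: np.ndarray, size: int = 95, stride: int = 20):
    h, w = img.shape
    return np.stack([
        img[y:y + size, x:x + size]
        for y in range(0, h - size + 1, stride)
        for x in range(0, w - size + 1, stride)
    ])

scene = np.zeros((950, 950), dtype=np.float32)
patches = extract_patches(scene)
print(patches.shape)   # (1849, 95, 95); 34 training triplets -> 34 * 1849 = 62,866
```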

5.3. Quantitative Comparison

The quantitative evaluation relies on six widely used metrics: RMSE, ERGAS, SSIM, PSNR, SAM, and CC, introduced earlier in Section 3.3. Together, they capture complementary aspects of fusion quality, including pixel-wise accuracy, spectral consistency, structural similarity, and global fidelity. To ensure a representative and balanced comparison, we selected approaches that are established in the STF literature with demonstrated effectiveness and that provide publicly available implementations to guarantee reproducibility. Following these criteria, four methods were retained. ESTARFM [39], originally designed for SR-based STF, serves as a classical reference for weight-based methods. STTFN [65] is a CNN-based architecture explicitly developed for LST STF. EDCSTFN [72], an AE-based STF framework designed for SR data, enables testing the transferability of SR-oriented STF to LST data. MLFF-GAN [78], a GAN-based model integrating multi-level feature fusion, represents a modern generative approach with strong performance on SR tasks. All DL-based STF models were trained for 200 epochs with a batch size of 32. The learning rates were set to 1.5 × 10⁻⁵ for STTFN, 10⁻³ for EDCSTFN, and 2 × 10⁻⁴ for MLFF-GAN. All experiments were conducted on an NVIDIA RTX A6000 GPU.
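For reference, the sketch below gives straightforward NumPy implementations of three of the six metrics (RMSE, PSNR, and CC). These follow the standard definitions recalled in Section 3.3; taking the reference maximum as the PSNR peak value is an assumption of this sketch.

```python
# Hedged reference implementations of RMSE, PSNR, and CC for LST patches.
import numpy as np

def rmse(pred, ref):
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def psnr(pred, ref):
    peak = ref.max()                      # assumption: peak = reference maximum
    return float(20 * np.log10(peak / rmse(pred, ref)))

def cc(pred, ref):
    return float(np.corrcoef(pred.ravel(), ref.ravel())[0, 1])

ref = np.random.rand(95, 95) * 30 + 280   # toy reference LST patch (K)
pred = ref + np.random.randn(95, 95) * 2  # prediction with ~2 K noise
print(round(rmse(pred, ref), 2), round(psnr(pred, ref), 1), round(cc(pred, ref), 3))
```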
Quantitative results obtained from these evaluations are reported in Table 12. On 29 August 2022, ESTARFM yields the best scores across all metrics among the compared methods. Nevertheless, its RMSE remains high at 5.35 °C, indicating large absolute LST errors. This confirms that existing STF methods exhibit substantial degradation on thermal data. DL-based approaches show even larger errors and lower structural consistency, which reflects difficulties in modeling the strong thermal contrasts typical of late-summer conditions. On 30 September 2022, EDCSTFN achieves the lowest RMSE of 2.32 °C and the best CC, PSNR, and SAM scores, slightly outperforming ESTARFM in absolute accuracy. ESTARFM and MLFF-GAN follow closely, while STTFN exhibits lower structural and spectral fidelity. Despite EDCSTFN’s better performance, RMSE values above 2 °C still indicate notable deviations in LST prediction. For 01 November 2022, 13 June 2023, and 19 October 2023, the results clearly indicate that DL-based STF methods outperform the traditional ESTARFM. Across these dates, DL approaches consistently achieve RMSE values below approximately 2 °C, which is generally considered acceptable for LST estimation. This improvement reflects their superior ability to exploit non-linear spatial and temporal relationships under thermally stable or moderately varying conditions, whereas ESTARFM remains limited by its linear assumptions and shows higher absolute errors. However, this apparent advantage does not necessarily reflect a superior capability to model LST-specific dynamics. Instead, the limited spatio-temporal variability between the target date t 2 and the reference pairs ( P 1 and P 3 ) results in thermally stable conditions, under which all models benefit from reduced temporal complexity. In such cases, DL models capture LST patterns effectively, but their performance gains stem primarily from favorable data conditions. On 28 May 2023, the traditional ESTARFM method outperforms the DL-based models, although its RMSE remains above 2 °C. This suggests that when LST exhibits smoother, quasi-linear temporal evolution, classical STF methods can remain competitive, though they do not provide a substantial accuracy advantage. For 12 April 2024 and 19 September 2024, all evaluated models fail to accurately reconstruct the spatio-temporal variability of LST. Errors increase substantially, and none of the methods demonstrate consistent spatial or temporal fidelity.
The average results reported in Table 13 confirm that none of the evaluated STF methods is able to reliably reconstruct LST with high accuracy. Across all methods, RMSE values remain above 3 °C, indicating substantial absolute temperature errors that are incompatible with many LST-driven applications. Although EDCSTFN achieves the best overall performance among the DL-based approaches, its RMSE of 3.04 °C remains high. ESTARFM exhibits the weakest performance overall, reflecting the inability of linear, weight-based formulations to capture the non-linear and scale-dependent behavior of LST. Importantly, the relatively small performance gap between traditional and DL-based methods suggests that architectures originally designed for SR fusion do not directly transfer to thermal data.

5.4. Qualitative Comparison

Figure 12 provides a qualitative comparison between the Landsat 8 reference LST and STF-based reconstructions produced by STTFN, EDCSTFN, and MLFF-GAN over the full ROI and seven representative zoomed subregions. Satellite map views are used to associate observed thermal patterns and artifacts with underlying land-cover structures.
In Figure 12a, corresponding to the full ROI, the Landsat 8 reference LST exhibits a well-defined spatial organization of hot and cold regions driven by land-cover heterogeneity. All STF methods successfully identify large cold areas associated with the Loire river and forested regions. However, substantial discrepancies arise over urban and industrial zones. STTFN systematically underestimates high LST values, leading to muted urban hot spots. In contrast, EDCSTFN overestimates temperatures over built-up areas, producing spatially expanded hot regions that extend beyond actual impervious surfaces. MLFF-GAN yields a more realistic global LST distribution but introduces pronounced block-like artifacts aligned with patch boundaries, which disrupt spatial continuity and degrade physically plausible thermal gradients.
Figure 12b focuses on the Orléans city center, where the reference LST reveals strong thermal heterogeneity linked to dense urban fabric, bridges, and the Loire river, as confirmed by the satellite imagery. EDCSTFN markedly overestimates urban temperatures, generating exaggerated hot zones while partially preserving linear bridge-related structures. STTFN produces overly smoothed temperature fields that suppress fine-scale urban morphology and fail to capture road and bridge-induced thermal contrasts. MLFF-GAN better balances under and overestimation and preserves the primary urban-river contrast. However, severe patch-wise artifacts introduce abrupt and unrealistic transitions between adjacent regions.
In Figure 12c, corresponding to the Orléans forest, the Landsat 8 reference indicates generally low temperatures with subtle intra-forest variability associated with canopy structure and forest edges. While STTFN and EDCSTFN correctly reproduce the overall cooling effect of dense vegetation, both generate overly homogeneous temperature fields and fail to recover fine-scale thermal variability. Localized warmer areas present in the reference are either smoothed out or merged with surrounding cooler regions. MLFF-GAN introduces artificial spatial patterns unrelated to forest structure, with block-aligned artifacts cutting across natural boundaries and further degrading physical coherence.
Figure 12d depicts the semi-urban corridor along the Loire and Loiret rivers. The reference LST clearly delineates both rivers as cold linear features with sharp banks visible in the satellite imagery. STTFN struggles to preserve narrow water bodies, particularly the Loiret river, resulting in weakened river-induced cooling and blurred boundaries. EDCSTFN more accurately captures river geometry and cooling signals but overestimates adjacent urban temperatures. MLFF-GAN reproduces large-scale thermal patterns but exhibits grid-like artifacts that intersect river courses, disrupting spatial continuity and river morphology.
Figure 12e focuses on a large industrial area characterized by extensive impervious surfaces, dense road networks, and sparse vegetation, as indicated by the satellite map view. The Landsat 8 reference LST exhibits pronounced thermal heterogeneity, with distinct hot spots associated with large industrial buildings and paved surfaces, and cooler patches linked to vegetated buffers. STTFN substantially underestimates LST over this area, resulting in attenuated thermal contrasts that fail to represent the strong heat retention properties of industrial materials. Conversely, EDCSTFN markedly overestimates LST, producing spatially inflated hot regions that extend well beyond the actual industrial footprint visible in the satellite imagery. MLFF-GAN yields more balanced temperature magnitudes. However, its reconstruction is affected by conspicuous block-like artifacts aligned with patch boundaries, which compromise spatial realism.
Figure 12f depicts a heterogeneous residential environment composed of small housing blocks, local road networks, vegetated parcels, and the Loiret river crossing the scene, as confirmed by the satellite map view. The Landsat 8 reference LST reveals fine-scale thermal variability, with warmer residential surfaces interspersed with cooler vegetated areas and a clearly defined cooling corridor associated with the river. STTFN reproduces a relatively realistic overall temperature distribution across residential zones, capturing the general contrast between built-up and vegetated surfaces. However, it fails to preserve the Loiret river’s thermal signature, which becomes largely indistinguishable from adjacent land covers due to excessive spatial smoothing. EDCSTFN substantially overestimates LST across the residential fabric. Although the river-induced cooling effect remains partially visible, it lacks sharp boundaries and physical coherence. MLFF-GAN succeeds in jointly representing both residential thermal patterns and the presence of the Loiret river, but this apparent advantage is undermined by strong grid-like artifacts that introduce artificial spatial structures inconsistent with the underlying urban morphology.
Figure 12g presents an agricultural landscape dominated by croplands with relatively homogeneous land cover, intersected by a curved section of the Loire river, as shown in the satellite map view. The Landsat 8 reference LST is characterized by generally low temperatures and smooth spatial gradients. All STF methods correctly reproduce the overall cooling effect of croplands, indicating that under thermally uniform and slowly varying conditions, the STF task is less challenging. Nevertheless, notable differences persist. STTFN fails to clearly delineate the Loire river, whose cooling signal is substantially weakened and partially blended into surrounding croplands. MLFF-GAN captures the broad thermal patterns and the river-induced cooling effect but introduces artificial block-wise structures that cut across agricultural parcels and natural boundaries.
Overall, the qualitative analysis over the ROI and across urban, industrial, residential, forested, and agricultural subregions reveals systematic limitations shared by DL-based STF methods when applied to LST. Models originally developed for SR-based STF struggle to preserve sharp thermal gradients induced by rivers, bridges, and impervious surfaces, and to accurately reconstruct localized thermal extremes associated with urban and industrial hotspots. STTFN exhibits a pronounced tendency toward spatial oversmoothing, which suppresses fine-scale thermal variability and attenuates narrow cooling features. EDCSTFN consistently overestimates LST over built-up areas, generating spatially inflated hotspots that ignore land-cover boundaries evident in the satellite imagery. MLFF-GAN yields more balanced temperature magnitudes, but introduces strong patch-wise artifacts, leading to artificial spatial discontinuities that undermine physical consistency and thermal realism.

6. Limitations and Future Trends

Although substantial advances have been achieved in DL-based STF methods, our experimental analysis demonstrates that directly transferring STF models originally developed for SR to LST remains highly challenging. The intrinsic physical differences between reflectance and thermal signals introduce systematic errors that current DL architectures fail to adequately address. This section therefore enumerates the main limitations of current STF-based models when applied to LST data and outlines promising directions for future research.

6.1. Inaccurate LST Estimations

LST is typically retrieved from satellite observations using physically based algorithms, as described in Section 2.1. These retrieval processes introduce uncertainties arising from atmospheric correction, surface emissivity assumptions, sensor noise, and viewing geometry. In the context of STF, multiple satellite-derived LST products are fused. Consequently, inaccuracies present in the input data are directly propagated through the fusion pipeline. When DL-based models are trained on such imperfect inputs, initial LST errors are not corrected but instead may be amplified or accumulated, ultimately degrading the reliability of the fused predictions. This issue complicates the interpretation of STF performance, as observed discrepancies may reflect limitations of the LST retrieval process rather than deficiencies of the STF model itself. For a more realistic and fair evaluation of STF methods, the integration of independent and reliable reference data, such as in situ ground-based temperature measurements, is highly desirable. Validation against such measurements would enable a more accurate assessment of STF model performance while reducing the confounding influence of retrieval-induced LST errors [199,200].

6.2. Cloudy Conditions

Cloud cover and associated shadows constitute a major challenge for RS applications, as they obstruct the sensor’s ability to acquire clear and temporally consistent observations of the Earth’s surface [201]. These effects result in missing or severely corrupted measurements [202]. In the context of STF, such gaps are commonly handled through interpolation or gap-filling strategies prior to fusion. However, the accuracy of these approaches remains limited for LST data, especially in regions experiencing persistent cloud cover or rapid land surface dynamics. A more robust strategy is to explicitly account for missing pixels during the training of DL-based STF models. By learning directly from incomplete observations, the network can exploit spatio-temporal dependencies among the available inputs, such as X 1 ( s , t 1 ) , P 1 , and P 3 , to infer missing information. This transforms the STF task into a multi-objective learning problem, which aims not only to enhance spatial and temporal resolution but also to reduce or eliminate data gaps in the reconstructed outputs, as formalized in Section 3.1. Recent work, such as MUSTFN [68], demonstrates the potential of this paradigm by producing fused outputs with substantially fewer missing pixels. Nevertheless, the explicit treatment of cloud-induced gaps within STF frameworks remains largely unexplored and constitutes a critical research direction.
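As a concrete illustration, a masked loss of the kind implied by this paradigm can be written as follows; the mask semantics and the choice of an L1 penalty are assumptions of this sketch rather than the formulation of any cited method.

```python
# Minimal sketch of learning directly from incomplete observations: the loss
# is computed only over valid (cloud-free) pixels via a binary mask.
import torch

def masked_l1(pred: torch.Tensor, target: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """L1 loss restricted to valid pixels; `valid` is 1 where the target is observed."""
    diff = (pred - target).abs() * valid
    return diff.sum() / valid.sum().clamp(min=1)  # avoid division by zero

pred = torch.randn(2, 1, 95, 95)
target = torch.randn(2, 1, 95, 95)
valid = (torch.rand(2, 1, 95, 95) > 0.2).float()  # ~20% simulated cloud gaps
print(masked_l1(pred, target, valid).item())
```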

6.3. Poor Generalizability

When transferring a trained STF model to a new geographical ROI, one of the primary challenges is the domain shift problem. This issue arises because spatio-temporal dynamics, such as seasonal cycles, meteorological conditions, land-cover composition, and urban development, can differ substantially between the training region and the target region. As a result, models trained under specific climatic or environmental conditions may fail to generalize effectively. To mitigate this limitation, fine-tuning becomes essential. Fine-tuning adapts the pretrained model to the target region by updating its parameters using region-specific data, allowing it to capture local temporal dynamics while retaining the transferable representations learned during initial training. Ideally, STF models should be designed to learn robust and generalizable features that remain valid across diverse regions and environmental conditions. For example, studies such as [81] assume that spatial relationships and sensor-related biases are largely invariant across regions, and therefore emphasize adapting temporal representations through fine-tuning.
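Under this invariance assumption, region adaptation can be sketched as partial fine-tuning: spatial feature extractors are frozen while temporal layers are updated on target-region data. The STFModel decomposition below is a hypothetical simplification, not the architecture of [81].

```python
# Hedged sketch of region adaptation by partial fine-tuning.
import torch.nn as nn
import torch.optim as optim

class STFModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.temporal = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(32, 1, 3, padding=1))
    def forward(self, x):
        return self.temporal(self.spatial(x))

model = STFModel()
# model.load_state_dict(torch.load("pretrained_source_region.pt"))  # hypothetical checkpoint

for p in model.spatial.parameters():   # freeze spatial representations
    p.requires_grad = False

optimizer = optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)  # fine-tune only the temporal layers on target-region pairs
```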

6.4. Leveraging Pretrained Models

The use of pretrained models, although highly effective in many computer vision tasks, remains largely underexplored in STF. To date, ref. [61] is among the few studies that explicitly investigate the role of pretraining in STF. In broader image fusion and representation learning tasks, large-scale natural image datasets such as ImageNet are commonly used to pretrain deep networks for generic feature extraction. For example, refs. [203,204] employed ResNet50 to capture high-frequency features, while refs. [205,206] adopted VGG19 for deep feature extraction. Similarly, ref. [207] utilized DenseNet-201 pretrained on ImageNet to extract hierarchical representations. However, features learned from natural RGB images are not necessarily optimal for STF, particularly for LST, where the underlying physical processes and signal characteristics differ substantially from those of natural imagery. An alternative and potentially more suitable strategy is to pretrain models directly on fusion-oriented datasets, allowing them to learn task-specific representations tailored to multi-source and multi-resolution data. With the rapid growth of fusion datasets in domains such as medical imaging [208], these datasets could serve as a valuable pretraining foundation. Models pretrained in this manner could then be fine-tuned for STF tasks [209].
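As a concrete example of such feature extraction, the sketch below applies an ImageNet-pretrained VGG19 (as in [205,206]) to compare fused and reference LST patches in feature space; replicating the single thermal band to three channels and truncating the network after its first blocks are assumptions of this sketch.

```python
# Hedged sketch of ImageNet-pretrained feature extraction for a perceptual
# comparison of fused and reference LST (downloads VGG19 weights on first run).
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:9].eval()
for p in vgg.parameters():
    p.requires_grad = False                 # feature extraction only, no fine-tuning

def perceptual_distance(pred: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Replicate the single thermal band to three channels to fit the RGB network.
    pred3, ref3 = pred.repeat(1, 3, 1, 1), ref.repeat(1, 3, 1, 1)
    return torch.nn.functional.l1_loss(vgg(pred3), vgg(ref3))

pred = torch.rand(1, 1, 95, 95)             # normalized fused LST patch
ref = torch.rand(1, 1, 95, 95)              # normalized reference patch
print(perceptual_distance(pred, ref).item())
```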

6.5. Insufficient Spatial Resolution

DL-based STF methods for LST are fundamentally constrained by the spatial resolution of the finest available thermal observations, which are typically provided by Landsat at 30 m. While this resolution is adequate for many regional-scale applications, it is insufficient for studies requiring detailed characterization of surface thermal patterns, such as UHI analysis. UHI phenomena are driven by fine-scale urban elements, including roads, buildings, vegetation patches, and urban morphology, that induce strong thermal contrasts at spatial scales smaller than 30 m. As a result, microclimatic variations and localized thermal extremes are often smoothed or entirely missed at this resolution. A promising direction is the integration of higher spatial resolution optical sensors, such as Sentinel-2 (10 m) and PlanetScope (3 m), which provide detailed SR information. Although these platforms lack TIR sensors, their high-resolution optical data can be exploited to guide the spatial disaggregation of coarser thermal observations. Some studies have explored the fusion of MODIS and Sentinel-2 [210] or Landsat 8 and Sentinel-2 [211] to estimate LST at 10 m resolution. However, most existing approaches rely on linear assumptions, which limits their ability to capture complex urban thermal processes. Consequently, these methods often struggle to deliver accurate and physically consistent high-resolution LST estimates, particularly in heterogeneous urban environments.

6.6. Joint Spatio-Temporal Deep Learning Architectures

Current STF research largely relies on DL models that prioritize either spatial or temporal modeling, but rarely both in a unified manner. Temporal architectures, such as RNNs and LSTM networks, have demonstrated effectiveness in capturing temporal dynamics in satellite time series. However, they are limited in their ability to explicitly model spatial dependencies and fine-scale spatial heterogeneity. In contrast, spatial architectures, including CNNs, AEs, GANs, and ViT-based models, excel at extracting spatial features and structural patterns, but are not designed to model temporal evolution. Although a small number of studies have explored the integration of CNNs with LSTMs, such as [70], fully unified DL-based STF architectures remain largely underexplored in the STF literature. This limitation is particularly critical for LST fusion, where thermal processes are governed by the joint interaction of spatial heterogeneity and temporal dynamics. Developing models that can simultaneously learn spatial structure and temporal evolution explicitly and coherently represents an important direction.
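A compact illustration of such joint modeling is the ConvLSTM cell sketched below, in which convolutions capture spatial structure while the recurrent state carries temporal evolution; it is a generic textbook construction rather than the architecture of any cited STF method.

```python
# A minimal ConvLSTM cell: convolutional gates over a recurrent state.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hidden_ch: int, k: int = 3):
        super().__init__()
        self.hidden_ch = hidden_ch
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                    # temporal memory update
        h = o * torch.tanh(c)                # spatially structured hidden state
        return h, c

cell = ConvLSTMCell(in_ch=1, hidden_ch=16)
B, T, H, W = 2, 3, 95, 95                    # e.g., a (t1, t2, t3) LST triplet
h = torch.zeros(B, 16, H, W)
c = torch.zeros(B, 16, H, W)
for t in range(T):                           # unroll over the temporal axis
    h, c = cell(torch.randn(B, 1, H, W), (h, c))
print(h.shape)                               # torch.Size([2, 16, 95, 95])
```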

6.7. Integration of Large Language Models

Large Language Models (LLMs) have recently demonstrated strong capabilities across a wide range of natural language processing and multimodal reasoning tasks. Recent studies have explored LLMs in diverse domains, including RS data interpretation [212], medical diagnosis [213], and automated code generation [214]. This rapid progress raises the question of whether LLMs can play a role in STF, not as direct fusion operators, but as complementary components that enrich the fusion process. One potential direction is the extraction of high-level semantic information from satellite imagery and its integration alongside visual features. For example, ref. [215] demonstrated that semantic prompts can be generated from images and subsequently processed by LLMs such as ChatGPT to produce structured textual descriptions. These semantic representations can then be fused with visual features through cross-attention. To date, such LLM-assisted paradigms have not been applied to STF tasks, and no existing methods incorporate textual semantics into the STF of satellite data. Nonetheless, such paradigms may offer new opportunities for improving STF, particularly for complex scenes where land-cover context plays an important role.
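As a purely speculative illustration, the fusion step could take the form of a cross-attention layer in which visual feature tokens attend to text-derived semantic embeddings; the token counts and embedding dimension below are arbitrary assumptions.

```python
# Speculative sketch: visual tokens attending to LLM-derived semantic tokens.
import torch
import torch.nn as nn

dim = 64
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

visual = torch.randn(2, 361, dim)   # e.g., a flattened 19x19 visual feature map
text = torch.randn(2, 16, dim)      # semantic tokens from an LLM scene description
fused, _ = attn(query=visual, key=text, value=text)
print(fused.shape)                  # same token layout as the visual input
```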

7. Conclusions

In this work, we have presented a comprehensive survey and experimental analysis of the latest advancements in STF methods for LST estimation, with a particular focus on DL-based approaches. We began by formulating the STF for LST problem mathematically and reviewing the principal DL techniques employed in this domain. Building on this foundation, we proposed a novel taxonomy that categorizes existing methods across key dimensions, including architecture, learning paradigms, training strategies, and the use of pre-trained models. Through extensive experiments on our proposed MODIS-Landsat LST pair dataset (STF-LST), we systematically assessed the performance and reliability of state-of-the-art STF methods. Our analysis demonstrates that, although DL-based approaches can capture spatio-temporal patterns effectively under conditions of limited variability, directly transferring models designed for SR to LST remains problematic. Key challenges include accurately reconstructing sharp thermal gradients, representing extreme LST values, and handling cloud-induced gaps, all of which limit the practical applicability of current methods. Finally, we identified several promising avenues for future research. By integrating a comprehensive taxonomy with extensive empirical evaluation, this study offers a clear reference for understanding the strengths and limitations of DL-based STF methods, while providing guidance for developing more robust and physically consistent approaches to LST fusion.

Author Contributions

Conceptualization, S.B., A.H., R.C. and R.N.; methodology, S.B. and A.H.; software, S.B.; validation, S.B., A.H. and R.N.; formal analysis, S.B.; investigation, S.B., A.H., R.C. and R.N.; resources, S.B., A.H., R.C. and R.N.; data curation, S.B. and R.N.; writing—original draft preparation, S.B.; writing—review and editing, S.B., A.H. and R.N.; visualization, S.B.; supervision, A.H., R.C. and R.N.; project administration, S.B., A.H., R.C. and R.N.; funding acquisition, A.H., R.C. and R.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was carried out as a part of the CHOISIR project funded by Métropole d’Orléans and Région Centre-Val de Loire.

Data Availability Statement

The data and source code supporting the findings of this study are publicly available in the GitHub repository https://github.com/Sofianebouaziz1/STF-LST (accessed on 12 January 2026). The repository provides a fully reproducible framework for generating the STF-LST dataset, including scripts for downloading, preprocessing, and organising the MODIS and Landsat data pairs using the Google Earth Engine platform. The dataset consists of 51 paired MODIS/Landsat LST images covering the Orléans Métropole area (France) between April 2013 and October 2024. All satellite data used are publicly accessible from their respective providers. Users must hold a valid Google Earth Engine account to reproduce the dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Population Reference Bureau. 2007 World Population Data Sheet; Population Reference Bureau: Washington, DC, USA, 2007. [Google Scholar]
  2. Vanos, J.; Cakmak, S.; Kalkstein, L.; Yagouti, A. Association of weather and air pollution interactions on daily mortality in 12 Canadian cities. Air Qual. Atmos. Health 2015, 8, 307–320. [Google Scholar] [CrossRef]
  3. Nuruzzaman, M. Urban heat island: Causes, effects and mitigation measures-a review. Int. J. Environ. Monit. Anal. 2015, 3, 67–73. [Google Scholar] [CrossRef]
  4. Masiol, M.; Agostinelli, C.; Formenton, G.; Tarabotti, E.; Pavoni, B. Thirteen years of air pollution hourly monitoring in a large city: Potential sources, trends, cycles and effects of car-free days. Sci. Total Environ. 2014, 494, 84–96. [Google Scholar] [CrossRef]
  5. Hulley, G.C.; Ghent, D.; Göttsche, F.M.; Guillevic, P.C.; Mildrexler, D.J.; Coll, C. 3-Land Surface Temperature. In Taking the Temperature of the Earth; Hulley, G.C., Ghent, D., Eds.; Elsevier: Amsterdam, The Netherlands, 2019; pp. 57–127. [Google Scholar] [CrossRef]
  6. Li, X.; Zhou, W.; Ouyang, Z. Relationship between land surface temperature and spatial pattern of greenspace: What are the effects of spatial resolution? Landsc. Urban Plan. 2013, 114, 1–8. [Google Scholar] [CrossRef]
  7. Kerr, Y.H.; Lagouarde, J.P.; Nerry, F.; Ottlé, C. Land surface temperature retrieval techniques and applications: Case of the AVHRR. In Thermal Remote Sensing in Land Surface Processing; CRC Press: Boca Raton, FL, USA, 2004; pp. 33–109. [Google Scholar]
  8. Schneider, P.; Hook, S.J. Space observations of inland water bodies show rapid surface warming since 1985. Geophys. Res. Lett. 2010, 37, L22405. [Google Scholar] [CrossRef]
  9. Hall, D.K.; Comiso, J.C.; DiGirolamo, N.E.; Shuman, C.A.; Key, J.R.; Koenig, L.S. A satellite-derived climate-quality data record of the clear-sky surface temperature of the Greenland ice sheet. J. Clim. 2012, 25, 4785–4798. [Google Scholar] [CrossRef]
  10. Ibrahim, I.; Samah, A.A.; Fauzi, R. Land surface temperature and biophysical factors in urban planning. In Proceedings of the International Conference on Ecosystem, Environment and Sustainable Development, Kuala Lumpur, Malaysia, 21–23 June 2012; Volume 68, pp. 1792–1797. [Google Scholar]
  11. Maimaitiyiming, M.; Ghulam, A.; Tiyip, T.; Pla, F.; Latorre-Carmona, P.; Halik, Ü.; Sawut, M.; Caetano, M. Effects of green space spatial pattern on land surface temperature: Implications for sustainable urban planning and climate change adaptation. ISPRS J. Photogramm. Remote Sens. 2014, 89, 59–66. [Google Scholar] [CrossRef]
  12. Luyssaert, S.; Jammet, M.; Stoy, P.C.; Estel, S.; Pongratz, J.; Ceschia, E.; Churkina, G.; Don, A.; Erb, K.; Ferlicoq, M.; et al. Land management and land-cover change have impacts of similar magnitude on surface temperature. Nat. Clim. Chang. 2014, 4, 389–393. [Google Scholar] [CrossRef]
  13. Kafy, A.A.; Rahman, M.S.; Faisal, A.A.; Hasan, M.M.; Islam, M. Modelling future land use land cover changes and their impacts on land surface temperatures in Rajshahi, Bangladesh. Remote Sens. Appl. Soc. Environ. 2020, 18, 100314. [Google Scholar] [CrossRef]
  14. Li, Z.L.; Tang, B.H.; Wu, H.; Ren, H.; Yan, G.; Wan, Z.; Trigo, I.F.; Sobrino, J.A. Satellite-derived land surface temperature: Current status and perspectives. Remote Sens. Environ. 2013, 131, 14–37. [Google Scholar] [CrossRef]
  15. Wan, Z.; Dozier, J. A generalized split-window algorithm for retrieving land-surface temperature from space. IEEE Trans. Geosci. Remote Sens. 1996, 34, 892–905. [Google Scholar]
  16. Gillespie, A.; Rokugawa, S.; Matsunaga, T.; Cothern, J.S.; Hook, S.; Kahle, A.B. A temperature and emissivity separation algorithm for Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) images. IEEE Trans. Geosci. Remote Sens. 1998, 36, 1113–1126. [Google Scholar] [CrossRef]
  17. Sun, D.; Pinker, R.T. Estimation of land surface temperature from a Geostationary Operational Environmental Satellite (GOES-8). J. Geophys. Res. Atmos. 2003, 108, 4326. [Google Scholar] [CrossRef]
  18. Trigo, I.F.; Dacamara, C.C.; Viterbo, P.; Roujean, J.L.; Olesen, F.; Barroso, C.; Camacho-de Coca, F.; Carrer, D.; Freitas, S.C.; García-Haro, J.; et al. The satellite application facility for land surface analysis. Int. J. Remote Sens. 2011, 32, 2725–2744. [Google Scholar] [CrossRef]
  19. Malakar, N.K.; Hulley, G.C.; Hook, S.J.; Laraby, K.; Cook, M.; Schott, J.R. An operational land surface temperature product for Landsat thermal data: Methodology and validation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5717–5735. [Google Scholar] [CrossRef]
  20. Koetz, B.; Bastiaanssen, W.; Berger, M.; Defourney, P.; Del Bello, U.; Drusch, M.; Drinkwater, M.; Duca, R.; Fernandez, V.; Ghent, D.; et al. High spatio-temporal resolution land surface temperature mission-a copernicus candidate mission in support of agricultural monitoring. In Proceedings of the Igarss 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8160–8162. [Google Scholar]
  21. Shen, Y.; Shen, H.; Cheng, Q.; Zhang, L. Generating comparable and fine-scale time series of summer land surface temperature for thermal environment monitoring. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 2136–2147. [Google Scholar] [CrossRef]
  22. Chen, B.; Huang, B.; Xu, B. Comparison of spatiotemporal fusion models: A review. Remote Sens. 2015, 7, 1798–1835. [Google Scholar] [CrossRef]
  23. Shen, H.; Meng, X.; Zhang, L. An integrated framework for the spatio–temporal–spectral fusion of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7135–7148. [Google Scholar] [CrossRef]
  24. Zhang, J.; Li, J. Chapter 11—Spacecraft. In Spatial Cognitive Engine Technology; Zhang, J., Li, J., Eds.; Academic Press: Cambridge, MA, USA, 2023; pp. 129–162. [Google Scholar] [CrossRef]
  25. Gibson, P. Chapter 1—A systematic view of remote sensing (Second Edition). In Advanced Remote Sensing; Liang, S., Wang, J., Eds.; Academic Press: Cambridge, MA, USA, 2020; pp. 1–57. [Google Scholar] [CrossRef]
  26. Zhan, W.; Chen, Y.; Zhou, J.; Wang, J.; Liu, W.; Voogt, J.; Zhu, X.; Quan, J.; Li, J. Disaggregation of remotely sensed land surface temperature: Literature survey, taxonomy, issues, and caveats. Remote Sens. Environ. 2013, 131, 119–139. [Google Scholar] [CrossRef]
  27. Mao, Q.; Peng, J.; Wang, Y. Resolution enhancement of remotely sensed land surface temperature: Current status and perspectives. Remote Sens. 2021, 13, 1306. [Google Scholar] [CrossRef]
  28. Belgiu, M.; Stein, A. Spatiotemporal image fusion in remote sensing. Remote Sens. 2019, 11, 818. [Google Scholar] [CrossRef]
  29. Agam, N.; Kustas, W.P.; Anderson, M.C.; Li, F.; Neale, C.M. A vegetation index based technique for spatial sharpening of thermal imagery. Remote Sens. Environ. 2007, 107, 545–558. [Google Scholar] [CrossRef]
  30. Nichol, J. An emissivity modulation method for spatial enhancement of thermal satellite images in urban heat island analysis. Photogramm. Eng. Remote Sens. 2009, 75, 547–556. [Google Scholar] [CrossRef]
  31. Duan, S.B.; Li, Z.L. Spatial downscaling of MODIS land surface temperatures using geographically weighted regression: Case study in northern China. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6458–6469. [Google Scholar] [CrossRef]
  32. Zhu, X.; Song, X.; Leng, P.; Hu, R. Spatial downscaling of land surface temperature with the multi-scale geographically weighted regression. Natl. Remote Sens. Bull. 2021, 25, 1749–1766. [Google Scholar]
  33. Agathangelidis, I.; Cartalis, C. Improving the disaggregation of MODIS land surface temperatures in an urban environment: A statistical downscaling approach using high-resolution emissivity. Int. J. Remote Sens. 2019, 40, 5261–5286. [Google Scholar] [CrossRef]
  34. Dominguez, A.; Kleissl, J.; Luvall, J.C.; Rickman, D.L. High-resolution urban thermal sharpener (HUTS). Remote Sens. Environ. 2011, 115, 1772–1780. [Google Scholar] [CrossRef]
  35. Zhan, W.; Huang, F.; Quan, J.; Zhu, X.; Gao, L.; Zhou, J.; Ju, W. Disaggregation of remotely sensed land surface temperature: A new dynamic methodology. J. Geophys. Res. Atmos. 2016, 121, 10–538. [Google Scholar] [CrossRef]
  36. Zhu, X.; Cai, F.; Tian, J.; Williams, T.K.A. Spatiotemporal fusion of multisource remote sensing data: Literature survey, taxonomy, principles, applications, and future directions. Remote Sens. 2018, 10, 527. [Google Scholar] [CrossRef]
  37. Song, H.; Liu, Q.; Wang, G.; Hang, R.; Huang, B. Spatiotemporal satellite image fusion using deep convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 821–829. [Google Scholar] [CrossRef]
  38. Gao, F.; Masek, J.; Schwaller, M.; Hall, F. On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance. IEEE Trans. Geosci. Remote Sens. 2006, 44, 2207–2218. [Google Scholar]
  39. Zhu, X.; Chen, J.; Gao, F.; Chen, X.; Masek, J.G. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sens. Environ. 2010, 114, 2610–2623. [Google Scholar] [CrossRef]
  40. Kim, J.; Hogue, T.S. Evaluation and sensitivity testing of a coupled Landsat-MODIS downscaling method for land surface temperature and vegetation indices in semi-arid regions. J. Appl. Remote Sens. 2012, 6, 063569. [Google Scholar] [CrossRef]
  41. Wu, P.; Shen, H.; Ai, T.; Liu, Y. Land-surface temperature retrieval at high spatial and temporal resolutions based on multi-sensor fusion. Int. J. Digit. Earth 2013, 6, 113–133. [Google Scholar] [CrossRef]
  42. Huang, B.; Wang, J.; Song, H.; Fu, D.; Wong, K. Generating high spatiotemporal resolution land surface temperature for urban heat island monitoring. IEEE Geosci. Remote Sens. Lett. 2013, 10, 1011–1015. [Google Scholar] [CrossRef]
  43. Wu, P.; Shen, H.; Zhang, L.; Göttsche, F.M. Integrated fusion of multi-scale polar-orbiting and geostationary satellite observations for the mapping of high spatial and temporal resolution land surface temperature. Remote Sens. Environ. 2015, 156, 169–181. [Google Scholar] [CrossRef]
  44. Wu, M.; Niu, Z.; Wang, C.; Wu, C.; Wang, L. Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion model. J. Appl. Remote Sens. 2012, 6, 063507. [Google Scholar]
  45. Zhang, W.; Li, A.; Jin, H.; Bian, J.; Zhang, Z.; Lei, G.; Qin, Z.; Huang, C. An enhanced spatial and temporal data fusion model for fusing Landsat and MODIS surface reflectance to generate high temporal Landsat-like data. Remote Sens. 2013, 5, 5346–5368. [Google Scholar] [CrossRef]
  46. Zhang, H.K.; Huang, B.; Zhang, M.; Cao, K.; Yu, L. A generalization of spatial and temporal fusion methods for remotely sensed surface parameters. Int. J. Remote Sens. 2015, 36, 4411–4445. [Google Scholar] [CrossRef]
  47. Wu, M.; Huang, W.; Niu, Z.; Wang, C. Generating daily synthetic Landsat imagery by combining Landsat and MODIS data. Sensors 2015, 15, 24002–24025. [Google Scholar] [CrossRef] [PubMed]
  48. Zhu, X.; Helmer, E.H.; Gao, F.; Liu, D.; Chen, J.; Lefsky, M.A. A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sens. Environ. 2016, 172, 165–177. [Google Scholar] [CrossRef]
  49. Li, X.; Ling, F.; Foody, G.M.; Ge, Y.; Zhang, Y.; Du, Y. Generating a series of fine spatial and temporal resolution land cover maps by fusing coarse spatial resolution remotely sensed images and fine spatial resolution land cover maps. Remote Sens. Environ. 2017, 196, 293–311. [Google Scholar]
  50. Quan, J.; Zhan, W.; Ma, T.; Du, Y.; Guo, Z.; Qin, B. An integrated model for generating hourly Landsat-like land surface temperatures over heterogeneous landscapes. Remote Sens. Environ. 2018, 206, 403–423. [Google Scholar]
  51. Xia, H.; Chen, Y.; Li, Y.; Quan, J. Combining kernel-driven and fusion-based methods to generate daily high-spatial-resolution land surface temperatures. Remote Sens. Environ. 2019, 224, 259–274. [Google Scholar] [CrossRef]
  52. Huang, B.; Song, H. Spatiotemporal reflectance fusion via sparse representation. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3707–3716. [Google Scholar] [CrossRef]
  53. Wu, B.; Huang, B.; Zhang, L. An error-bound-regularized sparse coding for spatiotemporal reflectance fusion. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6791–6803. [Google Scholar] [CrossRef]
  54. Wei, J.; Wang, L.; Liu, P.; Song, W. Spatiotemporal fusion of remote sensing images with structural sparsity and semi-coupled dictionary learning. Remote Sens. 2016, 9, 21. [Google Scholar] [CrossRef]
  55. Peng, Y.; Li, W.; Luo, X.; Du, J.; Zhang, X.; Gan, Y.; Gao, X. Spatiotemporal reflectance fusion via tensor sparse representation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5608318. [Google Scholar]
  56. Li, A.; Bo, Y.; Zhu, Y.; Guo, P.; Bi, J.; He, Y. Blending multi-resolution satellite sea surface temperature (SST) products using Bayesian maximum entropy method. Remote Sens. Environ. 2013, 135, 52–63. [Google Scholar] [CrossRef]
  57. Huang, B.; Zhang, H.; Song, H.; Wang, J.; Song, C. Unified fusion of remote-sensing imagery: Generating simultaneously high-resolution synthetic spatial–temporal–spectral earth observations. Remote Sens. Lett. 2013, 4, 561–569. [Google Scholar] [CrossRef]
  58. Liao, L.; Song, J.; Wang, J.; Xiao, Z.; Wang, J. Bayesian method for building frequent Landsat-like NDVI datasets by integrating MODIS and Landsat NDVI. Remote Sens. 2016, 8, 452. [Google Scholar] [CrossRef]
  59. Xue, J.; Leung, Y.; Fung, T. A Bayesian data fusion approach to spatio-temporal fusion of remotely sensed images. Remote Sens. 2017, 9, 1310. [Google Scholar] [CrossRef]
  60. Addesso, P.; Longo, M.; Restaino, R.; Vivone, G. Sequential Bayesian methods for resolution enhancement of TIR image sequences. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 8, 233–243. [Google Scholar] [CrossRef]
  61. Chen, Y.; Yang, Y.; Pan, X.; Meng, X.; Hu, J. Spatiotemporal fusion network for land surface temperature based on a conditional variational autoencoder. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002813. [Google Scholar] [CrossRef]
  62. Tan, Z.; Yue, P.; Di, L.; Tang, J. Deriving high spatiotemporal remote sensing images using deep convolutional network. Remote Sens. 2018, 10, 1066. [Google Scholar] [CrossRef]
  63. Liu, X.; Deng, C.; Chanussot, J.; Hong, D.; Zhao, B. StfNet: A two-stream convolutional neural network for spatiotemporal image fusion. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6552–6564. [Google Scholar] [CrossRef]
  64. Zheng, Y.; Song, H.; Sun, L.; Wu, Z.; Jeon, B. Spatiotemporal fusion of satellite images via very deep convolutional networks. Remote Sens. 2019, 11, 2701. [Google Scholar] [CrossRef]
  65. Yin, Z.; Wu, P.; Foody, G.M.; Wu, Y.; Liu, Z.; Du, Y.; Ling, F. Spatiotemporal fusion of land surface temperature based on a convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 1808–1822. [Google Scholar] [CrossRef]
  66. Wang, X.; Wang, X. Spatiotemporal fusion of remote sensing image based on deep learning. J. Sens. 2020, 2020, 8873079. [Google Scholar] [CrossRef]
  67. Li, Y.; Li, J.; He, L.; Chen, J.; Plaza, A. A new sensor bias-driven spatio-temporal fusion model based on convolutional neural networks. Sci. China Inf. Sci. 2020, 63, 140302. [Google Scholar] [CrossRef]
  68. Qin, P.; Huang, H.; Tang, H.; Wang, J.; Liu, C. MUSTFN: A spatiotemporal fusion method for multi-scale and multi-sensor remote sensing images based on a convolutional neural network. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103113. [Google Scholar] [CrossRef]
  69. Sun, W.; Li, J.; Jiang, M.; Yuan, Q. Supervised and self-supervised learning-based cascade spatiotemporal fusion framework and its application. ISPRS J. Photogramm. Remote Sens. 2023, 203, 19–36. [Google Scholar] [CrossRef]
  70. Meng, X.; Liu, Q.; Shao, F.; Li, S. Spatio–temporal–spectral collaborative learning for spatio–temporal fusion with land cover changes. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5704116. [Google Scholar] [CrossRef]
  71. Yu, S.; Deng, Y.; Li, Y.; Li, J.; Chen, J.; Zhang, S. An Unsupervised Model Based on Convolutional Neural Network for Fusing Landsat-8 and Sentinel-2 Data. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 9214–9217. [Google Scholar]
  72. Tan, Z.; Di, L.; Zhang, M.; Guo, L.; Gao, M. An enhanced deep convolutional model for spatiotemporal image fusion. Remote Sens. 2019, 11, 2898. [Google Scholar] [CrossRef]
  73. Chen, J.; Wang, L.; Feng, R.; Liu, P.; Han, W.; Chen, X. CycleGAN-STF: Spatiotemporal fusion via CycleGAN-based image generation. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5851–5865. [Google Scholar] [CrossRef]
  74. Zhang, H.; Song, Y.; Han, C.; Zhang, L. Remote sensing image spatiotemporal fusion using a generative adversarial network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4273–4286. [Google Scholar] [CrossRef]
  75. Ma, Y.; Wei, J.; Tang, W.; Tang, R. Explicit and stepwise models for spatiotemporal fusion of remote sensing images with deep neural networks. Int. J. Appl. Earth Obs. Geoinf. 2021, 105, 102611. [Google Scholar] [CrossRef]
  76. Tan, Z.; Gao, M.; Li, X.; Jiang, L. A flexible reference-insensitive spatiotemporal fusion model for remote sensing images using conditional generative adversarial network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601413. [Google Scholar] [CrossRef]
  77. Zhang, H.; Sun, Y.; Shi, W.; Guo, D.; Zheng, N. An object-based spatiotemporal fusion model for remote sensing images. Eur. J. Remote Sens. 2021, 54, 86–101. [Google Scholar] [CrossRef]
  78. Song, B.; Liu, P.; Li, J.; Wang, L.; Zhang, L.; He, G.; Chen, L.; Liu, J. MLFF-GAN: A multilevel feature fusion with GAN for spatiotemporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4410816. [Google Scholar] [CrossRef]
  79. Tan, Z.; Gao, M.; Yuan, J.; Jiang, L.; Duan, H. A robust model for MODIS and Landsat image fusion considering input noise. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5407217. [Google Scholar] [CrossRef]
  80. Pan, X.; Deng, M.; Ao, Z.; Xin, Q. An Adaptive Multiscale Generative Adversarial Network for the Spatiotemporal Fusion of Landsat and MODIS Data. Remote Sens. 2023, 15, 5128. [Google Scholar] [CrossRef]
  81. Huang, H.; He, W.; Zhang, H.; Xia, Y.; Zhang, L. STFDiff: Remote sensing image spatiotemporal fusion with diffusion models. Inf. Fusion 2024, 111, 102505. [Google Scholar] [CrossRef]
  82. Chen, Y.; Yang, Y.; Pan, X.; Hu, P. CGMFN: Conditional Generative Model Fusion Network for Land Surface Temperature Generation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5004813. [Google Scholar] [CrossRef]
  83. Li, W.; Cao, D.; Peng, Y.; Yang, C. MSNet: A multi-stream fusion network for remote sensing spatiotemporal fusion based on transformer and convolution. Remote Sens. 2021, 13, 3724. [Google Scholar] [CrossRef]
  84. Yang, G.; Qian, Y.; Liu, H.; Tang, B.; Qi, R.; Lu, Y.; Geng, J. MSFusion: Multistage for remote sensing image spatiotemporal fusion based on texture transformer and convolutional neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4653–4666. [Google Scholar] [CrossRef]
  85. Chen, G.; Jiao, P.; Hu, Q.; Xiao, L.; Ye, Z. SwinSTFM: Remote sensing spatiotemporal fusion using Swin transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5410618. [Google Scholar] [CrossRef]
  86. Li, W.; Cao, D.; Xiang, M. Enhanced multi-stream remote sensing spatiotemporal fusion network based on transformer and dilated convolution. Remote Sens. 2022, 14, 4544. [Google Scholar] [CrossRef]
  87. Jiang, M.; Shao, H. A CNN-Transformer combined Remote Sensing Imagery Spatiotemporal Fusion Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13995–14009. [Google Scholar] [CrossRef]
  88. Benzenati, T.; Kallel, A.; Kessentini, Y. STF-Trans: A two-stream spatiotemporal fusion transformer for very high resolution satellites images. Neurocomputing 2024, 563, 126868. [Google Scholar] [CrossRef]
  89. Ma, Z.; Bao, W.; Feng, W.; Zhang, X.; Ma, X.; Qu, K. SFT-GAN: Sparse Fast Transformer Fusion Method Based on GAN for Remote Sensing Spatiotemporal Fusion. Remote Sens. 2025, 17, 2315. [Google Scholar]
  90. Hu, P.; Pan, X.; Yang, Y.; Dai, Y.; Chen, Y. A Two-Stage Hierarchical Spatiotemporal Fusion Network for Land Surface Temperature with Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5002320. [Google Scholar] [CrossRef]
  91. Yang, Z.; Diao, C.; Li, B. A robust hybrid deep learning model for spatiotemporal image fusion. Remote Sens. 2021, 13, 5005. [Google Scholar]
  92. Zhan, W.; Luo, F.; Luo, H.; Li, J.; Wu, Y.; Yin, Z.; Wu, Y.; Wu, P. Time-Series-Based Spatiotemporal Fusion Network for Improving Crop Type Mapping. Remote Sens. 2024, 16, 235. [Google Scholar] [CrossRef]
  93. Prata, A.; Caselles, V.; Coll, C.; Sobrino, J.; Ottle, C. Thermal remote sensing of land surface temperature from satellites: Current status and future prospects. Remote Sens. Rev. 1995, 12, 175–224. [Google Scholar] [CrossRef]
  94. Wu, P.; Yin, Z.; Zeng, C.; Duan, S.B.; Göttsche, F.M.; Ma, X.; Li, X.; Yang, H.; Shen, H. Spatially continuous and high-resolution land surface temperature product generation: A review of reconstruction and spatiotemporal fusion techniques. IEEE Geosci. Remote Sens. Mag. 2021, 9, 112–137. [Google Scholar] [CrossRef]
  95. Yoo, C.; Im, J.; Park, S.; Cho, D. Spatial downscaling of MODIS land surface temperature: Recent research trends, challenges, and future directions. Korean J. Remote Sens. 2020, 36, 609–626. [Google Scholar]
  96. Ran, L.; Mengmeng, W.; Zhengjia, Z.; Tian, H.; Xiuguo, L. A review of spatiotemporal fusion methods for remotely sensed land surface temperature. Natl. Remote Sens. Bull. 2024, 26, 2433–2450. [Google Scholar] [CrossRef]
  97. Li, J.; Li, Y.; He, L.; Chen, J.; Plaza, A. Spatio-temporal fusion for remote sensing data: An overview and new benchmark. Sci. China Inf. Sci. 2020, 63, 140301. [Google Scholar] [CrossRef]
  98. Ferchichi, A.; Abbes, A.B.; Barra, V.; Farah, I.R. Forecasting vegetation indices from spatio-temporal remotely sensed data using deep learning-based approaches: A systematic literature review. Ecol. Inform. 2022, 68, 101552. [Google Scholar]
  99. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58. [Google Scholar] [CrossRef]
  100. Wang, Q.; Tang, Y.; Ge, Y.; Xie, H.; Tong, X.; Atkinson, P.M. A comprehensive review of spatial-temporal-spectral information reconstruction techniques. Sci. Remote Sens. 2023, 8, 100102. [Google Scholar] [CrossRef]
  101. Xiao, J.; Aggarwal, A.K.; Duc, N.H.; Arya, A.; Rage, U.K.; Avtar, R. A review of remote sensing image spatiotemporal fusion: Challenges, applications and recent trends. Remote Sens. Appl. Soc. Environ. 2023, 32, 101005. [Google Scholar] [CrossRef]
  102. Chen, G.; Lu, H.; Zou, W.; Li, L.; Emam, M.; Chen, X.; Jing, W.; Wang, J.; Li, C. Spatiotemporal fusion for spectral remote sensing: A statistical analysis and review. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 259–273. [Google Scholar] [CrossRef]
  103. Cui, J.; Li, J.; Gu, X.; Zhang, W.; Wang, D.; Sun, X.; Zhan, Y.; Yang, J.; Liu, Y.; Yang, X. Comprehensive Analysis of Temporal–Spatial Fusion from 1991 to 2023 Using Bibliometric Tools. Atmosphere 2024, 15, 598. [Google Scholar] [CrossRef]
  104. Anand, S.; Sharma, R. Pansharpening and spatiotemporal image fusion method for remote sensing. Eng. Res. Express 2024, 6, 022201. [Google Scholar] [CrossRef]
  105. Swain, R.; Paul, A.; Behera, M.D. Spatio-temporal fusion methods for spectral remote sensing: A comprehensive technical review and comparative analysis. Trop. Ecol. 2024, 65, 356–375. [Google Scholar] [CrossRef]
  106. Lian, Z.; Zhan, Y.; Zhang, W.; Wang, Z.; Liu, W.; Huang, X. Recent Advances in Deep Learning-Based Spatiotemporal Fusion Methods for Remote Sensing Images. Sensors 2025, 25, 1093. [Google Scholar] [CrossRef] [PubMed]
  107. Sun, E.; Cui, Y.; Liu, P.; Yan, J. A decade of deep learning for remote sensing spatiotemporal fusion: Advances, challenges, and opportunities. arXiv 2025, arXiv:2504.00901. [Google Scholar] [CrossRef]
  108. Norman, J.M.; Becker, F. Terminology in thermal infrared remote sensing of natural surfaces. Agric. For. Meteorol. 1995, 77, 153–166. [Google Scholar] [CrossRef]
  109. Dash, P.; Göttsche, F.M.; Olesen, F.S.; Fischer, H. Land surface temperature and emissivity estimation from passive sensor data: Theory and practice-current trends. Int. J. Remote Sens. 2002, 23, 2563–2594. [Google Scholar] [CrossRef]
  110. Li, Z.L.; Wu, H.; Duan, S.B.; Zhao, W.; Ren, H.; Liu, X.; Leng, P.; Tang, R.; Ye, X.; Zhu, J.; et al. Satellite remote sensing of global land surface temperature: Definition, methods, products, and applications. Rev. Geophys. 2023, 61, e2022RG000777. [Google Scholar] [CrossRef]
  111. Becker, F.; Li, Z.L. Surface temperature and emissivity at various scales: Definition, measurement and related problems. Remote Sens. Rev. 1995, 12, 225–253. [Google Scholar] [CrossRef]
  112. Duan, S.B.; Li, Z.L.; Cheng, J.; Leng, P. Cross-satellite comparison of operational land surface temperature products derived from MODIS and ASTER data over bare soil surfaces. ISPRS J. Photogramm. Remote Sens. 2017, 126, 1–10. [Google Scholar] [CrossRef]
  113. Jimenez-Munoz, J.C.; Sobrino, J.A.; Skoković, D.; Mattar, C.; Cristobal, J. Land surface temperature retrieval methods from Landsat-8 thermal infrared sensor data. IEEE Geosci. Remote Sens. Lett. 2014, 11, 1840–1843. [Google Scholar] [CrossRef]
  114. Rozenstein, O.; Qin, Z.; Derimian, Y.; Karnieli, A. Derivation of land surface temperature for Landsat-8 TIRS using a split window algorithm. Sensors 2014, 14, 5768–5780. [Google Scholar] [CrossRef]
  115. Wang, F.; Qin, Z.; Song, C.; Tu, L.; Karnieli, A.; Zhao, S. An improved mono-window algorithm for land surface temperature retrieval from Landsat 8 thermal infrared sensor data. Remote Sens. 2015, 7, 4268–4289. [Google Scholar] [CrossRef]
  116. Gómez, C.; White, J.C.; Wulder, M.A. Optical remotely sensed time series data for land cover classification: A review. ISPRS J. Photogramm. Remote Sens. 2016, 116, 55–72. [Google Scholar] [CrossRef]
  117. Chraibi, E.; De Boissieu, F.; Barbier, N.; Luque, S.; Féret, J.B. Stability in time and consistency between atmospheric corrections: Assessing the reliability of Sentinel-2 products for biodiversity monitoring in tropical forests. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102884. [Google Scholar] [CrossRef]
  118. Tran, D.X.; Pla, F.; Latorre-Carmona, P.; Myint, S.W.; Caetano, M.; Kieu, H.V. Characterizing the relationship between land use land cover change and land surface temperature. ISPRS J. Photogramm. Remote Sens. 2017, 124, 119–132. [Google Scholar] [CrossRef]
  119. Wang, S.; Luo, Y.; Li, X.; Yang, K.; Liu, Q.; Luo, X.; Li, X. Downscaling land surface temperature based on non-linear geographically weighted regressive model over urban areas. Remote Sens. 2021, 13, 1580. [Google Scholar] [CrossRef]
  120. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  121. Tian, Y.; Su, D.; Lauria, S.; Liu, X. Recent advances on loss functions in deep learning for computer vision. Neurocomputing 2022, 497, 129–158. [Google Scholar] [CrossRef]
  122. Huber, P.J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 492–518. [Google Scholar]
  123. Huang, S.; Tang, L.; Hupy, J.P.; Wang, Y.; Shao, G. A commentary review on the use of normalized difference vegetation index (NDVI) in the era of popular remote sensing. J. For. Res. 2021, 32, 1–6. [Google Scholar] [CrossRef]
  124. Zha, Y.; Gao, J.; Ni, S. Use of normalized difference built-up index in automatically mapping urban areas from TM imagery. Int. J. Remote Sens. 2003, 24, 583–594. [Google Scholar] [CrossRef]
  125. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  126. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57. [Google Scholar] [CrossRef]
  127. Khare, N.; Thakur, P.S.; Khanna, P.; Ojha, A. Analysis of loss functions for image reconstruction using convolutional autoencoder. In Proceedings of the International Conference on Computer Vision and Image Processing, Ropar, India, 3–5 December 2021; Springer: Cham, Switzerland, 2021; pp. 338–349. [Google Scholar]
  128. Wu, B.; Duan, H.; Liu, Z.; Sun, G. SRPGAN: Perceptual generative adversarial network for single image super resolution. arXiv 2017, arXiv:1712.05927. [Google Scholar] [CrossRef]
  129. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  130. Korhonen, J.; You, J. Peak signal-to-noise ratio revisited: Is simple beautiful? In Proceedings of the 2012 Fourth International Workshop on Quality of Multimedia Experience, Melbourne, VIC, Australia, 5–7 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 37–38. [Google Scholar]
  131. Kaneko, S.; Satoh, Y.; Igarashi, S. Using selective correlation coefficient for robust image registration. Pattern Recognit. 2003, 36, 1165–1173. [Google Scholar] [CrossRef]
  132. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the JPL, Summaries of the Third Annual JPL Airborne Geoscience Workshop, Pasadena, CA, USA, 1–5 June 1992; NASA-JPL: Pasadena, CA, USA, 1992; Volume 1. [Google Scholar]
  133. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  134. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84. [Google Scholar] [CrossRef]
  135. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  136. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  137. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  138. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Rumelhart, D.E., McClelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; Volume 1, pp. 318–362. [Google Scholar]
  139. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. In Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook; Springer: Cham, Switzerland, 2023. [Google Scholar]
  140. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  141. Ng, A. Sparse autoencoder. CS294A Lect. Notes 2011, 72, 1–19. [Google Scholar]
  142. Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 833–840. [Google Scholar]
  143. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
  144. An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18. [Google Scholar]
  145. Semeniuta, S.; Severyn, A.; Barth, E. A hybrid convolutional variational autoencoder for text generation. arXiv 2017, arXiv:1702.02390. [Google Scholar] [CrossRef]
  146. Nguyen, H.D.; Tran, K.P.; Thomassey, S.; Hamad, M. Forecasting and Anomaly Detection approaches using LSTM and LSTM Autoencoder techniques with the applications in supply chain management. Int. J. Inf. Manag. 2021, 57, 102282. [Google Scholar] [CrossRef]
  147. Sharma, P.; Kumar, M.; Sharma, H.K.; Biju, S.M. Generative adversarial networks (GANs): Introduction, Taxonomy, Variants, Limitations, and Applications. Multimed. Tools Appl. 2024, 83, 88811–88858. [Google Scholar] [CrossRef]
  148. Hong, Y.; Hwang, U.; Yoo, J.; Yoon, S. How generative adversarial networks and their variants work: An overview. ACM Comput. Surv. (CSUR) 2019, 52, 1–43. [Google Scholar] [CrossRef]
  149. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  150. Zhu, J.Y.; Krähenbühl, P.; Shechtman, E.; Efros, A.A. Generative visual manipulation on the natural image manifold. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14; Springer: Cham, Switzerland, 2016; pp. 597–613. [Google Scholar]
  151. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  152. Zhang, H.; Sindagi, V.; Patel, V.M. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3943–3956. [Google Scholar] [CrossRef]
  153. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  154. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  155. Denton, E.L.; Chintala, S.; Szlam, A.; Fergus, R. Deep generative image models using a laplacian pyramid of adversarial networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1486–1494. [Google Scholar]
  156. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  157. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Adv. Neural Inf. Process. Syst. 2016, 29, 2180–2188. [Google Scholar]
  158. Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; Courville, A. Adversarially learned inference. arXiv 2016, arXiv:1606.00704. [Google Scholar]
  159. Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv 2016, arXiv:1605.09782. [Google Scholar]
  160. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  161. Mescheder, L.; Nowozin, S.; Geiger, A. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2391–2400. [Google Scholar]
  162. Emelyanova, I.V.; McVicar, T.R.; Van Niel, T.G.; Li, L.T.; Van Dijk, A.I. Assessing the accuracy of blending Landsat–MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection. Remote Sens. Environ. 2013, 133, 193–209. [Google Scholar] [CrossRef]
  163. Xia, Y.; He, W.; Huang, Q.; Chen, H.; Huang, H.; Zhang, H. SOSSF: Landsat-8 image synthesis on the blending of Sentinel-1 and MODIS data. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401619. [Google Scholar] [CrossRef]
  164. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  165. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  166. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  167. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  168. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  169. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  170. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  171. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  172. Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2998–3008. [Google Scholar]
  173. Yuan, L.; Hou, Q.; Jiang, Z.; Feng, J.; Yan, S. Volo: Vision outlooker for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 6575–6586. [Google Scholar] [CrossRef]
  174. Medsker, L.R.; Jain, L.C. Recurrent Neural Networks: Design and Applications; CRC Press: Boca Raton, FL, USA, 2001. [Google Scholar]
  175. Das, S.; Tariq, A.; Santos, T.; Kantareddy, S.S.; Banerjee, I. Recurrent neural networks (RNNs): Architectures, training tricks, and introduction to influential research. In Machine Learning for Brain Disorders; Humana: New York, NY, USA, 2023; pp. 117–138. [Google Scholar]
  176. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  177. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef]
  178. Karevan, Z.; Suykens, J.A. Spatio-temporal stacked LSTM for temperature prediction in weather forecasting. arXiv 2018, arXiv:1811.06341. [Google Scholar] [CrossRef]
  179. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  180. Rani, V.; Nabi, S.T.; Kumar, M.; Mittal, A.; Kumar, K. Self-supervised learning: A succinct review. Arch. Comput. Methods Eng. 2023, 30, 2761–2775. [Google Scholar] [CrossRef]
  181. Laal, M.; Ghodsi, S.M. Benefits of collaborative learning. Procedia-Soc. Behav. Sci. 2012, 31, 486–490. [Google Scholar] [CrossRef]
  182. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  183. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  184. Huang, L.; Qin, J.; Zhou, Y.; Zhu, F.; Liu, L.; Shao, L. Normalization techniques in training dnns: Methodology, analysis and application. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10173–10196. [Google Scholar] [CrossRef] [PubMed]
  185. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar] [CrossRef]
  186. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  187. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
  188. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957. [Google Scholar] [CrossRef]
  189. Luo, P.; Ren, J.; Peng, Z.; Zhang, R.; Li, J. Differentiable learning-to-normalize via switchable normalization. arXiv 2018, arXiv:1806.10779. [Google Scholar]
  190. Salehin, I.; Kang, D.K. A review on dropout regularization approaches for deep neural networks within the scholarly domain. Electronics 2023, 12, 3106. [Google Scholar] [CrossRef]
  191. Ying, X. An overview of overfitting and its solutions. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2019; Volume 1168, p. 022022. [Google Scholar]
  192. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12299–12310. [Google Scholar]
  193. Puls, E.d.S.; Todescato, M.V.; Carbonera, J.L. An evaluation of pre-trained models for feature extraction in image classification. arXiv 2023, arXiv:2310.02037. [Google Scholar] [CrossRef]
  194. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 9. [Google Scholar] [CrossRef]
  195. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 2017, 202, 18–27. [Google Scholar] [CrossRef]
  196. Duan, S.B.; Li, Z.L.; Li, H.; Göttsche, F.M.; Wu, H.; Zhao, W.; Leng, P.; Zhang, X.; Coll, C. Validation of Collection 6 MODIS land surface temperature product using in situ measurements. Remote Sens. Environ. 2019, 225, 16–29. [Google Scholar] [CrossRef]
  197. Wan, Z. MODIS Land-Surface Temperature Algorithm Theoretical Basis Document (LST ATBD); Institute for Computational Earth System Science, University of California: Santa Barbara, CA, USA, 1999. [Google Scholar]
  198. Ermida, S.L.; Soares, P.; Mantas, V.; Göttsche, F.M.; Trigo, I.F. Google earth engine open-source code for land surface temperature estimation from the landsat series. Remote Sens. 2020, 12, 1471. [Google Scholar] [CrossRef]
  199. Krishnan, P.; Meyers, T.P.; Hook, S.J.; Heuer, M.; Senn, D.; Dumas, E.J. Intercomparison of in situ sensors for ground-based land surface temperature measurements. Sensors 2020, 20, 5268. [Google Scholar] [CrossRef]
  200. Shandas, V.; Makido, Y.; Upraity, A.N. Evaluating Differences between Ground-Based and Satellite-Derived Measurements of Urban Heat: The Role of Land Cover Classes in Portland, Oregon and Washington, DC. Land 2023, 12, 562. [Google Scholar] [CrossRef]
  201. Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212. [Google Scholar] [CrossRef]
  202. Mo, Y.; Xu, Y.; Chen, H.; Zhu, S. A review of reconstructing remotely sensed land surface temperature under cloudy conditions. Remote Sens. 2021, 13, 2838. [Google Scholar] [CrossRef]
  203. Li, H.; Wu, X.j.; Durrani, T.S. Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Phys. Technol. 2019, 102, 103039. [Google Scholar] [CrossRef]
  204. Zhang, D.; Ren, K.; Zhou, J.; Gu, G.; Chen, Q. An infrared and visible image fusion method based on deep learning. In Proceedings of the 4th Optics Young Scientist Summit (OYSS 2020), Ningbo, China, 4–7 December 2020; SPIE: Bellingham, WA, USA, 2021; Volume 11781, pp. 64–70. [Google Scholar]
  205. Li, H.; Wu, X.J.; Kittler, J. Infrared and visible image fusion using a deep learning framework. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2705–2710. [Google Scholar]
  206. Ren, X.; Meng, F.; Hu, T.; Liu, Z.; Wang, C. Infrared-visible image fusion based on convolutional neural networks (CNN). In Proceedings of the Intelligence Science and Big Data Engineering: 8th International Conference, IScIDE 2018, Lanzhou, China, 18–19 August 2018; Revised Selected Papers 8; Springer: Cham, Switzerland, 2018; pp. 301–307. [Google Scholar]
  207. Feng, Y.; Lu, H.; Bai, J.; Cao, L.; Yin, H. Fully convolutional network-based infrared and visible image fusion. Multimed. Tools Appl. 2020, 79, 15001–15014. [Google Scholar] [CrossRef]
  208. Li, Y.; Zhao, J.; Lv, Z.; Li, J. Medical image fusion method by deep learning. Int. J. Cogn. Comput. Eng. 2021, 2, 21–29. [Google Scholar] [CrossRef]
  209. Azam, M.A.; Khan, K.B.; Salahuddin, S.; Rehman, E.; Khan, S.A.; Khan, M.A.; Kadry, S.; Gandomi, A.H. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Comput. Biol. Med. 2022, 144, 105253. [Google Scholar] [CrossRef] [PubMed]
  210. Sánchez, J.M.; Galve, J.M.; González-Piqueras, J.; López-Urrea, R.; Niclòs, R.; Calera, A. Monitoring 10-m LST from the Combination MODIS/Sentinel-2, validation in a high contrast semi-arid agroecosystem. Remote Sens. 2020, 12, 1453. [Google Scholar] [CrossRef]
  211. Abunnasr, Y.; Mhawej, M. Towards a combined Landsat-8 and Sentinel-2 for 10-m land surface temperature products: The Google Earth Engine monthly Ten-ST-GEE system. Environ. Model. Softw. 2022, 155, 105456. [Google Scholar] [CrossRef]
  212. Li, X.; Wen, C.; Hu, Y.; Yuan, Z.; Zhu, X.X. Vision-language models in remote sensing: Current progress and future trends. IEEE Geosci. Remote Sens. Mag. 2024, 12, 32–66. [Google Scholar] [CrossRef]
  213. Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef] [PubMed]
  214. Xu, F.F.; Alon, U.; Neubig, G.; Hellendoorn, V.J. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022; pp. 1–10. [Google Scholar]
  215. Zhao, Z.; Deng, L.; Bai, H.; Cui, Y.; Zhang, Z.; Zhang, Y.; Qin, H.; Chen, D.; Zhang, J.; Wang, P.; et al. Image Fusion via Vision-Language Model. arXiv 2024, arXiv:2402.02235. [Google Scholar] [CrossRef]
Figure 1. Yearly literature count related to spatio-temporal fusion (STF) for land surface temperature (LST) indexed by Google Scholar since 2015. The search query also covered synonyms of STF, including data fusion and image fusion.
Figure 2. Satellite-derived LST, inspired by [110]. T_{s_i}, ε_i, and a_i represent the surface temperature, emissivity, and projected area weight of the i-th visible component, respectively; θ_v is the view zenith angle and φ_v is the viewing azimuth angle.
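To make the aggregation in Figure 2 explicit, one simplified formalization, consistent with the ensemble definitions of [111] but neglecting atmospheric and reflected downwelling terms (an assumption of this sketch), relates the component temperatures to a single pixel-level radiometric temperature T_r through the Planck function B_λ:

```latex
% Ensemble emissivity and radiometric temperature of a mixed pixel,
% assuming \sum_i a_i = 1 and neglecting atmospheric effects:
\varepsilon(\theta_v,\varphi_v) = \sum_i a_i\,\varepsilon_i, \qquad
\varepsilon(\theta_v,\varphi_v)\, B_\lambda(T_r) = \sum_i a_i\,\varepsilon_i\, B_\lambda(T_{s_i})
\;\;\Longrightarrow\;\;
T_r = B_\lambda^{-1}\!\left(\frac{\sum_i a_i\,\varepsilon_i\, B_\lambda(T_{s_i})}{\sum_i a_i\,\varepsilon_i}\right)
```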
Figure 3. Graphic representation of STF for LST estimation. X₁ denotes data from the MODIS Terra satellite, and X₂ refers to data from the Landsat 8 satellite. The region of interest (ROI), s, corresponds to Orléans Métropole in France. The time steps t₁, t₂, and t₃ represent 6 March, 22 March, and 9 May 2022, respectively.
Figure 4. Proposed taxonomy of DL-based STF methods, organized by architecture, learning paradigm, training strategy, and incorporation of pre-trained models. Methods originally developed for related tasks (e.g., SR, NDVI) are included to highlight transferable design patterns relevant to LST estimation.
Figure 5. Typical architecture of a CNN-based STF method using a single pair of images, P₁, composed of four main blocks: spatial feature extraction, temporal variation extraction, fusion of spatial and temporal representations, and satellite image reconstruction. The fusion rule can be element-wise addition, multiplication, concatenation, attention, etc.
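As a concrete illustration of this four-block layout, the following is a minimal PyTorch sketch of a single-pair CNN fusion network. Layer widths, kernel sizes, and the element-wise-addition fusion rule are illustrative assumptions, not the configuration of any specific published model; coarse inputs are assumed to be pre-resampled to the fine grid.

```python
import torch
import torch.nn as nn

class SimpleCNNSTF(nn.Module):
    """Minimal single-pair CNN STF sketch: spatial features from the fine
    image at t1, temporal variation from the coarse images at t1 and t2,
    element-wise fusion, then reconstruction of the fine image at t2."""
    def __init__(self, ch=32):
        super().__init__()
        # Spatial feature extraction from the high-resolution image X2(s, t1)
        self.spatial = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        # Temporal variation extraction from the coarse difference X1(t2) - X1(t1)
        self.temporal = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        # Reconstruction of the fused fine-resolution image X2(s, t2)
        self.reconstruct = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, x2_t1, x1_t1, x1_t2):
        # Coarse inputs are assumed already resampled to the fine grid
        spatial = self.spatial(x2_t1)
        temporal = self.temporal(x1_t2 - x1_t1)
        fused = spatial + temporal  # fusion rule: element-wise addition
        return self.reconstruct(fused)

# Toy usage: single-band 128x128 LST patches (batch of 2)
model = SimpleCNNSTF()
x2_t1, x1_t1, x1_t2 = (torch.randn(2, 1, 128, 128) for _ in range(3))
pred_x2_t2 = model(x2_t1, x1_t1, x1_t2)
print(pred_x2_t2.shape)  # torch.Size([2, 1, 128, 128])
```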
Figure 6. Typical architecture of an AE-based STF method using a single pair of LST data, P₁. It consists of two encoders, a spatial feature encoder and a temporal variation encoder, along with a decoder. Fusion occurs between the latent representations produced by the two encoders. The fusion rule can be element-wise addition, multiplication, concatenation, attention, etc.
Figure 7. Standard GAN architecture. The generator G creates synthetic data from random noise z, while the discriminator D evaluates inputs to distinguish real samples x from generated outputs. Both networks are trained adversarially using the loss functions L(D) and L(G), with the generator aiming to fool the discriminator and the discriminator attempting to correctly classify real versus fake data.
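The two adversarial objectives in Figure 7 can be written compactly in code. The sketch below uses the common non-saturating variant of the original formulation [129], with a discriminator that outputs logits; the toy discriminator and tensor shapes are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def gan_losses(discriminator, real, fake):
    """One training step's losses: D learns to score real samples as 1 and
    generated samples as 0; G learns to make D score its outputs as 1."""
    # Discriminator loss L(D)
    d_real = discriminator(real)
    d_fake = discriminator(fake.detach())  # stop gradients flowing into G
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # Generator loss L(G): fool D into classifying generated data as real
    d_fake_for_g = discriminator(fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_for_g,
                                                torch.ones_like(d_fake_for_g))
    return loss_d, loss_g

# Toy wiring: D maps a 64x64 single-band image to one logit
D = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 1))
real, fake = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
loss_d, loss_g = gan_losses(D, real, fake)
print(loss_d.item(), loss_g.item())
```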
Figure 8. Typical architecture of a cGAN-based STF method. It consists of a generator that creates high-resolution fused satellite images using input pairs and a discriminator that distinguishes between real and generated fine-resolution images. The generator includes feature extraction, fusion, and image reconstruction stages, while the discriminator evaluates the fused satellite image consistency.
Figure 9. Basic architecture of a ViT [166]. The input image is first divided into fixed-size non-overlapping patches, which are flattened and mapped to patch embeddings through a linear projection. Positional encodings are added to preserve spatial order, and the resulting sequence is fed into a transformer encoder composed of stacked multi-head self-attention and feed-forward layers. The final encoded representation is processed by an MLP head to produce the output.
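A minimal PyTorch sketch of this patch-embedding pipeline follows; the patch size, embedding width, and depth are illustrative, and the class token, dropout, and MLP head of a full ViT are omitted.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 1, 224, 224)        # single-band LST image
patch, dim = 16, 256
n_patches = (224 // patch) ** 2          # 196 non-overlapping patches

# Split the image into fixed-size patches and flatten each one
to_patches = nn.Unfold(kernel_size=patch, stride=patch)
tokens = to_patches(img).transpose(1, 2)           # (1, 196, 16*16)

embed = nn.Linear(patch * patch, dim)              # linear projection
pos = nn.Parameter(torch.zeros(1, n_patches, dim)) # learned positional encoding
x = embed(tokens) + pos

# Stacked multi-head self-attention + feed-forward layers
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
out = encoder(x)
print(out.shape)  # torch.Size([1, 196, 256]) encoded patch tokens
```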
Figure 10. Typical architecture of a ViT-based STF method. It consists of a temporal transformer encoder for low-resolution inputs, a spatial feature transformer encoder for high-resolution input, and a transformer decoder for generating high-resolution output. The method processes LST images as patch sequences with positional embeddings.
Figure 11. Geographic overview of the ROI. (a1) Position of the ROI within France. (a2) High-resolution satellite image of the ROI. (b1–b5) Representative examples of key land cover categories observed in the ROI: urban, water, forest, industrial, and agricultural zones.
Figure 12. Qualitative comparison of DL-based STF methods (STTFN, EDCSTFN, and ESTARFM) over the ROI and seven representative zoomed-in subregions on 13 June 2022: (a) Orléans Métropole, (b) Orléans city center, (c) Orléans forest, (d) a semi-urban corridor along the Loire and Loiret rivers, (e) a major industrial area, (f) a mixed residential and vegetated neighborhood intersected by the Loiret river, and (g) croplands.
Table 1. Comparative analysis of existing STF surveys in remote sensing. Criteria include: (1) Publication year, (2) Application scope (surface reflectance (SR), normalized difference vegetation index (NDVI), LST), (3) Adaptation of SR methods for LST, (4) Depth of DL coverage, (5) Discussion of deep learning (DL) limitations and open challenges, (6) Experimental evaluation, (7) Introduction of a benchmark dataset. A checkmark indicates substantive coverage.
| Survey | Year | Application Scope | Adaptation to LST | Deep Learning | Open Challenges | Experimental Evaluation | New Dataset |
|---|---|---|---|---|---|---|---|
| [36] | 2018 | SR | | | | | |
| [28] | 2019 | SR | | | | | |
| [97] | 2020 | SR | | | | | |
| [95] | 2020 | LST | | | | | |
| [66] | 2020 | SR | | | | | |
| [94] | 2021 | LST | | | | | |
| [98] | 2022 | NDVI | | | | | |
| [99] | 2023 | SR | | | | | |
| [100] | 2023 | SR | | | | | |
| [101] | 2023 | SR | | | | | |
| [102] | 2023 | SR | | | | | |
| [103] | 2024 | SR | | | | | |
| [104] | 2024 | SR | | | | | |
| [96] | 2024 | LST | | | | | |
| [105] | 2024 | SR | | | | | |
| [106] | 2025 | SR | | | | | |
| [107] | 2025 | SR | | | | | |
| Ours | 2025 | LST | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 2. Comparison of common satellite-based LST products with their thermal sensors, spatial and temporal resolutions, and temporal extents.
| Satellite | Thermal Sensor | Spatial Resolution | Temporal Resolution | Temporal Extent |
|---|---|---|---|---|
| GF-5 | VIMS | 40 m | 7 days | 9 May 2018–Present |
| Landsat 9 | TIRS-2 | 100 m, resampled to 30 m | 16 days | 27 September 2021–Present |
| Landsat 8 | TIRS | 100 m, resampled to 30 m | 16 days | 11 February 2013–Present |
| Landsat 7 | ETM+ | 100 m, resampled to 30 m | 16 days | 15 April 1999–Present (Partially) |
| Landsat 5 | TM | 120 m, resampled to 30 m | 16 days | 1 May 1984–5 June 2013 |
| Terra | ASTER | 90 m | 16 days | 18 December 1999–Present |
| Aqua | MODIS | 1 km | 1 day | 4 May 2002–Present |
| Terra | MODIS | 1 km | 1 day | 18 December 1999–Present |
| Sentinel-3A | SLSTR | 1 km | 1 day | 16 February 2016–Present |
| FY-3D | MERSI-2 | 375 m | 12 h | 15 November 2017–Present |
| SNPP | VIIRS | 375 m | 12 h | 28 November 2011–Present |
| GOES-8 | Imager | 4 km | 30 min | 16 September 1994–4 June 2009 |
| FY-2F | VISSR | 5 km | 1 h | 13 January 2012–Present |
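When preparing such products for fusion, the quantized digital numbers must first be rescaled to physical temperatures. A minimal sketch for two of the products above, using the published scale factors for Landsat Collection 2 Level-2 (band ST_B10) and MODIS MOD11A1 (band LST_Day_1km); verify the factors against the user guide of the exact collection in use:

```python
import numpy as np

def landsat_c2l2_st_to_kelvin(dn):
    # Landsat Collection 2 Level-2 surface temperature: K = DN * 0.00341802 + 149.0
    return dn * 0.00341802 + 149.0

def modis_mod11a1_to_kelvin(dn):
    # MODIS MOD11A1 LST: K = DN * 0.02
    return dn * 0.02

landsat_dn = np.array([44000, 45500])
print(landsat_c2l2_st_to_kelvin(landsat_dn) - 273.15)        # degrees Celsius
print(modis_mod11a1_to_kelvin(np.array([14800])) - 273.15)   # degrees Celsius
```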
Table 3. Notations used in STF for LST estimation problem formulation.
| Notation | Significance |
|---|---|
| X₁ | Satellite providing LST data with LSHT (low spatial, high temporal resolution) |
| X₂ | Satellite providing LST data with HSLT (high spatial, low temporal resolution) |
| tᵢ | Temporal time steps |
| s | Region of interest |
| Xᵢ(s, tᵢ) | LST data at time tᵢ for the region s |
| X̂₂(s, tᵢ) | Predicted HSHT LST data at time tᵢ for the region s |
| Pᵢ | A pair of X₁ and X₂ for a specific ROI s at time tᵢ |
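As a shape-level illustration of this notation (the grid sizes and the coarse/fine pairing below are assumptions of the example, not fixed by the formulation):

```python
import numpy as np

H_c, W_c = 32, 32        # coarse LSHT grid over the ROI s (MODIS-like)
H_f, W_f = 1024, 1024    # fine HSLT grid over the same ROI (Landsat-like)

x1_t1 = np.zeros((H_c, W_c))   # X1(s, t1): coarse image at the reference date
x1_t2 = np.zeros((H_c, W_c))   # X1(s, t2): coarse image at the prediction date
x2_t1 = np.zeros((H_f, W_f))   # X2(s, t1): fine image at the reference date

# The STF task: estimate the missing fine image at t2,
# X̂2(s, t2) = f(X2(s, t1), X1(s, t1), X1(s, t2))
```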
Table 4. A comparative overview of CNN-based DL methods for STF. For each method, the table lists the satellite sensors used (Satellites), whether the method is evaluated on LST data (LST), the number of temporal pairs required for training (Pairs), the loss functions employed (Loss Functions), the evaluation metrics reported (Evaluation metrics), the type of datasets used (Datasets), and the availability of implementation code (Code).
| Method | Satellites | LST | Pairs | Loss Functions | Evaluation Metrics | Datasets | Code |
|---|---|---|---|---|---|---|---|
| [37] | MODIS, Landsat | | 2 | Content | RMSE, ERGAS, SAM, SSIM | CIA, LGC | |
| [62] | MODIS, Landsat | | 1 | Content | RMSE, R², SSIM | Hand-crafted | https://github.com/theonegis/rs-data-fusion (accessed on 12 January 2026) |
| [63] | MODIS, Landsat | | 2 | Content | RMSE, CC, SSIM | Hand-crafted | |
| [64] | MODIS, Landsat | | 2 | Content | RMSE, SAM, SSIM, ERGAS | CIA, LGC | |
| [65] | MODIS, Landsat | | 2 | Content | RMSE, SSIM | Hand-crafted | |
| [66] | MODIS, Landsat | | 2 | Content | RMSE, CC, UIQI | Hand-crafted | |
| [67] | MODIS, Landsat | | 2 | Content | RMSE, CC, ERGAS, SSIM, SAM | CIA, LGC | |
| [68] | MODIS, Landsat | | 2 | Content, Vision | RMSE, R², MAE, rMAE, MAEC | Hand-crafted | https://github.com/qpyeah/MUSTFN (accessed on 12 January 2026) |
| [70] | MODIS, Landsat | | 2 | Content, Vision | RMSE, SAM, SSIM | CIA, LGC | |
| [69] | MODIS, Landsat; Landsat, Sentinel-2 | | n | Content, Adversarial | RMSE, SSIM, ERGAS, PSNR, SAM | Hand-crafted | |
| [71] | Landsat, Sentinel-2 | | 1 | Content | RMSE, CC, SSIM | Hand-crafted | |
Table 5. A comparative overview of AE-based DL methods for STF. For each method, the table lists the satellite sensors used (Satellites), whether the method is evaluated on LST data (LST), the number of temporal pairs required for training (Pairs), the loss functions employed (Loss Functions), the evaluation metrics reported (Evaluation metrics), the type of datasets used (Datasets), and the availability of implementation code (Code).
| Method | Satellites | LST | Pairs | Loss Functions | Evaluation Metrics | Datasets | Code |
|---|---|---|---|---|---|---|---|
| [72] | MODIS, Landsat | | 2 | Content, Feature, Vision | RMSE, SAM, ERGAS | Hand-crafted | https://github.com/theonegis/edcstfn (accessed on 12 January 2026) |
| [61] | MODIS, FY-4A | | 2 | Content | RMSE, SSIM, LPIPS | Hand-crafted | |
Table 6. A comparative overview of GAN-based DL methods for STF. For each method, the table lists the satellite sensors used (Satellites), whether the method is evaluated on LST data (LST), the number of temporal pairs required for training (Pairs), the loss functions employed (Loss Functions), the evaluation metrics reported (Evaluation metrics), the type of datasets used (Datasets), and the availability of implementation code (Code).
| Method | Satellites | LST | Pairs | Loss Functions | Evaluation Metrics | Datasets | Code |
|---|---|---|---|---|---|---|---|
| [73] | MODIS, Landsat | | 2 | Adversarial | RMSE, CC, SSIM, SAM, ERGAS | Hand-crafted | |
| [74] | MODIS, Landsat | | 2 | Content, Adversarial | MAE, RMSE, SSIM, SAM, ERGAS, time | CIA, LGC, Shenzhen | |
| [75] | MODIS, Landsat | | 1 | Content, Vision, Spectral, Adversarial | RMSE, SSIM, SAM, Q4 | CIA, LGC | |
| [76] | MODIS, Landsat | | 1/2 | Content, Vision, Feature, Adversarial | MAE, RMSE, SAM, SSIM | CIA, LGC | https://github.com/theonegis/ganstfm (accessed on 12 January 2026) |
| [77] | MODIS, Landsat | | 1 | Adversarial | MSE, SSIM, CC, UIQI, ERGAS, SAM | CIA, LGC | |
| [78] | MODIS, Landsat | | 1 | Content, Vision, Spectral, Adversarial | MAE, RMSE, SAM, SSIM | CIA, LGC | https://github.com/songbingze/MLFF-GAN (accessed on 12 January 2026) |
| [79] | MODIS, Landsat | | 2 | Content, Vision, Spectral, Feature | RMSE, SAM, SSIM, ERGAS | CIA, LGC | https://github.com/theonegis/rsfn (accessed on 12 January 2026) |
| [80] | MODIS, Landsat | | 2 | Content, Vision, Spectral, Adversarial | RMSE, SSIM, PSNR, CC | CIA, LGC, Tianjin | https://github.com/xxsfish/AMS-STF.git (accessed on 12 January 2026) |
| [81] | MODIS, Landsat | | 2 | Content | RMSE, SSIM, CC, SAM, ERGAS | CIA, LGC, E-SMILE | https://github.com/prowDIY/STF (accessed on 12 January 2026) |
| [82] | MODIS, FY-4A | | 1 | Content, Adversarial | RMSE, SSIM, LPIPS | Hand-crafted | |
Table 7. A comparative overview of ViT-based DL methods for STF. For each method, the table lists the satellite sensors used (Satellites), whether the method is evaluated on LST data (LST), the number of temporal pairs required for training (Pairs), the loss functions employed (Loss Functions), the evaluation metrics reported (Evaluation metrics), the type of datasets used (Datasets), and the availability of implementation code (Code).
| Method | Satellites | LST | Pairs | Loss Functions | Evaluation Metrics | Datasets | Code |
|---|---|---|---|---|---|---|---|
| [83] | MODIS, Landsat | | 2 | Content | RMSE, MSE, CC, SAM, SSIM, ERGAS, PSNR | CIA, LGC, AHB | |
| [84] | MODIS, Landsat | | 2 | Feature | RMSE, SSIM, ERGAS, SAM, CC | CIA, LGC, DX | |
| [85] | MODIS, Landsat | | 1 | Content, Vision | RMSE, CC, SAM, SSIM, UIQI | CIA, LGC | https://github.com/LouisChen0104/swinstfm.git (accessed on 12 January 2026) |
| [86] | MODIS, Landsat | | 1 | Content | RMSE, MSE, CC, SAM, SSIM, ERGAS, PSNR | CIA, LGC, AHB | |
| [87] | MODIS, Landsat | | 2 | Content | MAE, SAM, SSIM, PSNR | CIA, DX | |
| [88] | PlanetScope, Pléiades | | 1/2 | Content, Vision | RMSE, CC, SAM, SSIM, UIQI | Hand-crafted | |
| [89] | MODIS, Landsat | | 1 | Content, Vision, Spectral, Adversarial | RMSE, PSNR, ERGAS, SAM, SSIM, UIQI | CIA, LGC, AHB, Tianjin | https://github.com/MaZhaoX/SFT-GAN.git (accessed on 12 January 2026) |
| [90] | MODIS, Landsat | | 1 | Content, Vision | MAE, RMSE, PSNR, R² | Hand-crafted | https://github.com/HuPengHua2021/THSTNet.git (accessed on 12 January 2026) |
Table 8. A comparative overview of RNN-based DL methods for STF. For each method, the table lists the satellite sensors used (Satellites), whether the method is evaluated on LST data (LST), the number of temporal pairs required for training (Pairs), the loss functions employed (Loss Functions), the evaluation metrics reported (Evaluation metrics), the type of datasets used (Datasets), and the availability of implementation code (Code).
| Method | Satellites | LST | Pairs | Loss Functions | Evaluation Metrics | Datasets | Code |
|---|---|---|---|---|---|---|---|
| [91] | MODIS, Landsat | | 2 | Content | RMSE, ERGAS, SAM | Hand-crafted | |
| [70] | MODIS, Landsat | | 2 | Content, Vision | RMSE, SAM, SSIM | CIA, LGC | |
| [92] | MODIS, Sentinel | | n | Content | RMSE, SSIM | Hand-crafted | |
Table 10. Overview of pre-training strategies adopted in state-of-the-art STF methods. The table categorizes approaches into feature extraction and transfer learning. Corresponding methods employing each strategy are listed.
| Training Strategy | List of Methods |
|---|---|
| Feature Extraction | [72,75,76,78,79,80,84] |
| Transfer Learning | [61] |
Table 11. Paired MODIS/Terra and Landsat 8 observations used in this study. Each entry lists the sample number, acquisition date, and overpass times for MODIS and Landsat 8. Only scenes with acceptable cloud coverage were retained (<10% for MODIS, <20% for Landsat 8).
| Sample No. | Date | MODIS/Terra | Landsat 8 | Sample No. | Date | MODIS/Terra | Landsat 8 |
|---|---|---|---|---|---|---|---|
| 1 | 14 April 2013 | 11:54 | 10:43 | 27 | 21 October 2018 | 11:54 | 10:41 |
| 2 | 01 June 2013 | 11:08 | 10:43 | 28 | 26 February 2019 | 11:54 | 10:40 |
| 3 | 04 August 2013 | 11:54 | 10:43 | 29 | 02 June 2019 | 11:54 | 10:40 |
| 4 | 20 August 2013 | 11:54 | 10:43 | 30 | 04 July 2019 | 11:54 | 10:41 |
| 5 | 05 September 2013 | 11:54 | 10:42 | 31 | 06 September 2019 | 11:54 | 10:41 |
| 6 | 10 December 2013 | 11:54 | 10:42 | 32 | 01 April 2020 | 11:54 | 10:40 |
| 7 | 16 March 2014 | 11:54 | 10:41 | 33 | 19 May 2020 | 11:54 | 10:40 |
| 8 | 17 April 2014 | 11:54 | 10:41 | 34 | 22 July 2020 | 11:54 | 10:41 |
| 9 | 19 May 2014 | 11:54 | 10:40 | 35 | 07 August 2020 | 11:54 | 10:41 |
| 10 | 08 September 2014 | 11:54 | 10:41 | 36 | 27 November 2020 | 11:54 | 10:41 |
| 11 | 24 September 2014 | 11:54 | 10:41 | 37 | 04 April 2021 | 11:54 | 10:40 |
| 12 | 20 April 2015 | 11:54 | 10:40 | 38 | 10 August 2021 | 11:03 | 10:41 |
| 13 | 10 August 2015 | 11:54 | 10:41 | 39 | 06 March 2022 | 11:48 | 10:41 |
| 14 | 11 September 2015 | 11:47 | 10:41 | 40 | 22 March 2022 | 11:48 | 10:41 |
| 15 | 09 June 2016 | 11:54 | 10:40 | 41 | 09 May 2022 | 11:42 | 10:41 |
| 16 | 12 August 2016 | 11:54 | 10:41 | 42 | 13 August 2022 | 11:42 | 10:41 |
| 17 | 15 October 2016 | 11:54 | 10:41 | 43 | 29 August 2022 | 11:42 | 10:41 |
| 18 | 31 October 2016 | 11:54 | 10:41 | 44 | 30 September 2022 | 11:26 | 10:41 |
| 19 | 19 January 2017 | 11:54 | 10:41 | 45 | 01 November 2022 | 11:12 | 10:41 |
| 20 | 09 April 2017 | 11:54 | 10:40 | 46 | 28 May 2023 | 11:10 | 10:40 |
| 21 | 12 June 2017 | 11:54 | 10:40 | 47 | 13 June 2023 | 10:36 | 10:40 |
| 22 | 23 February 2018 | 11:54 | 10:40 | 48 | 19 October 2023 | 10:55 | 10:41 |
| 23 | 11 March 2018 | 10:40 | 10:40 | 49 | 12 April 2024 | 10:34 | 10:40 |
| 24 | 02 August 2018 | 10:48 | 10:40 | 50 | 19 September 2024 | 10:00 | 10:41 |
| 25 | 18 August 2018 | 11:55 | 10:40 | 51 | 05 October 2024 | 10:48 | 10:41 |
| 26 | 05 October 2018 | 11:53 | 10:41 | | | | |
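For readers who want to rebuild such a pairing, the sketch below shows the general pattern with the Google Earth Engine Python API [195]. The collection IDs and the CLOUD_COVER scene property are real, but the date-matching logic and thresholds are illustrative, and per-pixel MODIS cloud screening via the product QC bands is omitted.

```python
import ee

ee.Initialize()  # assumes prior `earthengine authenticate`

roi = ee.Geometry.Point(1.9, 47.9)  # near Orléans, France (illustrative)

# Landsat 8 Collection 2 Level-2 scenes over the ROI with <20% cloud cover
landsat = (ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
           .filterBounds(roi)
           .filterDate('2013-01-01', '2025-01-01')
           .filter(ee.Filter.lt('CLOUD_COVER', 20)))

# Daily Terra MODIS LST product (Collection 6.1)
modis = ee.ImageCollection('MODIS/061/MOD11A1')

def same_day_modis(scene):
    """Return the MODIS LST image acquired on the same calendar day."""
    day = ee.Date(scene.date().format('YYYY-MM-dd'))  # truncate to day start
    return modis.filterDate(day, day.advance(1, 'day')).first()

first_scene = ee.Image(landsat.first())
modis_match = same_day_modis(first_scene)
print(landsat.aggregate_array('DATE_ACQUIRED').getInfo())
```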
Table 12. Per-date quantitative comparison of STF methods (ESTARFM, STTFN, EDCSTFN, and MLFF-GAN) for LST estimation. Results are reported for multiple target dates using RMSE, SSIM, PSNR, SAM, CC, and ERGAS. For each metric, the best performance per date is highlighted in bold, while arrows indicate whether lower (↓) or higher (↑) values correspond to better performance.
| Metrics | ESTARFM | STTFN | EDCSTFN | MLFF-GAN | ESTARFM | STTFN | EDCSTFN | MLFF-GAN |
|---|---|---|---|---|---|---|---|---|
| | **29 August 2022** | | | | **30 September 2022** | | | |
| RMSE (↓) | **5.350** | 6.258 | 5.725 | 5.758 | 2.650 | 2.649 | **2.317** | 2.549 |
| SSIM (↑) | **0.940** | 0.833 | 0.918 | 0.872 | **0.870** | 0.780 | 0.862 | 0.829 |
| PSNR (↑) | **21.410** | 20.048 | 20.821 | 20.772 | 22.900 | 22.894 | **24.058** | 23.227 |
| SAM (↓) | **8.450** | 9.374 | 8.812 | 9.118 | 6.38 | 7.661 | **6.336** | 6.798 |
| CC (↑) | **0.640** | 0.537 | 0.600 | 0.537 | 0.650 | 0.536 | **0.669** | 0.595 |
| ERGAS (↓) | **5.000** | 5.850 | 5.352 | 5.382 | 4.730 | 4.728 | **4.135** | 4.550 |
| | **01 November 2022** | | | | **28 May 2023** | | | |
| RMSE (↓) | 2.720 | 2.137 | 1.579 | **1.037** | **2.390** | 3.213 | 2.719 | 3.175 |
| SSIM (↑) | **0.870** | 0.685 | 0.848 | 0.776 | **0.880** | 0.768 | 0.844 | 0.730 |
| PSNR (↑) | 23.190 | 18.980 | 20.449 | **24.098** | **23.780** | 21.2 | 22.651 | 21.304 |
| SAM (↓) | 6.500 | 4.861 | **2.937** | 3.666 | **3.410** | 4.600 | 4.043 | 5.555 |
| CC (↑) | **0.660** | 0.644 | 0.653 | 0.536 | **0.900** | 0.808 | 0.849 | 0.719 |
| ERGAS (↓) | 5.65 | 4.445 | 3.283 | **2.157** | **2.520** | 3.396 | 2.874 | 3.356 |
| | **13 June 2023** | | | | **19 October 2023** | | | |
| RMSE (↓) | 1.950 | 1.993 | **1.937** | 2.710 | 3.820 | **1.877** | 2.842 | 3.028 |
| SSIM (↑) | **0.900** | 0.817 | 0.892 | 0.843 | 0.820 | 0.845 | **0.858** | 0.834 |
| PSNR (↑) | 26.140 | 25.936 | **26.183** | 23.26 | 20.970 | 24.322 | **24.718** | 20.166 |
| SAM (↓) | **3.010** | 3.414 | 3.115 | 3.671 | 8.470 | 5.132 | 5.540 | **5.131** |
| CC (↑) | **0.910** | 0.867 | 0.890 | 0.846 | 0.320 | 0.452 | 0.436 | **0.467** |
| ERGAS (↓) | 2.000 | 2.051 | **1.994** | 2.789 | 6.190 | **3.039** | 4.602 | 4.904 |
| | **12 April 2024** | | | | **19 September 2024** | | | |
| RMSE (↓) | 4.480 | 4.365 | 4.186 | **4.000** | 3.890 | 4.484 | **3.030** | 3.281 |
| SSIM (↑) | 0.800 | 0.789 | **0.827** | 0.806 | 0.800 | 0.759 | **0.833** | 0.718 |
| PSNR (↑) | 20.440 | 20.656 | 21.014 | **21.414** | **20.660** | 16.923 | 20.329 | 19.637 |
| SAM (↓) | 10.910 | **8.738** | 9.702 | 9.852 | 9.300 | 7.480 | **7.230** | 7.825 |
| CC (↑) | 0.430 | 0.428 | **0.449** | 0.410 | 0.430 | 0.535 | **0.595** | 0.449 |
| ERGAS (↓) | 6.480 | 6.313 | 6.058 | **5.785** | 5.460 | 6.289 | **4.249** | 4.601 |
Table 13. Average quantitative performance of STF methods (ESTARFM, STTFN, EDCSTFN, and MLFF-GAN) over all test samples. Results are reported using RMSE, SSIM, PSNR, SAM, CC, and ERGAS. For each metric, the best performance is highlighted in bold, and ↓ and ↑ indicate whether lower or higher values correspond to better performance.
| Metrics | ESTARFM | STTFN | EDCSTFN | MLFF-GAN |
|---|---|---|---|---|
| RMSE (↓) | 3.406 | 3.372 | **3.042** | 3.196 |
| SSIM (↑) | 0.860 | 0.7845 | **0.861** | 0.800 |
| PSNR (↑) | **22.436** | 21.371 | 22.279 | 21.736 |
| SAM (↓) | 7.054 | 6.408 | **5.96** | 6.452 |
| CC (↑) | 0.618 | 0.601 | **0.643** | 0.576 |
| ERGAS (↓) | 4.754 | 4.5139 | **4.068** | 4.191 |
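For reference, the metrics reported above can be computed for single-band LST images with a few lines of NumPy/scikit-image. Conventions differ across papers; the data range used for PSNR and SSIM, the per-image single-band SAM, and the single-band ERGAS below are assumptions of this sketch.

```python
import numpy as np
from skimage.metrics import structural_similarity

def stf_metrics(pred, ref, scale_ratio=33):
    """Common STF evaluation metrics for single-band LST images (2-D arrays).
    scale_ratio is the coarse-to-fine resolution ratio (~33 for 1 km vs. 30 m)."""
    pred, ref = pred.astype(float), ref.astype(float)
    data_range = float(ref.max() - ref.min())
    rmse = float(np.sqrt(np.mean((pred - ref) ** 2)))
    psnr = 20 * np.log10(data_range / rmse)
    cc = float(np.corrcoef(pred.ravel(), ref.ravel())[0, 1])
    ssim = structural_similarity(ref, pred, data_range=data_range)
    # Per-image SAM in degrees (SAM degenerates for single-band data)
    cos = np.sum(pred * ref) / (np.linalg.norm(pred) * np.linalg.norm(ref))
    sam = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    ergas = 100.0 / scale_ratio * rmse / ref.mean()  # single-band case
    return dict(RMSE=rmse, PSNR=psnr, CC=cc, SSIM=ssim, SAM=sam, ERGAS=ergas)

# Toy usage on a synthetic LST field (Kelvin) with additive noise
rng = np.random.default_rng(0)
ref = 290 + 10 * rng.random((64, 64))
pred = ref + rng.normal(0, 1.5, ref.shape)
print(stf_metrics(pred, ref))
```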