Article

An Experimental Study of the Accuracy and Change Detection Potential of Blending Time Series Remote Sensing Images with Spatiotemporal Fusion

1 School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
2 Institute of Space Science and Technology, Nanchang University, Nanchang 330031, China
3 School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(15), 3763; https://doi.org/10.3390/rs15153763
Submission received: 21 June 2023 / Revised: 17 July 2023 / Accepted: 21 July 2023 / Published: 28 July 2023
(This article belongs to the Special Issue Recent Progress of Change Detection Based on Remote Sensing)

Abstract

Over one hundred spatiotemporal fusion algorithms have been proposed, but convolutional neural networks trained with large amounts of data for spatiotemporal fusion have not shown significant advantages. In addition, no attention has been paid to whether fused images can be used for change detection. These two issues are addressed in this work. A new dataset consisting of nine pairs of images is designed to benchmark the accuracy of one-pair spatiotemporal fusion with neural-network-based models. Notably, the size of each image is significantly larger than in other datasets used to train neural networks. A comprehensive comparison of the radiometric, spectral, and structural losses is made using fourteen fusion algorithms and five datasets to illustrate the differences in the performance of spatiotemporal fusion algorithms with regard to various sensors and image sizes. A change detection experiment is conducted to test whether it is feasible to detect changes in specific land covers using the fusion results. The experiment shows that convolutional neural networks can be used for one-pair spatiotemporal fusion if the sizes of individual images are adequately large. It also confirms that spatiotemporally fused images can be used for change detection in certain scenes.

Graphical Abstract

1. Introduction

High-spatial-resolution satellite sequence data can be used to observe changes on Earth. However, it is difficult to obtain satellite data with both high temporal resolution and high spatial resolution. For example, the Landsat-8 Operational Land Imager (OLI) [1] sensor has a ground resolution of 30 m, but it takes at least 16 days to obtain a repeat image of the same location. In contrast, the Moderate Resolution Imaging Spectroradiometer (MODIS) obtains an image every half a day but with a coarse 500 m ground resolution. Some new satellites have high-resolution capabilities. For instance, Sentinel-2 provides 10 m ground resolution with a five-day revisiting period; these values are 16 m and 4 days, respectively, for Gaofen-1. However, adverse weather conditions make these satellite images far less available than would be expected. Thus, spatiotemporal fusion algorithms have been designed to combine images from different sources in order to obtain data with both high temporal resolution and high spatial resolution.
In the past twenty years, more than one hundred spatiotemporal fusion algorithms have been proposed [2]. In recent years, many spatiotemporal fusion algorithms based on convolutional neural networks (CNNs) have emerged to challenge the classic algorithms represented by the spatial and temporal adaptive reflectance fusion model (STARFM) [3] and flexible spatiotemporal data fusion (FSDAF) [4]. Various models, including deep convolutional spatiotemporal fusion networks (DCSTFN) [5]; enhanced deep convolutional spatiotemporal fusion networks (EDCSTFN) [6]; spatiotemporal adaptive reflectance fusion models using generative adversarial networks (GASTFN) [7]; and spatial, sensor, and temporal spatiotemporal fusion (SSTSTF), have been proposed. These models contributed to building a common framework for spatiotemporal fusion algorithms that employs two streams and the stepwise modeling of spatial, sensor, and temporal differences. In recent works [8,9,10,11,12,13,14,15], multiscale learning, spatial channel attention mechanisms, and edge preservation have been introduced into CNNs for the extraction and integration of features.
Most CNN-based algorithms use large amounts of time-series training data, while traditional algorithms perform better using one-pair training. Time series data allow an algorithm to learn the trends seen in changes in features over the course of seasons. Traditional algorithms lack the ability to learn from large volumes of data and are therefore not good at anticipating temporal trends. A few algorithms, such as enhanced STARFM [16], attempt to make interpolations in the time dimension with two pairs of sequences. However, it is time-consuming to collect long time series data. Data preparation takes up to one or two years after a satellite is launched, making it impossible to synthesize data during this period. Considering the limited lifetime of satellites and the emergence of new satellites, there is still a need for research on one-pair spatiotemporal fusion.
Whether end-to-end CNNs can be used for one-pair spatiotemporal fusion is not yet known. This question may be partially addressed by paying attention to the sizes of the individual images used for this purpose. In the two datasets most commonly used to train CNNs [17], each image is 1720 × 2040 and 3200 × 2720 pixels, respectively. Although a standard Landsat scene has 6000 × 6000 pixels before it is geometrically corrected, these reduced sizes may not be adequate for training a CNN. It is a known fact that CNNs require multiple pairs of images in their training process. If a single pair of images is sufficiently large, can a CNN be used for one-pair spatiotemporal fusion? This is the question we want to explore.
As far as land cover changes are concerned, it is not clear whether change detection can be performed on fused images of land cover successfully. Some works have shown that the images predicted by spatiotemporal fusion can be applied for some interpretation and inversion tasks. For example, the authors of [18] found that by mapping planting patterns and paddies to the spatiotemporally fused images obtained from a phenology-based fusion method, the accuracy of rice recognition can exceed 90%. In [19], the land surface temperature was quantitatively predicted with spatiotemporally fused images, with the results showing that the average deviation was within about 2.5 K; furthermore, the $R^2$ scores were greater than 0.96. Although room exists for further improvements, these values are approaching those necessary for practical use. As an important downstream task, change detection has not been investigated in terms of its relationship to spatiotemporal fusion; therefore, this will be investigated in this work.
The exploration of the accuracy of and potential for change detection with the use of spatiotemporal fusion constitutes the research goal of this article. In addition to collecting the three existing commonly used spatiotemporal fusion datasets [3,17], this paper produces a new dataset, which we partly use for mosaicking purposes [20]. Each image in the new dataset has 5792 columns and 5488 rows, which is much larger than the images in the commonly used datasets; therefore, this new dataset may benefit the performance achieved by CNN-based one-pair spatiotemporal fusion methods. A time series dataset [21] for super-resolution tests is also harnessed for fusion. Fourteen representative algorithms are tested on these five datasets. Compared with the existing studies, our work shows the performance boundary and limitations of the spatiotemporal fusion algorithms in a comprehensive way, allowing some new conclusions to be drawn.
The contributions of this paper can be summarized as follows:
  • A new dataset is designed for one-pair spatiotemporal fusion with CNN-based models.
  • A comprehensive comparison involving 14 fusion algorithms and 5 datasets is given to illustrate the differences in the performance of spatiotemporal fusion algorithms with regard to various sensors and image sizes.
  • The feasibility of the use of spatiotemporal fusion for change detection is investigated.
The rest of this paper is organized as follows. Section 2 provides a survey of the existing fusion algorithms. Section 3 presents the datasets, methods, and metrics used to compare the performance of these algorithms, and our new dataset is proposed. Section 4 gives the experimental results via a comparison of our results with fourteen state-of-the-art methods on five datasets. In Section 5, the reconstructed results are tested for change detection. The performance threshold, stability, and potential of each algorithm are discussed in Section 6. Section 7 gives the conclusions.

2. Background and Related Work

Spatiotemporal fusion involves two types of remote sensing images, as shown in Figure 1. One type has high temporal and low spatial resolution (hereinafter referred to as low-resolution or coarse-resolution images). The other type has high spatial and low temporal resolution (hereinafter referred to as high-resolution or fine-resolution images). Spatiotemporal fusion predicts the missing high-resolution image on the prediction date $t_2$ by utilizing the low-resolution image at $t_2$ and at least one pair of high- and low-resolution reference images at $t_i$ (where $i \neq 2$).
Existing spatiotemporal fusion methods can be categorized as weight-based, unmixing-based, learning-based, and hybrid methods. Weight-function-based methods apply a linear model to multi-source observations of pure coarse-resolution pixels and further utilize a weighting strategy to enhance predictions for mixed pixels. These methods exploit the spatial dependence of spectrally similar pixels to reduce the uncertainty and block artifacts in fusion results. The fusion is performed locally, which leads to fast and linear processing speeds. Besides the classic STARFM algorithm [3], enhanced STARFM [16], linear injection [22], and Fit-FC [23] are also typical weight-based methods. However, strategies that rely solely on pixel similarity fail to maintain structure and detail, so complex regions require a higher coarse-image resolution.
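As a toy illustration of this weight-based idea (not a faithful re-implementation of STARFM or Fit-FC), the following sketch predicts each fine-resolution pixel by adding a locally weighted coarse-resolution temporal change to the fine reference. The window size, the similarity weighting, and the assumption that all inputs have already been resampled to the fine grid are illustrative simplifications.

```python
import numpy as np

def toy_weighted_fusion(fine_t1, coarse_t1, coarse_t2, win=7):
    """Toy weight-based prediction for one band.

    fine_t1, coarse_t1, coarse_t2: 2-D arrays already resampled to the same
    fine grid (H, W). Returns a predicted fine-resolution image at t2.
    """
    H, W = fine_t1.shape
    delta = coarse_t2 - coarse_t1                    # coarse temporal change on the fine grid
    pred = np.empty((H, W), dtype=np.float64)
    r = win // 2
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - r), min(H, i + r + 1)
            j0, j1 = max(0, j - r), min(W, j + r + 1)
            patch = fine_t1[i0:i1, j0:j1]
            # spectrally similar neighbours receive larger weights
            w = 1.0 / (np.abs(patch - fine_t1[i, j]) + 1.0)
            w /= w.sum()
            # inject the weighted coarse change into the fine reference
            pred[i, j] = fine_t1[i, j] + np.sum(w * delta[i0:i1, j0:j1])
    return pred
```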
Following the framework proposed by Zhukov et al. [24], unmixing-based methods [25,26,27,28] employ spatial unmixing techniques for fusion, which estimate the high-resolution endmembers by unmixing the coarse-resolution pixels using the class scores explained by the reference image. Due to the wide spectrum and large resolution ratio, the unmixing-based methods may be prone to errors in abundance estimation, spectral variations, and nonlinear mixing.
Learning-based methods take advantage of recent advances in machine learning [29] to model the relationship between inputs and outputs; they include dictionary learning, extreme learning machines, random forests, Bayesian frameworks, and convolutional neural networks. The dictionary-pair-based algorithms [30,31,32,33] use sparse representation to establish connections between high- and low-resolution images. Deep neural networks have replaced them because they learn large volumes of data more efficiently. Using complex network structures, neural networks have the potential to map the spatial, temporal, and sensor relationships between images from different sources, as has been demonstrated for spatiotemporal fusion [6,20,34,35,36,37]. These methods have significant modeling advantages but suffer from limitations in the quality and size of the training data. Low-quality data can produce nonlinear relationships that are worse than those obtained by dictionary learning, and the fusion results of inadequately trained networks may also be inferior to those of traditional weight- or unmixing-based models. Therefore, neural network methods are not easily applied to one-pair spatiotemporal fusion.
Hybrid methods combine the advantages of diverse categories to pursue better performance. The flexible spatiotemporal data fusion (FSDAF) algorithm is an important representative that harnesses weight and unmixing for spatiotemporal fusion. Its revisions, such as SFSDAF [38], FSDAF 2.0 [39], and EFSDAF [40], can also be categorized into this type. Other hybrid studies [28,41] integrated weight and unmixing strategies, too. Our previous work [42] is also a hybrid type, which integrates the results of FSDAF and Fit-FC to enhance performance.
A typical spatiotemporal fusion uses three images, but a few studies have attempted to reduce the number of input images to two. Fung et al. [43] utilized the optimization concept of the Hopfield neural network for spatiotemporal image fusion and proposed a new algorithm named the Hopfield neural network spatiotemporal data fusion model (HNN-SPOT). The algorithm uses a fine-resolution image taken on an arbitrary date and a coarse image taken on the forecast date to derive a synthesized fine-resolution image for the forecast date. Subsequently, Wu et al. [44] also achieved data fusion using only two images as input. They proposed an efficient fusion strategy that degrades the high-resolution image of the reference date to obtain a simulated low-resolution image, which can be combined with any spatiotemporal fusion model to accomplish fusion with simplified input. Experiments with three spatiotemporal fusion algorithms (STARFM, STNLFFM, and FSDAF) were carried out on MODIS, Landsat, and Sentinel-2 land surface reflectance products, and the results suggest that the fusion performance with only two input images is comparable or even superior to that with three input images. Tan et al. [36] proposed a GAN-based spatiotemporal fusion model (GAN-STFM) that uses a conditional generative adversarial network to reduce the number of model inputs and to free the selection of the reference image from time restrictions. Liu et al. [45] presented a survey of GANs for remote sensing fusion.
Some algorithms focus on improving the speed of spatiotemporal fusion. Li et al. [46] proposed an extremely fast spatiotemporal fusion method with local normalization that extracts spatial information from the prior high-spatial-resolution images and embeds that information into the low-spatial-resolution images in order to predict the missing high-spatial-resolution images. Gao et al. [47] proposed an enhanced FSDAF (cuFSDAF) that runs on GPUs of different computing capabilities to process datasets of arbitrary size.
It is well known that the performance of spatiotemporal fusion algorithms is unstable. Therefore, several studies have analyzed the impact of factors such as time interval, registration error, number of bands, and clouds. The experiment conducted by Shao et al. [48], based on an enhanced super-resolution convolutional neural network, demonstrated that the number of input auxiliary images and the temporal interval (i.e., the difference between image acquisition dates) between the auxiliary images and the target image both influence the performance of the fusion network. Tang and Wang [49] analyzed the influence of geometric registration errors on spatiotemporal fusion. Subsequently, Wang et al. [50] studied the effect of registration errors on patch-based spatiotemporal fusion methods. Experimental results show that the patch-based fusion model SPSTFM is more robust and accurate than pixel-based fusion models (such as STARFM and Fit-FC), and for each method, the effect of the registration error is greater in heterogeneous regions than in homogeneous regions. Tan et al. [6] proposed an enhanced deep convolutional spatiotemporal fusion network (EDCSTFN) and found that multiband deep learning models slightly outperform single-band deep learning models. Luo et al. [51] proposed a generic and fully automated method (STAIR) for spatiotemporal fusion that imputes pixels with missing values due to cloud cover or sensor mechanical issues using an adaptive-average correction process to generate cloud- or gap-free data.
Spatiotemporal fusion has the potential to construct results whose resolution exceeds that of high-resolution reference images. Chen and Xu [52] proposed a unified spatial–temporal–spectral blending model to improve the utilization of accessible satellite data. First, an improved adaptive intensity–hue–saturation approach was used to enhance the spatial resolution of Landsat Enhanced Thematic Mapper Plus (ETM+) data; then, STARFM was used to fuse the MODIS and enhanced Landsat ETM+ data to generate the final synthetic data. Wei et al. [53] fused 8-m multispectral images with 16-m wide-field-view images to reduce the revisiting time of the 8-m multispectral images to 4 days from the original 49 days. The fused results are further improved to 2 m using the panchromatic band.
The data used for spatiotemporal fusion are largely either the Landsat series or MODIS data, but spatiotemporal fusion for other satellites is also being explored. For example, Rao et al. [54] conducted spatiotemporal fusion of the LISS sensor (23.5 m with a 24-day revisiting period) and the AWiFS sensor (56 m with a 5-day revisiting period) on the Indian satellite Resourcesat-2 to obtain synthetic data with a ground resolution of 23.5 m and a revisiting period of 5 days. Similar studies were performed between Sentinel-3 and Sentinel-2 [23], SPOT5 and MODIS [55], Landsat Thematic Mapper (TM) and Envisat Medium Resolution Imaging Spectrometer (MERIS) [56], Planet and Worldview [57], and the multispectral sensors within Gaofen-1 [53].
Although the performance of spatiotemporal fusion algorithms is far from perfect, they have been put into practice. For example, Ding et al. [18] used the fusion results to extract rice fields. Xin et al. [58] used the fusion results to improve the near-real-time monitoring of forest disturbances. Zhang et al. [59] utilized the NDVI data obtained via spatiotemporal fusion to establish a grassland biomass estimation model for monitoring seasonal vegetation changes. In terms of water applications, Guo et al. [60] proposed a spatiotemporal fusion model to monitor marine ecology through chlorophyll-a inversion. In addition to conventional remote sensing applications, spatiotemporal fusion has also been applied to synthesize surface brightness temperature data [19,61]. Shi et al. [62] proposed a comprehensive FSDAF (CFSDAF) method to observe the land surface temperature in urban areas with high spatial and temporal resolutions.

3. Preparation for Comparison

3.1. Datasets

Most spatiotemporal fusion studies have been conducted between the Landsat series and MODIS Terra. Both are in sun-synchronous orbits at an altitude of 705 km and acquire images between 10:00 a.m. and 10:30 a.m. local time. The spectral response curves of commonly used data sources are shown in Figure 2. The mean absolute deviations between MODIS and Landsat-5, Landsat-7, and Landsat-8 are 0.1704, 0.1524, and 0.3301, respectively.
Four Landsat datasets have been prepared for the experiment and comparisons. All Landsat images are the product of surface reflectance obtained after atmospheric correction. The pixel values are then magnified by a factor of 10,000 and quantized with 16-bit integers so that they fall within the theoretical range of 0 to 10,000. Besides the Landsat-7 and Landsat-8 sources, we also tested a time-series FY4ASRcolor dataset from the FY4A Meteorological Satellite. The FY4ASRcolor images are from two separate cameras of the same satellite. The summaries of all the datasets are given in Table 1 and will be detailed as follows.

3.1.1. L7STARFM

L7STARFM contains three pairs of Landsat-7 ETM+ and MODIS images that were captured on 24 May 2001; 11 July 2001; and 12 August 2001, respectively. All images consist of green, red, and near-infrared bands that are derived from the surface reflectance products. The image sizes are 1200 × 1200. The ground resolution of Landsat-7 images is 30 m, while it is 500 m for MODIS. The data were first used in STARFM [3] and have been tested in numerous works for traditional one-pair algorithms including weight-based, unmixing-based, and dictionary-learning-based methods, which have no complex architectures for training; therefore, the dataset is named L7STARFM. The dataset is available at https://github.com/isstncu/l8jx (accessed on 20 July 2023).

3.1.2. CIA

The CIA dataset [17] is widely used to benchmark spatiotemporal fusion algorithms. Seventeen cloud-free Landsat-7 ETM+ to MODIS image pairs were captured between October 2001 and May 2002, a period when crop phenology has significant temporal dynamics. Geographically, the area covered by the CIA dataset is the Coleambally Irrigation Area (CIA) in southern New South Wales, Australia (145.0675°E, 34.0034°S). Each image spans 43 km from north to south and 51 km from east to west. The total area is 2193 km² and consists of 1720 columns and 2040 rows at a ground resolution of 25 m. Each image consists of six bands. The dataset is available at https://data.csiro.au/collection/csiro:5846v1 (accessed on 20 July 2023).
In the test, the blue, green, red, and near-infrared (NIR) bands of CIA images are used, and the image size is reduced to 1408 columns and 1824 rows by removing the blank areas outside the valid image scope. The blank areas are cropped because the algorithms do not account for invalid data during processing, which may result in significant errors in training and reconstruction. The neural-network-based algorithms require a large amount of data for training. Therefore, the training data come from the 10 images in 2002, while the validation dataset comprises 5 images from 2001. Other algorithms can perform one-pair spatiotemporal fusion, in which the reference times are 5 January 2002; 13 February 2002; 11 April 2002; 18 April 2002; and 18 April 2002, respectively, and the prediction times are 4 December 2001; 2 November 2001; 17 October 2001; 9 November 2001; and 25 November 2001, respectively.

3.1.3. LGC

The Lower Gwydir Catchment (LGC) study site [17] is located in northern New South Wales, Australia (149.2815°E, 29.0855°S). Fourteen cloud-free Landsat Thematic Mapper (TM) and MODIS image pairs across the LGC were taken from April 2004 to April 2005. The LGC dataset spans the Gwydir River with a width of 80 km from north to south and 68 km from east to west and a total area of 5440 km². The images have 3200 columns and 2720 rows with a ground resolution of 25 m. LGC experienced severe flooding in mid-December 2004, resulting in approximately 44% of the area being submerged. Because of the distinct spatial and temporal changes caused by the flood event, the LGC dataset can be considered a dynamically changing site. The dataset is available at https://data.csiro.au/collection/csiro:5847v1 (accessed on 20 July 2023).
Similar to the CIA dataset, the blue, green, red, and NIR bands of the LGC dataset are extracted after removing the outer blank areas within the images, so 3184 columns and 2704 rows remain. When used for neural network evaluation, the nine images from 2004 are used for training, while the four images from 2005 are to be reconstructed. The dates to be reconstructed are set as follows: 3 April 2005; 2 March 2005; 13 January 2005; and 29 January 2005. Other algorithms use one-pair fusion, where the corresponding reference images are 2 May 2004; 26 November 2004; 28 December 2004; and 28 December 2004.

3.1.4. L8JX

The L8JX dataset is designed by us to test the availability of neural networks for one-pair spatiotemporal fusion. Landsat-8 OLI images were captured in December 2017, October 2018, and November 2017, respectively. The corresponding low-resolution images are the synthesized 8-day MODIS MOD09A1 products. Each image has the blue, green, red, and NIR bands. According to the grid rule of Landsat satellites, these images cover Jiangxi Province in China. The path numbers are 121, and the row numbers are 41, 42, and 43, respectively. The dataset, which contains nine pairs of images, is abbreviated as L8JX.
Due to the tilt of the orbit, there are black borders around the image content. In order to remove the useless space in the images, additional processing was carried out. The original Landsat-8 images were rotated counterclockwise by approximately 13 degrees, and then the common areas without black borders were extracted. The entire dataset was then divided into three scenes that are geographically connected through small overlapping areas, which can be used as the ground truth to benchmark spatiotemporal fusion or mosaicking algorithms. Each image in L8JX has 5792 columns and 5488 rows. The 8-day MODIS images were rotated, and the blank areas were also removed at the same time. The dataset is available at https://github.com/isstncu/l8jx (accessed on 20 July 2023).
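A minimal sketch of this rotate-and-crop preprocessing is shown below, assuming scipy.ndimage is used for rotation; the nodata value, the bounding-box crop, and the rotation sign convention are illustrative assumptions, and the dataset itself extracts the largest common area rather than a simple bounding box.

```python
import numpy as np
from scipy import ndimage

def rotate_and_crop(band, angle_deg=13.0, nodata=0):
    """Rotate a single Landsat-8 band and crop the bounding box of valid pixels.

    band: 2-D array with nodata marking the black border. Depending on the
    display convention, the sign of angle_deg may need to be flipped to obtain
    a counterclockwise rotation.
    """
    rotated = ndimage.rotate(band, angle_deg, reshape=True, order=0, cval=nodata)
    valid = rotated != nodata
    rows = np.where(valid.any(axis=1))[0]
    cols = np.where(valid.any(axis=0))[0]
    return rotated[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```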

3.1.5. FY4ASRcolor

The FY4ASRcolor dataset [21] was proposed for testing the super-resolution of low-resolution remote sensing images. The FY4ASRcolor dataset spans the blue (450–490 nm), red–green (550–750 nm), and visible near-infrared (VNIR, 750–900 nm) bands. The ground resolutions are 1 km for the high-resolution part and 4 km for the low-resolution part. Images in the dataset were captured on 16 September 2021. Each image in FY4ASRcolor has 10,992 columns and 4368 rows. All the bands are in a 16-bit data format with 12-bit quantization, meaning that each digital number ranges from 0 to 4095. The dataset is available at https://github.com/isstncu/fy4a (accessed on 20 July 2023).
FY4ASRcolor can be used to test spatiotemporal fusion because it comprises time-series data. The images in FY4ASRcolor are all captured by the Advanced Geostationary Radiation Imager (AGRI) camera with full disc scanning covering China (region of China, REGC) with a 5-min time interval for regional scanning. Because of the continuous change in solar angle, the radiation values can change dramatically over large time intervals. Since the daily radiative variation is repetitive, the spatiotemporal fusion of FY4ASRcolor can be used to assess the feasibility of learning the variation pattern throughout a year.
The FY4ASRcolor dataset offers another opportunity for spatiotemporal fusion within a homogeneous platform instead of across heterogeneous platforms. The high- and low-resolution images in FY4ASRcolor are acquired by separately mounted sensors of the same type. Different from MODIS and Landsat, each pair of images in the FY4ASRcolor dataset was taken simultaneously with the same sensor response. The sensor difference in FY4ASRcolor is much smaller than in the L7STARFM, CIA, LGC, and L8JX datasets, as the average absolute error is 29.63. Usually, the sensor difference is hard to model, as it is stochastic and scene-dependent. The minimal sensor difference in the FY4ASRcolor dataset makes it ideal for conducting spatiotemporal fusion studies, as it eliminates the troublesome sensor discrepancy involved in fusing MODIS and Landsat. A similar study was carried out by us for the spatiotemporal–spectral fusion of Gaofen-1 images [53]; however, in that case there is only a 2-fold difference in spatial resolution.

3.2. Methods

Fourteen spatiotemporal fusion algorithms covering three categories are collected for evaluation. These algorithms include STARFM [3], Fit-FC [23], VIPSTF [63], FSDAF [4], SFSDAF [38], SPSTFM [30], EBSCDL [31], CSSF [33], BiaSTF [34], DMNet [35], EDCSTFN [6], GANSTFM [36], MOST [20], and SSTSTF [37].
STARFM, Fit-FC, and VIPSTF-SW are weight-based methods. STARFM is a classic and widely used spatiotemporal fusion approach. Fit-FC addresses the problem of discontinuities caused by clouds or shadows in spatiotemporal fusion. VIPSTF is a flexible framework with two versions, VIPSTF-SW and VIPSTF-SU; the weight-based version (VIPSTF-SW) is used here and abbreviated as VIPSTF. VIPSTF introduced the concept of a virtual image pair (VIP), which makes use of the observed image pairs to reduce the uncertainty of estimating the increment from fine-resolution images.
FSDAF and SFSDAF are hybrid methods. FSDAF is a spatiotemporal fusion framework that is compatible with both slow and abrupt changes in land surface reflectance and automatically predicts both gradual and land cover changes through an error analysis in the fusion process. SFSDAF is an enhanced FSDAF framework that aims to reconstruct heterogeneous regions undergoing land cover changes. It utilizes sub-pixel class fraction change information to make inferences.
SPSTFM, EBSCDL, and CSSF are dictionary-learning-based methods. SPSTFM trains two dictionaries generated from coarse- and fine-resolution difference image patches at the given time to build the coupled dictionaries for reconstruction. EBSCDL fixed the dictionary perturbations in SPSTFM using the error-bound-regularized method, which leads to a semi-coupled dictionary pair to address the differences between the coarse- and fine-resolution images. Compressed sensing for spatiotemporal fusion is addressed in CSSF, which explicitly describes the downsampling process and solves it using the dual semi-coupled dictionary pairs.
In the comparison methods, STARFM, BiaSTF, DMNet, EDCSTFN, GANSTFM, MOST, and SSTSTF are coded with Python. The PyTorch framework is used for deep learning. Fit-FC, VIPSTF, SFSDAF, SPSTFM, EBSCDL, and CSSF are coded with MATLAB. FSDAF is coded with Interactive Data Language (IDL). All hyperparameters are set according to the original articles. The experimental conditions are presented in Table 2.
BiaSTF, DMNet, EDCSTFN, GANSTFM, MOST, and SSTSTF use convolutional neural networks for fusion. BiaSTF models the sensor differences as a bias, which is modeled with convolutional neural networks to alleviate the spectral and spatial distortions in reconstructed images. DMNet introduces multiscale mechanisms and dilated convolutions to capture more abundant details while reducing the number of trainable parameters. EDCSTFN is an enhanced deep convolutional spatiotemporal fusion network with convolutional neural networks used to extract details of high-resolution images and residuals between all known images. GANSTFM introduces the conditional generative adversarial network and switchable normalization technique into the spatiotemporal fusion problem, where the low-resolution images at the given time are not needed. MOST cascades enhanced deep neural networks and trains the spatial and sensor differences separately to fuse images quickly and effectively. SSTSTF proposes a step-by-step modeling framework in which three models based on deep neural networks explicitly model the spatial difference, sensor difference, and temporal difference separately. In the training stage, BiaSTF, DMNet, MOST, and SSTSTF can be trained with only one pair, while EDCSTFN and GANSTFM need two or more pairs to learn temporal changes. Training parameters are given in Table 3 for the CNN-based algorithms, where the average training time is also given by training the first band of the L8JX dataset twice.

3.3. Metrics

Metrics are also used to assess the performance of the synthesized images. Root mean square error (RMSE) measures the radiometric discrepancy. Spectral angle mapper (SAM), relative average spectral error (RASE) [64], relative dimensionless global error in synthesis (ERGAS) [65], and Q4 [66] measure the color consistency. Three metrics are used to measure the structural similarity, including the classic structural similarity (SSIM), the normalized difference Robert’s edge (ndEdge) for edge similarity, and the normalized difference local binary pattern (ndLBP) for textural similarity. To help readers understand the digital trends, the negative SSIM (nSSIM) and negative Q4 (nQ4) are used instead of the standard definitions.
To establish the metrics, the Landsat images taken at a specific time are used as the ground truth. To evaluate the spectral consistency with SAM, RASE, ERGAS, and nQ4, the NIR, red, and green bands are used, but not the blue band. The ideal result is 0 for all the metrics.
RMSE is calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^2}$$
where x and y are two single-band images sharing the same pixel quantity N, and i is a pixel location.
For a reference image x and an evaluation image y, the spectral angle mapper (SAM) between y and x is calculated using the normalized correlation coefficient as follows:
$$\mathrm{SAM} = \arccos\frac{\langle x, y\rangle}{\left\|x\right\|\,\left\|y\right\|}$$
where $\langle x, y\rangle$ is the inner product between x and y.
For a reference image x and an evaluation image y, the RASE metric for y to reference x is calculated as
$$\mathrm{RASE} = \frac{\mathrm{RMSE}\left(x, y\right)}{\bar{x}} = \frac{1}{\bar{x}}\sqrt{\frac{1}{N\cdot C}\sum_{i=1}^{N\cdot C}\left(x_i - y_i\right)^2}$$
where RMSE is the root mean square error between two images (calculated using pixels in all bands), N is the number of pixel locations (product of the width and height), and C is the number of bands.
The ERGAS metric is calculated as
$$\mathrm{ERGAS} = \frac{1}{r}\sqrt{\frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{RMSE}\left(x_c, y_c\right)^2}{\bar{x}_c^2}}$$
where C is the total number of bands in an image, $x_c$ is the c-th band of image x, $y_c$ is the c-th band of image y, and $\bar{x}_c$ is the mean value of $x_c$. r is the resolution ratio between the high- and low-resolution images, which was initially defined for pansharpening. For example, r is set to 2 to evaluate the Landsat-7 fusion between the 15 m panchromatic band and the 30 m multispectral bands. For hyperspectral visualization, r is set to 1.
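The radiometric and spectral metrics above can be transcribed into a short NumPy sketch; the function names are illustrative, images are assumed to be float arrays of shape (bands, H, W), and SAM is computed in the whole-image form given above (per-pixel averaging is another common variant).

```python
import numpy as np

def rmse(x, y):
    """Root mean square error over all pixels (and bands, if present)."""
    return np.sqrt(np.mean((x - y) ** 2))

def sam(x, y):
    """Spectral angle (radians) between the flattened images, as in the formula above."""
    xv, yv = x.ravel(), y.ravel()
    cosang = np.dot(xv, yv) / (np.linalg.norm(xv) * np.linalg.norm(yv) + 1e-12)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def rase(x, y):
    """Relative average spectral error; x is the reference image of shape (C, H, W)."""
    return rmse(x, y) / np.mean(x)

def ergas(x, y, r=1.0):
    """ERGAS with the 1/r factor used above (other definitions include a factor of 100)."""
    per_band = [(rmse(xc, yc) ** 2) / (np.mean(xc) ** 2) for xc, yc in zip(x, y)]
    return (1.0 / r) * np.sqrt(np.mean(per_band))
```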
Q4 is defined by
$$\mathrm{Q4} = \frac{4\left|\mathrm{cov}\left(z_x, z_y\right)\right|\cdot\left|\bar{z}_x\right|\cdot\left|\bar{z}_y\right|}{\left(\mathrm{cov}\left(z_x, z_x\right) + \mathrm{cov}\left(z_y, z_y\right)\right)\left(\left|\bar{z}_x\right|^2 + \left|\bar{z}_y\right|^2\right)}$$
where two quaternion variables are defined as
$$z_x = a_x + i\,b_x + j\,c_x + k\,d_x$$
$$z_y = a_y + i\,b_y + j\,c_y + k\,d_y$$
where i, j, and k are the basic units of a quaternion.
For a color image x, $a_x$ is 0, and $b_x$, $c_x$, and $d_x$ correspond in turn to the pixel values of its three bands, forming a column vector of quaternions. Analogously, $a_y$ is 0, and $b_y$, $c_y$, and $d_y$ correspond in turn to the pixel values of the three bands of image y. $\bar{z}_x$ and $\bar{z}_y$ are the mean values of the quaternion vectors $z_x$ and $z_y$, respectively.
For a quaternion $z = a + i\,b + j\,c + k\,d$, the modulus $\left|z\right|$ is calculated using
$$\left|z\right| = \sqrt{z\cdot z^*} = \sqrt{a^2 + b^2 + c^2 + d^2}$$
where $z^*$ is the conjugate of z, defined by
$$z^* = a - i\,b - j\,c - k\,d.$$
The covariance between quaternion vectors $z_x$ and $z_y$ is
$$\mathrm{cov}\left(z_x, z_y\right) = \frac{\left(z_x - \bar{z}_x\right)^T\left(z_y - \bar{z}_y\right)^*}{N}.$$
Instead of the standard Q4 index, the negative Q4 (nQ4) is used, which is defined as
$$\mathrm{nQ4} = 1 - \mathrm{Q4}$$
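The quaternion definitions above translate into the following NumPy sketch; it evaluates Q4 globally over the whole image (the original Q4 index is often computed in sliding blocks and averaged), and the helper names are illustrative.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternion arrays with components on the last axis."""
    a1, b1, c1, d1 = p[..., 0], p[..., 1], p[..., 2], p[..., 3]
    a2, b2, c2, d2 = q[..., 0], q[..., 1], q[..., 2], q[..., 3]
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def qconj(q):
    """Quaternion conjugate: negate the imaginary parts."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def qmod(q):
    """Quaternion modulus along the last axis."""
    return np.sqrt(np.sum(q ** 2, axis=-1))

def nq4(x, y):
    """nQ4 = 1 - Q4 for two 3-band images of shape (3, H, W)."""
    # one quaternion per pixel: zero real part, the three bands as i, j, k parts
    zx = np.concatenate([np.zeros((1,) + x.shape[1:]), x]).reshape(4, -1).T
    zy = np.concatenate([np.zeros((1,) + y.shape[1:]), y]).reshape(4, -1).T
    n = zx.shape[0]
    mx, my = zx.mean(axis=0), zy.mean(axis=0)
    cov_xy = qmul(zx - mx, qconj(zy - my)).sum(axis=0) / n   # quaternion-valued covariance
    var_x = np.sum(qmod(zx - mx) ** 2) / n                   # cov(zx, zx) is real
    var_y = np.sum(qmod(zy - my) ** 2) / n
    q4 = (4 * qmod(cov_xy) * qmod(mx) * qmod(my)) / (
        (var_x + var_y) * (qmod(mx) ** 2 + qmod(my) ** 2))
    return 1.0 - q4
```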
nSSIM is calculated using
$$\mathrm{SSIM}\left(x, y\right) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)}, \qquad \mathrm{nSSIM} = 1 - \mathrm{SSIM}$$
where x and y are two single-band images, $\mu_x$ and $\mu_y$ denote the mean values, $\sigma_x$ and $\sigma_y$ denote the standard deviations, and $\sigma_{xy}$ is the covariance between x and y. $c_1$ and $c_2$ are small constants used to avoid division by zero.
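A global-statistics sketch of nSSIM follows; the constants $c_1$ and $c_2$ use the conventional $(k \cdot \text{data range})^2$ form with illustrative k values, and SSIM is frequently computed in local windows and averaged rather than globally as done here.

```python
import numpy as np

def nssim(x, y, data_range=10000.0):
    """Global SSIM complement (1 - SSIM) for two single-band images."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2  # assumed constants
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x2, sigma_y2 = x.var(), y.var()
    sigma_xy = np.mean((x - mu_x) * (y - mu_y))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x2 + sigma_y2 + c2))
    return 1.0 - ssim
```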
Robert's edge (Edge) is used to measure the edge features of fused images by detecting local discontinuities. For an image x, the edge discontinuity at the coordinate (i, j) is defined as
$$\mathrm{Edge}_{i,j} = \left|x_{i,j} - x_{i+1,j+1}\right| + \left|x_{i,j+1} - x_{i+1,j}\right|$$
where $x_{i,j}$ is the pixel value at the location (i, j).
A Robert's edge image is generated after the point-by-point calculation over the image x, which can be denoted as $\mathrm{Edge}_x$. By denoting the Robert's edge images of the fused image and the ground truth image as $\mathrm{Edge}_F$ and $\mathrm{Edge}_R$, respectively, the normalized difference Edge index (ndEdge) between them is calculated as
$$\mathrm{ndEdge} = \frac{\mathrm{Edge}_F - \mathrm{Edge}_R}{\mathrm{Edge}_F + \mathrm{Edge}_R}$$
It should be noted that the ndEdge calculation involves only the 10% of locations with the highest Edge values, which are determined from $\mathrm{Edge}_R$. For a multiband image, after finding the ndEdge of each band, the average of these values yields the total ndEdge.
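A possible NumPy sketch of ndEdge is shown below; whether the ratio is taken over sums or means of the selected locations is not fully specified in the text, so sums are used here as an assumption.

```python
import numpy as np

def roberts_edge(x):
    """Robert's edge magnitude for a single-band image, following the definition above."""
    x = np.asarray(x, dtype=np.float64)
    e = np.zeros_like(x)
    e[:-1, :-1] = (np.abs(x[:-1, :-1] - x[1:, 1:]) +
                   np.abs(x[:-1, 1:] - x[1:, :-1]))
    return e

def nd_edge(fused, truth):
    """ndEdge for single-band images, using the 10% strongest edges of the ground truth."""
    ef, er = roberts_edge(fused), roberts_edge(truth)
    mask = er >= np.percentile(er, 90)            # strongest 10% of reference edge locations
    sf, sr = ef[mask].sum(), er[mask].sum()
    return (sf - sr) / (sf + sr + 1e-12)
```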
The local binary pattern (LBP) is an operator that describes the local texture characteristics of an image. In a 3 × 3 window, the center pixel is adjacent to 8 pixels. The adjacent locations are marked as 1 if their gray values are larger than that of the center pixel, and 0 otherwise. By concatenating the eight comparison results, an 8-bit binary digit is obtained ranging from 0 to 255, which we call the LBP code of the center pixel. The point-by-point calculation for each point gives an LBP image.
By denoting the LBP images of the fused image and the ground truth image as $\mathrm{LBP}_F$ and $\mathrm{LBP}_R$, respectively, the normalized difference LBP index (ndLBP) between them is calculated as
$$\mathrm{ndLBP} = \frac{\mathrm{LBP}_F - \mathrm{LBP}_R}{\mathrm{LBP}_F + \mathrm{LBP}_R}$$
Similar to the calculation of multiband ndEdge, after finding the ndLBP for each band of a multiband image, the average of these values is the total ndLBP.
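The LBP coding and ndLBP can be sketched as follows; the ordering of the eight neighbours (and hence the bit positions) and the use of sums in the ratio are assumptions, since the text does not fix them.

```python
import numpy as np

def lbp_image(x):
    """8-neighbour LBP codes (0-255) for a single-band image; border pixels keep code 0."""
    h, w = x.shape
    codes = np.zeros((h, w), dtype=np.int32)
    center = x[1:-1, 1:-1]
    # clockwise neighbour order starting at the top-left corner (assumed ordering)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (di, dj) in enumerate(offsets):
        neighbour = x[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
        codes[1:-1, 1:-1] |= (neighbour > center).astype(np.int32) << bit
    return codes

def nd_lbp(fused, truth):
    """ndLBP for single-band images, using the sum of LBP codes of each image."""
    lf = lbp_image(fused).sum(dtype=np.float64)
    lr = lbp_image(truth).sum(dtype=np.float64)
    return (lf - lr) / (lf + lr + 1e-12)
```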
In order to present the radiometric error with a percentage indicator, the relative uncertainty is evaluated, which is defined as
$$\mathrm{uncertainty} = \frac{\text{absolute error}}{\text{measured value}}.$$
The mean relative uncertainty of a single-channel image is the mean value of relative uncertainty across all pixel locations. The best uncertainty is given to represent the optimal performance that the state-of-the-art algorithms can reach.
The mean value (mean) and correlated coefficient (CC) are also calculated. The mean value is an indicator for data range, and CC illustrates the similarity between the reference images and the target images.

4. Experimental Results for Spatiotemporal Fusion

All five datasets were tested using fourteen algorithms from three categories, and the results were evaluated using the RMSE, SAM, RASE, ERGAS, nQ4, nSSIM, ndEdge, and ndLBP. The scores are listed in Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, Table 16 and Table 17. Note that offline training was involved for the neural-network-based methods when the CIA and LGC datasets were tested.
Although some spatiotemporal fusion algorithms suggest fusion strategies with two or more known pairs, only one-pair fusion is tested in this paper. As a matter of fact, any algorithm for one-pair fusion can be transformed into multi-pair fusion. When two or more reference pairs are provided, temporal constraints or combination policies may dominate the fusion performance, which is an open topic linked to practical use. For example, Chen et al. [67] have explored the issue of selecting the best reference image when multiple reference sources are given. This issue is as complex as one-pair fusion itself and cannot be adequately covered in the limited space here.

4.1. CIA Results

Five images from the CIA dataset were tested at different time intervals. Table 4 presents the RMSE scoring results on the CIA dataset. Compared to other methods, the weight-based methods achieve the highest scores, with Fit-FC having a significant advantage. STARFM performs best for the fourth image. Neural-network-based methods do not perform as well as hybrid methods but outperform dictionary-learning-based methods. Additionally, SSTSTF outperforms the other neural-network-based methods.
Table 5 presents several indicators of spectral consistency. Weight-based methods yield all the best spectral results. In particular, Fit-FC and STARFM hit the highest scores among the metrics for almost all of the data. Neural-network-based algorithms also perform well in preserving spectral consistency, among which the SSTSTF method achieves excellent results.
The structural consistency of the CIA dataset is assessed and shown in Table 6. Except for the first image, the structural scores are too poor to account for reasonable reconstruction quality. FSDAF received the highest SSIM score for the first image, while Fit-FC won the second, third, and fifth competitions. Neural-network-based methods outperform dictionary- and weight-based methods, where EDCSTFN and DMNet are better than GANSTFM and SSTSTF. As for the ndEdge scores, neural-network-based methods show outstanding advantages. These methods produce results that are less oversharpened compared to other types.

4.2. LGC Results

Four images from the LGC dataset were evaluated and demonstrated. In terms of the RMSE in Table 7, the neural-network-based methods rank first for most of the bands. Among them, GANSTFM, MOST, and SSTSTF achieve the best scores for the blue, green, and NIR bands, respectively. The weight-based methods perform better than the hybrid and dictionary-learning-based methods, where Fit-FC scores are higher than those of FSDAF and EBSCDL.
In terms of spectral consistency, GANSTFM, MOST, and SSTSTF exhibit higher scores, as shown in Table 8. To sum up, neural-network-based algorithms demonstrate superior performance for the LGC dataset, with SSTSTF ranking first.
In terms of structural similarity, Table 9 shows that the image structures predicted by the neural-network-based methods are the closest to the real image, with SSTSTF and GANSTFM achieving the best structural similarity. The lowest ndEdge scores confirm once again that neural-network-based methods tend to produce less oversharpening. The hybrid methods also perform well, as FSDAF achieves the highest structural score on the fourth image.

4.3. L8JX Results

Three images in the L8JX dataset were tested and evaluated. All algorithms run the one-pair fusion without offline training. Since the images are larger, the results are much different from those of CIA and LGC.
Table 10 presents the RMSE scores on the L8JX dataset. Except for the NIR band of the first image and the red band of the third image, first place is always taken by the neural-network-based algorithms. Among the neural networks, the SSTSTF algorithm achieves the best scores. A similar performance is also observed for the SSIM in Table 12. The ndEdge and ndLBP scores for L8JX are much smaller than those for CIA and LGC, but the advantages of the neural-network-based algorithms are diminished.
In the spectral consistency evaluation for L8JX, Table 11 shows that the hybrid and neural-network-based algorithms perform well, with VIPSTF achieving the best results for almost half of the metrics. CNN-based methods work well, too. Generally, the scores are close to each other.
By combining the results of Table 10, Table 11 and Table 12, it is concluded that the performance of the neural network algorithms is the best for L8JX, and the SSTSTF algorithm best suits this dataset. A similar conclusion can be drawn from a detailed comparison between Figure 3 and Figure 4, where two local blocks of the reconstructed images for row 42 on 3 October 2018 are shown.
Loss of details and colors can be observed directly in Figure 3. STARFM produces noticeable speckles. By introducing residual correction for STARFM, the image reconstructed by Fit-FC is blurry. The details of VIPSTF are similar to those of Fit-FC. In contrast, the color and detail of both FSDAF and SFSDAF are superior. There are significant differences in the results of the CNN-based algorithms. The color deviations of DMNet and EDCSTFN are severe, although the details are rich and distinguishable. GANSTFM and SSTSTF present the best colors. SSTSTF gives the richest details, which are the closest to the ground truth. When it comes to the details in the upper left corner, most neural network algorithms reconstruct them well, while weight-based and hybrid algorithms fail.
The conclusions in Figure 4 are similar to those in Figure 3. Excluding DMNet, EDCSTFN, and MOST, the colors of the other algorithms are natural. VIPSTF has more details than Fit-FC. Among all the images, the results produced by SSTSTF present the richest details. The conclusions of the visual evaluation of Figure 3 and Figure 4 are consistent with the quantitative scores, except for the spectral continuity of MOST, which is evaluated as having a small spectral loss.

4.4. L7STARFM Results

Three images from the L7STARFM dataset were tested. Table 13, Table 14 and Table 15 show the scores for radiation deviation, spectral fidelity, and structural similarity on the L7STARFM dataset, respectively. It is evident that the reconstructed spectral consistency of the L7STARFM dataset is significantly higher than that of the other three datasets. Additionally, the structural similarity is also much higher compared to CIA and LGC.
In terms of the radiometric deviation in Table 13, Fit-FC and VIPSTF outperform the other algorithms significantly on the second and third images. In terms of spectral consistency in Table 14, Fit-FC and VIPSTF perform the best. In terms of structural similarity, VIPSTF and CSSF have the highest scores. Generally, the weight-based methods perform the best for this dataset, although their advantage is small. The dictionary-learning-based algorithms also perform well for this dataset. This conclusion is quite different from the conclusions for the other datasets.

4.5. FY4ASRcolor Results

Three pairs of images from the FY4ASRcolor dataset were tested, which were captured at 5:30, 6:30, and 11:30, respectively. The large image size enables neural networks to be fully trained. Other images from nine moments were used for training, and they formed eight groups.
Table 16 presents the first measured results on the FY4ASRcolor dataset, where images at 5:30 are used as references to predict the high-resolution image at 6:30. Although the RMSE scores are large, the structural similarity is good, and the spectral consistency is acceptable when Q4 is considered. The huge RMSE errors may come from the fact that the scene is dark at 5:30. Although the scores are poor, neural-network-based algorithms perform better than weight-based and hybrid methods. Among the neural networks, the MOST algorithm achieves the best scores for all metrics.
Table 17 presents the second set of measured results on the FY4ASRcolor dataset, where images taken at 6:30 serve as references to predict the high-resolution image at 11:30. Compared to Table 16, the RMSE scores become smaller, but the spectral consistency deteriorates. Neural-network-based algorithms win the comparison, as they perform far better than weight-based and hybrid methods. Among the neural networks, GANSTFM achieves the best scores for all metrics.
By combining the results of Table 16 and Table 17, it is concluded that neural networks are best suited for FY4ASRcolor. SSTSTF and BiaSTF are the only algorithms that model the sensor difference explicitly, but they do not achieve the best results, which is partially explained by the negligible sensor discrepancy in FY4ASRcolor. The temporal difference is the main challenge for this homogeneous dataset.

5. Experiments for Change Detection

The results of spatiotemporal fusion can be used in downstream applications. It was mentioned in Section 2 that the results have been used for land use classification and quantitative applications such as biomass and surface temperature. Change detection is an important downstream application, but it has not been tested in existing spatiotemporal fusion work. Therefore, an experiment is conducted to evaluate the performance of the reconstructed images for change detection.

5.1. Experimental Scheme for Change Detection

Standard change detection is challenging due to the use of continuous labels and long time sequences. Alternatively, a simplified classification strategy is adopted to avoid the time-consuming cost of continuous labels. In the experiment, labels were manually assigned to a small number of discontinuous pixels in the reference and ground truth images. Then, a support vector machine (SVM) classifier was used to classify the reference image, the ground truth image, and the spatiotemporal fusion results with the given labels. Note that the parameters of the SVM are trained individually for each image. In the SVM, the radial basis function is used as the kernel, the gamma value is 0.1, and the regularization parameter (C) is 100. After the pixel-wise classification, the superpixel post-processing technique presented in [68] was harnessed to smooth the label fragments for better accuracy. After all pixels were given labels using the SVM and superpixel post-processing, changes can be detected by comparing the pixel types at different moments. The workflow is given in Figure 5.
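A minimal scikit-learn sketch of the per-image classification and comparison steps follows; the array shapes and function names are illustrative, and the superpixel post-processing from [68] is omitted.

```python
import numpy as np
from sklearn.svm import SVC

def classify_image(image, labeled_idx, labels):
    """Pixel-wise SVM classification of one image.

    image: array of shape (bands, H, W); labeled_idx: flat indices of the
    manually labelled pixels; labels: their class labels. The SVM settings
    follow the paper (RBF kernel, gamma=0.1, C=100).
    """
    bands, h, w = image.shape
    pixels = image.reshape(bands, -1).T              # (H*W, bands) feature matrix
    clf = SVC(kernel="rbf", gamma=0.1, C=100)
    clf.fit(pixels[labeled_idx], labels)
    return clf.predict(pixels).reshape(h, w)

def detect_change(label_map_ref, label_map_check, target_class):
    """Binary change map: pixels that entered or left the target class between dates."""
    return (label_map_ref == target_class) != (label_map_check == target_class)
```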
The first two image pairs for the test are from the CIA dataset. The reference time is 18 April 2002. The check dates are 9 November 2001 and 25 November 2001, respectively. These images are selected to illustrate the significant change from farmland to barren land. The image sizes are 512 × 512 with the blue, green, red, and NIR bands. The categories are set as farmland and non-farmland, so any changes in farmland can be detected. In the common reference image, 101,447 pixels were labeled. In the target ground truth images, 83,733 and 81,916 pixels were labeled, respectively.
The last image pair is from the LGC dataset. The location was prone to flooding, causing some areas to alternate between farmland and water area. The reference time is 28 December 2004, which was during a flood. The check day is 13 January 2005, when the flood had receded. The spatiotemporally fused images on the check day are classified for change detection. The image size is 500 × 1200 with the blue, green, red, and NIR bands. The categories are divided into water areas and non-water areas, allowing for the detection of changes in water areas. In the reference image, 63,764 pixels were labeled. In the target ground truth image, 79,548 pixels were labeled.
The results of the detected changes are presented in Figure 6, Figure 7 and Figure 8. The results are evaluated using metrics such as the intersection over union (IOU), F1-score, precision, recall, and overall accuracy (OA). The scores are presented in Table 18 and Table 19.
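For reference, the change detection scores can be computed from binary change maps as in the sketch below; these are the standard definitions of IOU, F1-score, precision, recall, and OA, since the paper does not spell out its formulas.

```python
import numpy as np

def change_scores(pred, truth):
    """IOU, F1, precision, recall, and overall accuracy for binary change maps."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return {"IOU": tp / (tp + fp + fn + 1e-12),
            "F1": 2 * precision * recall / (precision + recall + 1e-12),
            "precision": precision,
            "recall": recall,
            "OA": (tp + tn) / pred.size}
```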

5.2. Results for the CIA Dataset

In Table 18, STARFM outperforms the other algorithms for the CIA image. The best accuracy approaches about 0.9. CSSF ranks second, but there is a big gap between it and STARFM. Overall, the results of all the methods are not good. Figure 6 and Figure 7 show that most algorithms miss the obvious change areas, while STARFM covers the most changes. The poor scores on the CIA image result from the difficulty of identifying bare land. In addition, the SVM uses only spectral information to distinguish land features, so the rich structures synthesized by the neural network algorithms lose their effect. Moreover, neural-network-based methods are not suitable here due to the small scale of the data.

5.3. Results for the LGC Dataset

Table 19 shows that VIPSTF gives the best result for the LGC image. All algorithms present higher scores than for the CIA images, indicating that recognizing changes in water areas is much easier than recognizing changes in farmland. Except for SPSTFM, the accuracy scores of all the other algorithms exceed 0.9. Figure 8 confirms that the majority of the water area can be detected effectively. The optimal scores on the various datasets demonstrate the feasibility of using spatiotemporal fusion results for change detection. Since VIPSTF and STARFM both use the weighted combination strategy, it is reasonable to infer that weight-based spatiotemporal algorithms are more suitable for detecting changes at small image sizes.

6. Discussion

The motivation of this work is to address the questions raised in the first section, namely the upper performance boundaries, the stability of the algorithms, and the possibility of using CNNs for one-pair fusion. Answers to these questions can be inferred from the comparison of the experimental results.

6.1. Performance Analysis

As far as the numerical scores on the popular CIA dataset are concerned, the RMSE performance of the current algorithms is not significantly better than that of the STARFM algorithm proposed in 2006. When considering uncertainty, the relative uncertainties for the first image of the CIA dataset obtained by STARFM are 2.87%, 2.25%, 4.45%, 2.54%, 7.78%, 4.43%, 9.32%, 20.70%, 4.98%, 2.22%, 4.24%, 8.05%, 0.00%, 0.00%, 0.00%, 0.00%, 0.76%, 1.99%, 2.91%, and 6.46% higher than the best scores, respectively. For the fourth image of the CIA dataset, STARFM even achieves the best scores for all bands.
However, when considering the structural quality, the results obtained using STARFM are not as good as state-of-the-art methods because it tends to predict blurry images with a large number of speckles in smooth regions, which can be observed in Figure 3 and Figure 4. The neural-network-based algorithms produce acceptable images only when they are given sufficient training data. In this case, the results by SSTSTF show admirable spectral color and structural details. In summary, although the digital evaluation shows small improvement, the existing spatiotemporal fusion algorithms have made significant advancements in fusion quality. These algorithms have been able to remove speckles and adapt to abrupt changes or heterogeneous regions.
The best uncertainties show that the reconstruction accuracy of the NIR bands is generally higher than that of the red, green, and blue bands. In the L8JX and LGC datasets, the best uncertainties of NIR are all below 12%. The uncertainty scores over 20% are only observed in the CIA dataset.
More conclusions can be drawn from Table 20. It can be seen that the average uncertainty scores of LGC and L8JX are both less than 15%, indicating that the current spatiotemporal fusion methods are practically feasible. However, the best performance for LGC and L8JX is produced by CNN-based algorithms. When the training data are insufficient to effectively train the CNNs (e.g., L7STARFM), the smallest error in the red band of some images reaches 35.4%, which is too large to be acceptable. Therefore, the scale of the training data is possibly the most important factor determining the outcome of the fusion, followed by the algorithm itself.
It is observed that among the four bands, the red band is the most difficult to reconstruct. The reason may be attributed to the rich structure and high temporal sensitivity of the red band. Surface reflectance is generally used in spatiotemporal fusion, where the amplitude of the red band is negatively correlated with vegetation density. Limited by scale, a typical fusion scene can include woodland, farmland, and grassland, resulting in rich structures in the red band. The red spectrum of vegetation varies greatly with the changing of the seasons, resulting in a significant shift in the red band. In contrast, the blue band has a small intensity and less detail, and the green band is less sensitive to seasonal variations than the red band. Although the NIR band is also susceptible to temporal variations, it is spatially smooth.

6.2. Threshold of Uncertainty for Practical Use

When the popular uncertainty metric is used to assess the feasibility of spatiotemporal fusion for practical applications, a threshold helps to judge the fusion quality conveniently. The radiometric standards of the ground processing systems can be referenced. In terms of radiometric calibration targets, the uncertainty is uniformly set to 5% for the multispectral sensors of MODIS, Landsat-5 TM, Landsat-7 ETM+, and Landsat-8 OLI in the blue, green, red, and NIR bands. As for the actual uncertainty, it is within 2% for MODIS [69], 5% for Landsat-5 TM [70], 5% for Landsat-7 ETM+ [71], and 4% for Landsat-8 OLI [71].
Compared to radiometric calibration, the fusion problem is more similar to the cross-calibration problem, which has been widely investigated. Based on the radiometric values of MODIS as the baseline, the differences are 4% for Landsat-5 TM [72], 7% for Landsat-7 ETM+ [73], and 4% for Landsat-8 OLI [72]. In these studies, MODIS is commonly adopted as the calibration reference in the reflective solar spectral range due to its exceptional radiometric accuracy.
A simple criterion is needed to evaluate the practicality of fusion tasks. Due to the unavailability of high-resolution images at the target time, the error in spatiotemporal fusion is usually greater than that of cross-calibration. Consequently, it seems not practical to set the uncertainty threshold to 5% for spatiotemporal fusion.
As a result, this paper suggests a 10% uncertainty as the common threshold to determine the uniform availability of fusion results. Scores show that the best fusion results can reach to about 10%. Furthermore, several practices have proved the feasibility of the fusion results in its present form. The ideal threshold may vary depending on the type of downstream tasks. Quantitative applications such as surface temperature may pursue small uncertainty. Interpretive tasks, such as classification and segmentation, are less sensitive to uncertainty if the structures are rich.
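As a minimal illustration of how such a threshold could be applied (not part of the original experiments), the following sketch flags fused bands whose relative error exceeds 10%. The normalization of RMSE by the mean reference reflectance, as well as the function and variable names, are assumptions made for this example only.

```python
import numpy as np

def band_uncertainty(pred, truth):
    # Relative uncertainty of one fused band: RMSE normalized by the mean
    # reference reflectance (an assumed definition for illustration only).
    rmse = np.sqrt(np.mean((pred.astype(float) - truth.astype(float)) ** 2))
    return rmse / np.mean(truth)

def usable_bands(fused, reference, names, threshold=0.10):
    # Return the band names whose relative uncertainty stays below the threshold.
    return [n for n, p, t in zip(names, fused, reference)
            if band_uncertainty(p, t) < threshold]

# Hypothetical usage with random stand-ins for fused and reference bands.
rng = np.random.default_rng(0)
reference = [rng.uniform(200, 3000, (256, 256)) for _ in range(4)]
fused = [b + rng.normal(0, 60, b.shape) for b in reference]
print(usable_bands(fused, reference, ["blue", "green", "red", "NIR"]))
```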

6.3. Stability

An algorithm is considered stable in this paper when it consistently outperforms other algorithms. As the quality of fusion is influenced by image content and time intervals, it is unrealistic to expect an algorithm to perform equally well in all scenarios. Instead, certain algorithms may outperform others in specific scenarios, allowing us to choose the most suitable algorithm accordingly.
Table 21 shows that the CNN-based algorithms achieve the best performance on three out of the five datasets, while the weight-based algorithms rank first for CIA and L7STARFM. Although the hybrid algorithms fail to win first place, they perform much more stably, ranking second in four of the five comparisons. In short, hybrid algorithms handle all cases steadily regardless of the training scale, whereas neural networks work stably when the training data are sufficient.
When the stability of each individual algorithm is considered, the picture is different. Conclusions can be drawn from the datasets on which neural networks have an advantage. On LGC, GANSTFM ranked first and SSTSTF ranked second; on L8JX, SSTSTF ranked first and DMNet ranked second; on FY4ASRcolor, GANSTFM and MOST both ranked first. This shows that GANSTFM and SSTSTF are superior in both performance and stability compared with the other neural-network-based algorithms.
The remaining two datasets can be analyzed for other types of algorithms. In CIA, Fit-FC ranked first and FSDAF ranked second. In L7STARFM, Fit-FC ranked first and VIPSTF ranked second. This shows that Fit-FC is more stable among the weight-based algorithms that were tested.

6.4. CNN for One-Pair Spatiotemporal Fusion

It can be concluded from Table 21 that the performance of CNN-based spatiotemporal fusion algorithms is greatly affected by the size of the single input image. An image has to be cropped into patches before being fed into the networks, so a larger single image yields a larger number of patches from the same acquisition, as illustrated by the sketch below. The algorithms only perform well with sufficient training data, either many groups of small reference images for offline training (as in CIA and LGC) or a single group of large reference images for online one-pair fusion (as in L8JX and FY4ASRcolor).
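The following sketch makes the patch count concrete for two of the tested image sizes; the 256 × 256 patch size and non-overlapping stride are illustrative assumptions, since the actual cropping settings differ between the networks (see Table 3).

```python
def count_patches(height, width, patch=256, stride=256):
    # Number of patches one image of the given size yields when cropped
    # on a regular grid (non-overlapping when stride == patch).
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols

print(count_patches(1408, 1824))  # CIA-sized scene  -> 35 patches
print(count_patches(5792, 5488))  # L8JX-sized scene -> 462 patches
```

Under these assumed settings, a single L8JX-sized acquisition supplies roughly an order of magnitude more training patches than a CIA-sized one, which matches the ranking differences discussed next.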
The data scales of CIA and L8JX are similar, but the ranks of the CNN-based methods differ significantly between them. Comparing their results suggests that the quality of CNN-based algorithms is influenced more strongly by the image size. Larger images allow the networks to better model spatial differences and recognize land covers, whereas a larger number of image pairs leads to better modeling of temporal differences. If spatiotemporal fusion is understood as an aggregation of various categories, as in the unmixing-based methods, the accurate extraction of ground features with an encoder forms the foundation for learning temporal differences.
Therefore, we can conclude that CNN-based algorithms can perform one-pair spatiotemporal fusion only when the image size is sufficiently large. If this condition is not satisfied, hybrid algorithms are a sound alternative. On the other hand, when large images with long time series are available, CNN-based algorithms can further improve performance by learning temporal differences. When the amount of data grows even larger, neural networks could be constructed with a transformer or a diffusion model.
The results of the weight-based methods are independent of image size. The unmixing-based and unmixing-involved hybrid methods are extremely slow when clustering large images, so the L8JX images had to be divided into four blocks and fused separately (a sketch of this block-wise processing is given below). The CNN-based methods perform the best on LGC, L8JX, and FY4ASRcolor, which is consistent with their large image sizes. In particular, on the FY4ASRcolor dataset, which has the largest image size, the neural-network-based algorithms achieve the optimal values for all bands. Therefore, the ability of neural networks to learn from big data makes them more promising than the other types of algorithms.
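The block-wise processing applied to L8JX can be sketched as follows; the 2 × 2 quadrant split and the placeholder fuse function are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np

def fuse_in_blocks(coarse, fine_ref, fuse_fn, blocks=2):
    # Split a large (bands, H, W) scene into blocks x blocks tiles, fuse each
    # tile separately with fuse_fn, and mosaic the outputs back together.
    _, h, w = coarse.shape
    out = np.empty_like(fine_ref)
    for rows in np.array_split(np.arange(h), blocks):
        for cols in np.array_split(np.arange(w), blocks):
            r0, r1 = rows[0], rows[-1] + 1
            c0, c1 = cols[0], cols[-1] + 1
            out[:, r0:r1, c0:c1] = fuse_fn(coarse[:, r0:r1, c0:c1],
                                           fine_ref[:, r0:r1, c0:c1])
    return out

# Hypothetical usage: an identity "fusion" on a small stand-in scene.
scene = np.random.rand(4, 512, 512).astype(np.float32)
mosaic = fuse_in_blocks(scene, scene, lambda coarse, fine: fine, blocks=2)
assert mosaic.shape == scene.shape
```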

6.5. Similarity and Content for Fusion

To further clarify the key factors influencing fusion quality, Table 22 and Table 23 relate the average results of Fit-FC and FSDAF to the input image pairs. These two algorithms were chosen because they suffer less performance loss from small image sizes. The NIR band is excluded because its amplitude is not comparable to that of the other bands.
Table 22 and Table 23 demonstrate that the time interval between the reference time and the prediction time has the strongest influence on the reconstruction results. The smallest error occurs for the 16-day interval of LGC, the second smallest for the 32-day interval of LGC, and the third smallest for the 32-day interval of CIA. The largest error occurs for the 176-day interval of CIA, which is effectively the maximum interval within a cycle of four seasons. The errors for the 32-day interval are significantly larger than those for the 16-day interval, but the average uncertainty remains below 15%. The importance of a short time interval indicates that the similarity between the reference image and the target image is the main factor influencing the reconstruction performance, a finding validated on all five datasets.
Besides the time interval, the correlation coefficient is another available indicator, which is especially useful when the time intervals are the same. For the CIA dataset, the correlation coefficient of the first image pair is significantly higher than that of the other four pairs, and its uncertainty is the lowest. The same observation holds for the third image of LGC, the first image of L8JX, and the second image of L7STARFM. Comparing images 1 and 4 of LGC in Table 23 also suggests that a higher correlation coefficient leads to better performance.
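A minimal sketch of how such a pair-wise correlation coefficient could be computed is given below, averaging the Pearson correlation of the blue, green, and red bands as in the notes of Tables 22 and 23; the function and variable names are assumptions for illustration.

```python
import numpy as np

def pair_correlation(reference, target, bands=(0, 1, 2)):
    # Average Pearson correlation between the reference and target images
    # over the selected bands (blue, green, and red by default).
    ccs = []
    for b in bands:
        x = reference[b].ravel().astype(float)
        y = target[b].ravel().astype(float)
        ccs.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(ccs))

# Hypothetical usage: a pair with little change correlates strongly.
rng = np.random.default_rng(1)
ref = rng.uniform(0, 1, (4, 128, 128))
tgt = 0.9 * ref + 0.1 * rng.uniform(0, 1, ref.shape)
print(round(pair_correlation(ref, tgt), 3))
```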
In terms of heterogeneity, CIA is considered a homogeneous region and LGC a heterogeneous one. It is evident from Table 22 and Table 23 that Fit-FC performs better on CIA while FSDAF wins on LGC, which is consistent with their design motivations: FSDAF focuses on heterogeneous or changing land covers, while Fit-FC works well for homogeneous areas. Similar conclusions have been drawn in [42].

6.6. Ranking the Algorithms and Metrics

Rankings can also be given within each class of spatiotemporal fusion algorithms. Among the neural-network-based methods, SSTSTF achieves the highest scores on both the CIA and L8JX datasets. Among the weight-based methods, Fit-FC performs the best, as it outperforms the other weight-based algorithms on all datasets except L8JX. Among the dictionary-based algorithms, CSSF is the best for LGC, L7STARFM, and FY4ASRcolor. Among the hybrid methods, FSDAF consistently outperforms SFSDAF.
As far as the metrics are concerned, RMSE and SSIM are classical and accurate measures of radiometric and structural error, respectively. Among the spectral criteria, ERGAS shows the best stability. In the change detection experiment, the IOU and F1-score give similar results with admirable stability.
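For reference, a minimal sketch of the two change-detection criteria mentioned above, IOU and F1-score, computed from binary change masks, together with band-wise RMSE as the radiometric counterpart; the function names are illustrative and not taken from the evaluation code used here.

```python
import numpy as np

def rmse(pred, truth):
    # Radiometric error of one fused band.
    diff = pred.astype(float) - truth.astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

def iou_f1(pred_mask, true_mask):
    # IOU and F1-score of a binary change map against the ground truth.
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    tp = np.logical_and(pred, true).sum()
    fp = np.logical_and(pred, ~true).sum()
    fn = np.logical_and(~pred, true).sum()
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(f1)

# Hypothetical usage with toy masks: half the scene changed, a few rows mislabeled.
truth = np.zeros((64, 64), dtype=bool)
truth[:32] = True
pred = truth.copy()
pred[30:34] = ~pred[30:34]
print(iou_f1(pred, truth))
```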

7. Conclusions

Given the wide application of deep learning to spatiotemporal fusion, there is a growing demand to identify its practical application scenarios. To assess the feasibility of CNNs for one-pair spatiotemporal fusion, a new dataset with large single-image sizes is designed for both training and testing. The potential of change detection with fused images is also investigated. These issues are addressed by preparing fourteen fusion algorithms and five datasets for comparison. A comprehensive experiment is conducted to illustrate the variation in the performance of spatiotemporal fusion algorithms with respect to different sensors and image sizes. The reconstruction results are assessed in terms of radiometric, spectral, and structural losses, and some of the results are further tested for the feasibility of change detection. The experiment shows that convolutional neural networks can be used for one-pair spatiotemporal fusion if the single image is sufficiently large (e.g., 6000 × 6000). It also confirms that the spatiotemporally fused images can be used for change detection in certain scenes.

Author Contributions

Data curation, L.C.; Investigation, J.W.; Software, L.C.; Writing—original draft, J.W.; Writing—review and editing, Y.H. and Z.C.; Resources, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 42267070).

Data Availability Statement

The L8JX dataset can be downloaded from https://github.com/isstncu/l8jx (accessed on 20 July 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gao, F.; Hilker, T.; Zhu, X.; Anderson, M.; Masek, J.; Wang, P.; Yang, Y. Fusing Landsat and MODIS Data for Vegetation Monitoring. IEEE Geosci. Remote Sens. Mag. 2015, 3, 47–60. [Google Scholar]
  2. Zhu, X.; Cai, F.; Tian, J.; Williams, T.K.A. Spatiotemporal Fusion of Multisource Remote Sensing Data: Literature Survey, Taxonomy, Principles, Applications, and Future Directions. Remote Sens. 2018, 10, 527. [Google Scholar]
  3. Gao, F.; Masek, J.; Schwaller, M.; Hall, F. On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance. IEEE Trans. Geosci. Remote. Sens. 2006, 44, 2207–2218. [Google Scholar]
  4. Zhu, X.; Helmer, E.H.; Gao, F.; Liu, D.; Chen, J.; Lefsky, M.A. A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote. Sens. Environ. 2016, 172, 165–177. [Google Scholar] [CrossRef]
  5. Tan, Z.; Yue, P.; Di, L.; Tang, J. Deriving High Spatiotemporal Remote Sensing Images Using Deep Convolutional Network. Remote Sens. 2018, 10, 1066. [Google Scholar]
  6. Tan, Z.; Di, L.; Zhang, M.; Guo, L.; Gao, M. An Enhanced Deep Convolutional Model for Spatiotemporal Image Fusion. Remote Sens. 2019, 11, 1066. [Google Scholar] [CrossRef] [Green Version]
  7. Shang, C.; Li, X.; Yin, Z.; Li, X.; Wang, L.; Zhang, Y.; Du, Y.; Ling, F. Spatiotemporal Reflectance Fusion Using a Generative Adversarial Network. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  8. Huang, Z.Q.; Li, Y.J.; Bai, M.H.; Wei, Q.; Gu, Q.; Mou, Z.J.; Zhang, L.P.; Lei, D.J. A Multiscale Spatiotemporal Fusion Network Based on an Attention Mechanism. Remote Sens. 2023, 15, 182. [Google Scholar]
  9. Lei, D.J.; Ran, G.S.; Zhang, L.P.; Li, W.S. A Spatiotemporal Fusion Method Based on Multiscale Feature Extraction and Spatial Channel Attention Mechanism. Remote Sens. 2022, 14, 461. [Google Scholar] [CrossRef]
  10. Qin, P.; Huang, H.B.; Tang, H.L.; Wang, J.; Liu, C. MUSTFN: A spatiotemporal fusion method for multi-scale and multi-sensor remote sensing images based on a convolutional neural network. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103113. [Google Scholar]
  11. Li, W.; Zhang, X.; Peng, Y.; Dong, M. Spatiotemporal Fusion of Remote Sensing Images using a Convolutional Neural Network with Attention and Multiscale Mechanisms. Int. J. Remote. Sens. 2020, 42, 1973–1993. [Google Scholar] [CrossRef]
  12. Cao, H.M.; Luo, X.B.; Peng, Y.D.; Xie, T.S. MANet: A Network Architecture for Remote Sensing Spatiotemporal Fusion Based on Multiscale and Attention Mechanisms. Remote Sens. 2022, 14, 4600. [Google Scholar] [CrossRef]
  13. Li, W.S.; Wu, F.Y.; Cao, D.W. Dual-Branch Remote Sensing Spatiotemporal Fusion Network Based on Selection Kernel Mechanism. Remote Sens. 2022, 14, 4282. [Google Scholar] [CrossRef]
  14. Cheng, F.F.; Fu, Z.T.; Tang, B.H.; Huang, L.; Huang, K.; Ji, X.R. STF-EGFA: A Remote Sensing Spatiotemporal Fusion Network with Edge-Guided Feature Attention. Remote Sens. 2022, 14, 4282. [Google Scholar]
  15. Liu, H.; Yang, G.; Deng, F.; Qian, Y.; Fan, Y. MCBAM-GAN: The Gan Spatiotemporal Fusion Model Based on Multiscale and CBAM for Remote Sensing Images. Remote Sens. 2023, 15, 1583. [Google Scholar] [CrossRef]
  16. Zhu, X.; Chen, J.; Gao, F.; Chen, X.; Masek, J.G. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote. Sens. Environ. 2010, 114, 2610–2623. [Google Scholar] [CrossRef]
  17. Emelyanova, I.V.; McVicar, T.R.; Van Niel, T.G.; Li, L.T.; van Dijk, A.I.J.M. Assessing the accuracy of blending Landsat-MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection. Remote. Sens. Environ. 2013, 133, 193–209. [Google Scholar] [CrossRef]
  18. Ding, M.; Guan, Q.; Li, L.; Zhang, H.; Liu, C.; Zhang, L. Phenology-Based Rice Paddy Mapping Using Multi-Source Satellite Imagery and a Fusion Algorithm Applied to the Poyang Lake Plain, Southern China. Remote Sens. 2020, 12, 1022. [Google Scholar] [CrossRef] [Green Version]
  19. Wu, P.; Shen, H.; Zhang, L.; Gottsche, F.M. Integrated fusion of multi-scale polar-orbiting and geostationary satellite observations for the mapping of high spatial and temporal resolution land surface temperature. Remote. Sens. Environ. 2015, 156, 169–181. [Google Scholar] [CrossRef]
  20. Wei, J.; Tang, W.; He, C. Enblending Mosaicked Remote Sensing Images with Spatiotemporal Fusion of Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 5891–5902. [Google Scholar] [CrossRef]
  21. Wei, J.; Zhou, C.; Wang, J.; Chen, Z. Time-Series FY4A Datasets for Super-Resolution Benchmarking of Meteorological Satellite Images. Remote Sens. 2022, 14, 5594. [Google Scholar] [CrossRef]
  22. Sun, Y.c.; Zhang, H.; Shi, W. A spatio-temporal fusion method for remote sensing data Using a linear injection model and local neighbourhood information. Int. J. Remote. Sens. 2018, 40, 2965–2985. [Google Scholar] [CrossRef]
  23. Wang, Q.; Atkinson, P.M. Spatio-temporal fusion for daily Sentinel-2 images. Remote. Sens. Environ. 2018, 204, 31–42. [Google Scholar] [CrossRef] [Green Version]
  24. Zhukov, B.; Oertel, D.; Lanzl, F.; Reinhackel, G. Unmixing-based multisensor multiresolution image fusion. IEEE Trans. Geosci. Remote. Sens. 1999, 37, 1212–1226. [Google Scholar] [CrossRef]
  25. Wu, M.; Niu, Z.; Wang, C.; Wu, C.; Wang, L. Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion model. J. Appl. Remote. Sens. 2012, 6, 063507. [Google Scholar]
  26. Lu, M.; Chen, J.; Tang, H.; Rao, Y.; Yang, P.; Wu, W. Land cover change detection by integrating object-based data blending model of Landsat and MODIS. Remote. Sens. Environ. 2016, 184, 374–386. [Google Scholar] [CrossRef]
  27. Wang, Q.; Peng, K.; Tang, Y.; Tong, X.; Atkinson, P.M. Blocks-removed spatial unmixing for downscaling MODIS images. Remote. Sens. Environ. 2021, 256, 112325. [Google Scholar] [CrossRef]
  28. Peng, K.; Wang, Q.; Tang, Y.; Tong, X.; Atkinson, P.M. Geographically Weighted Spatial Unmixing for Spatiotemporal Fusion. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–17. [Google Scholar] [CrossRef]
  29. Salazar, A.; Vergara, L.; Vidal, E. A proxy learning curve for the Bayes classifier. Pattern Recognit. 2023, 136, 109240. [Google Scholar] [CrossRef]
  30. Huang, B.; Song, H. Spatiotemporal Reflectance Fusion via Sparse Representation. IEEE Trans. Geosci. Remote. Sens. 2012, 50, 3707–3716. [Google Scholar] [CrossRef]
  31. Wu, B.; Huang, B.; Zhang, L. An Error-Bound-Regularized Sparse Coding for Spatiotemporal Reflectance Fusion. IEEE Trans. Geosci. Remote. Sens. 2015, 53, 6791–6803. [Google Scholar] [CrossRef]
  32. Wei, J.; Wang, L.; Liu, P.; Song, W. Spatiotemporal Fusion of Remote Sensing Images with Structural Sparsity and Semi-Coupled Dictionary Learning. Remote Sens. 2017, 9, 21. [Google Scholar] [CrossRef] [Green Version]
  33. Wei, J.; Wang, L.; Liu, P.; Chen, X.; Li, W.; Zomaya, A.Y. Spatiotemporal Fusion of MODIS and Landsat-7 Reflectance Images via Compressed Sensing. IEEE Trans. Geosci. Remote. Sens. 2017, 55, 7126–7139. [Google Scholar] [CrossRef]
  34. Li, Y.; Li, J.; He, L.; Chen, J.; Plaza, A. A new sensor bias-driven spatio-temporal fusion model based on convolutional neural networks. Sci. China-Inf. Sci. 2020, 63, 1–16. [Google Scholar] [CrossRef] [PubMed]
  35. Li, W.; Zhang, X.; Peng, Y.; Dong, M. DMNet: A Network Architecture Using Dilated Convolution and Multiscale Mechanisms for Spatiotemporal Fusion of Remote Sensing Images. IEEE Sens. J. 2020, 20, 12190–12202. [Google Scholar] [CrossRef]
  36. Tan, Z.; Gao, M.; Li, X.; Jiang, L. A Flexible Reference-Insensitive Spatiotemporal Fusion Model for Remote Sensing Images Using Conditional Generative Adversarial Network. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  37. Ma, Y.; Wei, J.; Tang, W.; Tang, R. Explicit and stepwise models for spatiotemporal fusion of remote sensing images with deep neural networks. Int. J. Appl. Earth Obs. Geoinf. 2021, 105, 102611. [Google Scholar] [CrossRef]
  38. Li, X.D.; Foody, G.M.; Boyd, D.S.; Ge, Y.; Zhang, Y.; Du, Y.; Ling, F. SFSDAF: An enhanced FSDAF that incorporates sub-pixel class fraction change information for spatio-temporal image fusion. Remote Sens. Environ. 2020, 237, 111537. [Google Scholar] [CrossRef]
  39. Guo, D.; Shi, W.; Hao, M.; Zhu, X. FSDAF 2.0: Improving the performance of retrieving land cover changes and preserving spatial details. Remote Sens. Environ. 2020, 248, 111973. [Google Scholar] [CrossRef]
  40. Shi, C.; Wang, X.; Zhang, M.; Liang, X.; Niu, L.; Han, H.; Zhu, X. A Comprehensive and Automated Fusion Method: The Enhanced Flexible Spatiotemporal DAta Fusion Model for Monitoring Dynamic Changes of Land Surface. Appl. Sci. 2019, 9, 3693. [Google Scholar] [CrossRef] [Green Version]
  41. Xu, Y.; Huang, B.; Xu, Y.; Cao, K.; Guo, C.; Meng, D. Spatial and Temporal Image Fusion via Regularized Spatial Unmixing. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1362–1366. [Google Scholar]
  42. Ma, Y.; Wei, J.; Huang, X. Integration of One-Pair Spatiotemporal Fusion With Moment Decomposition for Better Stability. Front. Environ. Sci. 2021, 9, 731452. [Google Scholar] [CrossRef]
  43. Fung, C.H.; Wong, M.S.; Chan, P.W. Spatio-Temporal Data Fusion for Satellite Images Using Hopfield Neural Network. Remote. Sens. 2019, 11, 2077. [Google Scholar] [CrossRef] [Green Version]
  44. Wu, J.; Cheng, Q.; Li, H.; Li, S.; Guan, X.; Shen, H. Spatiotemporal Fusion With Only Two Remote Sensing Images as Input. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 13, 6206–6219. [Google Scholar] [CrossRef]
  45. Liu, P.; Li, J.; Wang, L.; He, G. Remote Sensing Data Fusion With Generative Adversarial Networks: State-of-the-art methods and future research directions. IEEE Geosci. Remote. Sens. Mag. 2022, 10, 295–328. [Google Scholar] [CrossRef]
  46. Li, Y.; Li, J.; Zhang, S. An Extremely Fast Spatio-Temporal Fusion Method for Remotely Sensed Images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4452–4455. [Google Scholar]
  47. Gao, H.; Zhu, X.; Guan, Q.; Yang, X.; Yao, Y.; Zeng, W.; Peng, X. cuFSDAF: An Enhanced Flexible Spatiotemporal Data Fusion Algorithm Parallelized Using Graphics Processing Units. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  48. Shao, Z.; Cai, J.; Fu, P.; Hu, L.; Liu, T. Deep learning-based fusion of Landsat-8 and Sentinel-2 images for a harmonized surface reflectance product. Remote Sens. Environ. 2019, 235, 111425. [Google Scholar] [CrossRef]
  49. Tang, Y.; Wang, Q. On the Effect of Misregistration on Spatio-temporal Fusion. In Proceedings of the 2019 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), Shanghai, China, 5–7 August 2019; pp. 1–4. [Google Scholar]
  50. Wang, L.; Wang, X.; Wang, Q.; Atkinson, P.M. Investigating the Influence of Registration Errors on the Patch-Based Spatio-Temporal Fusion Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 13, 6291–6307. [Google Scholar] [CrossRef]
  51. Luo, Y.; Guan, K.; Peng, J. STAIR: A generic and fully-automated method to fuse multiple sources of optical satellite data to generate a high-resolution, daily and cloud-/gap-free surface reflectance product. Remote. Sens. Environ. 2018, 214, 87–99. [Google Scholar]
  52. Chen, B.; Xu, B. A unified spatial-spectral-temporal fusion model using Landsat and MODIS imagery. In Proceedings of the 2014 Third International Workshop on Earth Observation and Remote Sensing Applications (EORSA), Changsha, China, 11–14 June 2014; pp. 256–260. [Google Scholar]
  53. Wei, J.; Yang, H.; Tang, W.; Li, Q. Spatiotemporal-Spectral Fusion for Gaofen-1 Satellite Images. IEEE Geosci. Remote. Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  54. Rao, J.M.; Rao, C.V.; Kumar, A.S.; Lakshmi, B.; Dadhwal, V.K. Spatiotemporal Data Fusion Using Temporal High-Pass Modulation and Edge Primitives. IEEE Trans. Geosci. Remote. Sens. 2015, 53, 5853–5860. [Google Scholar]
  55. Zheng, Y.; Wu, B.; Zhang, M.; Zeng, H. Crop Phenology Detection Using High Spatio-Temporal Resolution Data Fused from SPOT5 and MODIS Products. Sensors 2016, 16, 2099. [Google Scholar] [CrossRef] [Green Version]
  56. Amorós-López, J.; Gómez-Chova, L.; Alonso, L.; Guanter, L.; Zurita-Milla, R.; Moreno, J.F.; Camps-Valls, G. Multitemporal fusion of Landsat/TM and ENVISAT/MERIS for crop monitoring. Int. J. Appl. Earth Obs. Geoinf. 2013, 23, 132–141. [Google Scholar] [CrossRef]
  57. Kwan, C.; Zhu, X.; Gao, F.; Chou, B.; Perez, D.; Li, J.; Shen, Y.; Koperski, K.; Marchisio, G. Assessment of Spatiotemporal Fusion Algorithms for Planet and Worldview Images. Sensors 2018, 18, 1051. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  58. Xin, Q.; Olofsson, P.; Zhu, Z.; Tan, B.; Woodcock, C. Toward near real-time monitoring of forest disturbance by fusion of MODIS and Landsat data. Remote Sens. Environ. 2013, 135, 234–247. [Google Scholar] [CrossRef]
  59. Zhang, B.; Zhang, L.; Xie, D.; Yin, X.; Liu, C.; Liu, G. Application of Synthetic NDVI Time Series Blended from Landsat and MODIS Data for Grassland Biomass Estimation. Remote Sens. 2016, 8, 10. [Google Scholar] [CrossRef] [Green Version]
  60. Guo, S.; Sun, B.; Zhang, H.K.; Liu, J.; Chen, J.; Wang, J.; Jiang, X.; Yang, Y. MODIS ocean color product downscaling via spatio-temporal fusion and regression: The case of chlorophyll-a in coastal waters. Int. J. Appl. Earth Obs. Geoinf. 2018, 73, 340–361. [Google Scholar] [CrossRef]
  61. Addesso, P.; Longo, M.; Restaino, R.; Vivone, G. Spatio-temporal resolution enhancement for cloudy thermal sequences. Eur. J. Remote. Sens. 2019, 52, 2–14. [Google Scholar] [CrossRef] [Green Version]
  62. Shi, C.; Wang, N.; Zhang, Q.; Liu, Z.; Zhu, X. A Comprehensive Flexible Spatiotemporal DAta Fusion Method (CFSDAF) for Generating High Spatiotemporal Resolution Land Surface Temperature in Urban Area. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 15, 9885–9899. [Google Scholar] [CrossRef]
  63. Wang, Q.; Tang, Y.; Tong, X.; Atkinson, P.M. Virtual image pair-based spatio-temporal fusion. Remote Sens. Environ. 2020, 249, 112009. [Google Scholar] [CrossRef]
  64. Ranchin, T.; Wald, L. Fusion of high spatial and spectral resolution images: The ARSIS concept and its implementation. Photogramm. Eng. Remote. Sens. 2000, 66, 49–61. [Google Scholar]
  65. Du, Q.; Younan, N.H.; King, R.; Shah, V.P. On the Performance Evaluation of Pan-Sharpening Techniques. IEEE Geosci. Remote. Sens. Lett. 2007, 4, 518–522. [Google Scholar] [CrossRef]
  66. Alparone, L.; Baronti, S.; Garzelli, A.; Nencini, F. A Global Quality Measurement of Pan-Sharpened Multispectral Imagery. IEEE Geosci. Remote. Sens. Lett. 2004, 1, 313–317. [Google Scholar] [CrossRef]
  67. Chen, Y.; Cao, R.; Chen, J.; Zhu, X.; Zhou, J.; Wang, G.; Shen, M.; Chen, X.; Yang, W. A New Cross-Fusion Method to Automatically Determine the Optimal Input Image Pairs for NDVI Spatiotemporal Data Fusion. IEEE Trans. Geosci. Remote. Sens. 2020, 58, 5179–5194. [Google Scholar] [CrossRef]
  68. Ma, Y.; Deng, X.; Wei, J. Land Use Classification of High-Resolution Multispectral Satellite Images with Fine-Grained Multiscale Networks and Superpixel Post Processing. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2023, 16, 3264–3278. [Google Scholar] [CrossRef]
  69. Xiong, X.; Angal, A.; Barnes, W.L.; Chen, H.; Chiang, V.; Geng, X.; Li, Y.; Twedt, K.; Wang, Z.; Wilson, T.; et al. Updates of Moderate Resolution Imaging Spectroradiometer on-orbit calibration uncertainty assessments. J. Appl. Remote. Sens. 2018, 12, 034001. [Google Scholar] [CrossRef]
  70. Helder, D.L.; Karki, S.; Bhatt, R.; Micijevic, E.; Aaron, D.; Jasinski, B. Radiometric Calibration of the Landsat MSS Sensor Series. IEEE Trans. Geosci. Remote. Sens. 2012, 50, 2380–2399. [Google Scholar] [CrossRef]
  71. Mishra, N.; Haque, M.O.; Leigh, L.; Aaron, D.; Helder, D.; Markham, B. Radiometric Cross Calibration of Landsat 8 Operational Land Imager (OLI) and Landsat 7 Enhanced Thematic Mapper Plus (ETM plus). Remote Sens. 2014, 6, 12619–12638. [Google Scholar] [CrossRef] [Green Version]
  72. Angal, A.; Mishra, N.; Xiong, X.J.; Helder, D. Cross-calibration of Landsat 5 TM and Landsat 8 OLI with Aqua MODIS using PICS. In Proceedings of the Earth Observing Systems XIX Conference on Earth Observing Systems XIX, San Diego, CA, USA, 18–20 August 2014. [Google Scholar] [CrossRef]
  73. Angal, A.; Xiong, X.; Wu, A.; Chander, G.; Choi, T. Multitemporal Cross-Calibration of the Terra MODIS and Landsat 7 ETM+ Reflective Solar Bands. IEEE Trans. Geosci. Remote Sens. 2013, 51, 1870–1882. [Google Scholar] [CrossRef]
Figure 1. Data for Spatiotemporal Fusion.
Figure 2. Spectral response functions of Landsat series and MODIS.
Figure 3. Local manifestation of the red, green, and blue bands of the first L8JX image block (extracted from row 42, 3 October 2018).
Figure 4. Local manifestation of the red, green, and blue bands of the second L8JX image block (extracted from row 42, 3 October 2018).
Figure 5. Work flow of change detection with incomplete labels.
Figure 6. Change detection between 18 April 2002 and 9 November 2001 (white for changes). The reference and ground truth images are from the CIA dataset, while others are from the spatiotemporally fused images.
Figure 7. Change detection between 18 April 2002 and 25 November 2001 (white for changes). The reference and ground truth images are from the CIA dataset, while others are from the spatiotemporally fused images.
Figure 8. Change detection between 28 December 2004 and 13 January 2005 (white for changes). The reference and ground truth images are from the LGC dataset, while others are from the spatiotemporally fused images.
Table 1. Summary of the datasets.
Dataset | Columns * | Rows * | Bands | Amount | Location
L7STARFM | 1200 | 1200 | green, red, NIR | 3 | Canada (104°W, 54°N)
CIA | 1408 | 1824 | blue, green, red, NIR | 17 | Australia (145.0675°E, 34.0034°S)
LGC | 3184 | 2704 | blue, green, red, NIR | 14 | Australia (149.2815°E, 29.0855°S)
L8JX | 5792 | 5488 | blue, green, red, NIR | 9 | China (115.8247°E, 25.9868°N)
FY4ASRcolor | 10,992 | 4368 | blue, red–green, VNIR | 165 | whole China
* The columns and rows of CIA and LGC are smaller than the original sizes after the black borders are removed.
Table 2. Hardware and software for experiments.
Hardware | RAM: 128 GB | CPU: Intel Xeon E5-2682 v4 @ 2.50 GHz | GPU: nVidia Tesla V100
Software | Python: 3.6.2 | CUDA: 10.2 | PyTorch: 1.2.0
Table 3. Parameters for CNN training.
Algorithm | Optimizer | Initial Learning Rate | Epochs | Batch Size | Patch Size | Training Time (s)
BiaSTF | Adam | 0.0001 | 300 | 64 | 128 × 128 | 10,647
DMNet | Adam | 0.001 | 60 | 32 | 60 × 60 | 15,609
EDCSTFN | Adam | 0.0001 | 60 | 36 | 256 × 256 | 3036
GANSTFM | Adam | 0.0002 | 300 | 32 | 256 × 256 | 18,445
MOST | Adam | 0.0001 | 300 | 16 | 54 × 54 | 18,566
SSTSTF | Adam | 0.0001 | 300 | 36 | 256 × 256 | 25,427
Table 4. RMSE evaluation on the CIA (LandSat-7) dataset.
target4 December 20012 November 200117 October 20019 November 200125 November 2001
reference5 January 200213 February 200211 April 200218 April 200218 April 2002
bandbluegreenredNIRbluegreenredNIRbluegreenredNIRbluegreenredNIRbluegreenredNIR
mean71310691474255050778398823764697088592431439724971220750284112032234
CC0.8220.8050.8610.3930.4130.4350.406−0.4180.2310.2620.249−0.2100.3820.3100.330−0.1720.4460.3340.315−0.149
STARFM125.7186.5293.4526.3175.1221.8362.1913.7169.8200.3324.5846.986.3115.2189.9465.2151.5203.5345.0624.4
Fit-FC101.0145.5223.5457.9131.0170.4261.6562.7148.3187.0284.7671.2119.9162.3247.5556.3136.7187.0296.1473.2
VIPSTF120.3176.1223.2464.0162.7207.9292.1613.3168.3203.1312.2711.6144.1194.5286.8583.6157.6225.6346.7491.1
FSDAF114.5168.3256.5481.0162.4205.6328.1825.4173.2200.9319.9856.2143.1179.5287.1704.0146.8198.4328.7621.7
SFSDAF117.5171.8269.2482.5160.0202.6320.9823.6173.6201.7316.3854.6146.4184.4292.4708.4152.5203.8335.4627.5
SPSTFM156.4239.1366.1733.4234.2301.8497.51260.6208.4248.4419.41172.2183.0240.3391.1992.8175.9250.8427.8899.1
EBSCDL136.9204.8315.4604.6188.9241.0392.11080.7178.2209.0340.41003.0153.1197.5315.2837.1154.7214.4357.1748.9
CSSF139.3208.5321.8602.2195.4249.5408.81091.9184.8217.1354.01025.9156.5203.5325.2869.9158.0221.2371.0774.3
BiaSTF138.5207.7317.7584.3192.0238.4384.81032.0197.9229.5349.1971.0152.2196.5312.0820.6165.6220.9359.2737.7
DMNet120.8167.2256.7544.9170.1211.4376.3940.3156.2206.8330.6909.2140.1180.3294.0769.4154.1212.9345.9682.9
EDCSTFN121.3185.1263.2764.8201.3239.1418.81245.6164.2205.1363.8985.6141.3184.6315.9823.8149.0213.0388.2748.8
GANSTFM142.6187.0275.0539.3169.8197.4303.7893.0184.8200.5330.6908.1135.2170.2268.2738.1131.4180.3313.2617.5
MOST123.7163.4253.6571.6164.8209.7342.6802.8193.0220.0377.5957.8143.6184.8298.0733.6146.1201.1341.0671.1
SSTSTF129.0182.4291.3530.6138.9178.2291.7827.5154.1193.5298.7851.6124.6162.5256.0670.6158.6211.3323.4551.8
uncertainty *(%)11.010.812.615.720.417.625.524.422.218.931.125.918.213.918.920.324.317.525.019.8
* The uncertainty values are the lowest values across all fusion results.
Table 5. Spectral Consistency Evaluation on the CIA (LandSat-7) Dataset.
target3 December 20011 November 200116 October 20018 November 200124 November 2001
reference4 January 200212 February 200210 April 200217 April 200217 April 2002
metricSAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4
STARFM0.0830.2150.1940.2160.1840.4210.3480.6550.1330.4020.3390.6350.1270.2290.1900.1620.0850.3000.2700.593
Fit-FC0.0520.1800.1570.2180.1000.2690.2410.3370.1140.3260.2920.4540.0900.2800.2440.4360.0670.2380.2270.427
VIPSTF0.0690.1850.1670.2090.1300.2970.2740.4370.1270.3480.3160.5610.1000.3010.2760.5320.0820.2600.2600.526
FSDAF0.0700.1940.1740.1880.1660.3810.3160.6140.1350.4050.3380.6470.1100.3470.2890.6060.0840.2960.2630.585
SFSDAF0.0680.1970.1780.2040.1640.3790.3120.6100.1350.4040.3370.6260.1120.3500.2940.5900.0860.3000.2680.573
SPSTFM0.1100.2900.2550.2860.2490.5800.4770.6440.1900.5500.4450.8500.1520.4850.3980.9650.1180.4160.3550.904
EBSCDL0.0880.2420.2150.2420.2080.4910.3910.7390.1520.4680.3720.7720.1260.4070.3280.7280.1000.3470.2970.686
CSSF0.0930.2430.2170.2450.2180.4980.4020.7280.1570.4790.3840.8120.1300.4220.3400.7960.1030.3590.3080.750
BiaSTF0.0930.2370.2130.2440.2090.4710.3800.7210.1530.4580.3780.6970.1270.3990.3240.6850.1010.3440.2980.656
DMNet0.0750.2130.1830.2230.1860.4320.3530.6950.1430.4290.3530.7640.1130.3740.3030.7360.0910.3220.2830.716
EDCSTFN0.0830.2820.2250.2860.2430.5580.4270.6430.1690.4640.3770.8590.1310.4000.3220.8080.1080.3520.3060.809
GANSTFM0.0630.2150.1920.2990.1580.4020.3160.7640.1630.4280.3500.7350.1190.3570.2850.6890.0780.2900.2520.632
MOST0.0590.2200.1850.3020.1590.3750.3190.6170.1800.4560.3850.7830.1280.3610.3000.6370.0900.3150.2760.608
SSTSTF0.0580.2150.1930.3010.1440.3740.2950.5830.1380.4000.3260.5410.1020.3270.2660.5070.0720.2730.2560.469
Table 6. Average nSSIM/ndEdge/ndLBP evaluation on the CIA (LandSat-7) dataset.
target3 December 20011 November 200116 October 20018 November 200124 November 2001
reference4 January 200212 February 200210 April 200217 April 200217 April 2002
metricnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBP
STARFM0.2990.298−0.0360.6910.345−0.0400.7080.396−0.0640.1610.166−0.0770.6590.388−0.056
Fit−FC0.2760.094−0.0690.5910.172−0.0630.6580.299−0.0700.6290.292−0.0660.6070.315−0.061
VIPSTF0.2830.141−0.0680.7160.130−0.0670.7970.185−0.0690.7520.288−0.0630.7220.302−0.059
FSDAF0.2320.262−0.0780.6460.328−0.0790.7090.370−0.0850.6770.382−0.0800.6450.369−0.075
SFSDAF0.2570.296−0.0770.6420.345−0.0800.7040.383−0.0860.6750.412−0.0830.6470.397−0.078
SPSTFM0.2980.302−0.0640.7880.332−0.0590.8530.366−0.0690.8130.391−0.0650.7740.373−0.060
EBSCDL0.2880.298−0.0790.7150.323−0.0800.7470.359−0.0870.7160.385−0.0800.6840.365−0.076
CSSF0.2890.297−0.0520.7470.323−0.0480.7920.360−0.0510.7600.384−0.0450.7300.364−0.040
BiaSTF0.2940.297−0.0790.6970.323−0.0800.7270.360−0.0880.7030.385−0.0830.6710.366−0.079
DMNet0.2720.121−0.0930.7040.190−0.1000.7590.213−0.1060.7280.243−0.1000.7050.237−0.094
EDCSTFN0.3060.125−0.0940.7680.186−0.1020.7860.191−0.1090.7390.220−0.1030.7290.185−0.097
GANSTFM0.3400.052−0.0860.6950.181−0.0910.7950.263−0.0970.6940.268−0.0910.6570.209−0.085
MOST0.3160.079−0.0870.6630.213−0.0960.8390.249−0.1020.6780.259−0.0960.6150.238−0.090
SSTSTF0.3670.087−0.0830.6220.185−0.0900.6680.259−0.0940.6620.288−0.0880.6260.255−0.085
Table 7. RMSE evaluation on the LGC (LandSat-5) dataset.
target3 April 20052 March 200513 January 200529 January 2005
reference2 May 200426 November 200428 December 200428 December 2004
bandbluegreenredNIRbluegreenredNIRbluegreenredNIRbluegreenredNIR
mean702100412242094711995118523506369021005228663592510232421
CC0.5760.5350.5380.3620.5000.4630.4310.4520.7590.7810.8000.8060.7390.7440.7590.688
STARFM152.1180.5214.8331.3140.3183.8234.8436.5109.0124.2156.6343.7136.0168.1198.1441.6
Fit-FC171.1188.4203.2316.3166.4178.8219.3425.7116.6114.9143.8293.9140.4161.8192.2420.5
VIPSTF177.8207.4216.6319.2181.8207.4239.0414.7126.7134.1151.5301.8159.6189.7205.2412.8
FSDAF139.5165.3194.4319.3137.2183.3235.6412.7110.8131.4170.2320.7135.6168.2205.1422.6
SFSDAF143.8171.7200.7328.5137.9183.8235.6412.7105.0126.7163.0348.3131.8166.5199.8442.1
SPSTFM197.6248.1320.1543.4188.1263.4360.5632.8158.1168.8280.0460.9181.5208.1316.8638.0
EBSCDL151.4183.0222.1384.5144.8194.4249.4470.6123.2139.1172.8330.5144.1172.8204.0457.6
CSSF154.9188.6229.8392.6143.4194.6252.1455.1107.5154.6166.3326.0135.4172.7203.4446.6
BiaSTF144.8179.9222.2375.7148.0191.4242.9432.8101.4119.0160.8329.8125.5156.2195.6437.1
DMNet147.9187.7213.1322.8163.4206.1255.4438.2102.6126.7157.9302.4147.2183.4221.3438.2
EDCSTFN139.1151.2185.2328.8148.4150.1216.1442.1118.5140.6172.5363.2139.2187.1218.5454.7
GANSTFM122.6157.0173.8282.4110.2135.3196.8379.589.3109.2164.5337.6121.9157.3196.3426.7
MOST128.7151.7177.6297.3117.1137.5195.0366.895.6109.1177.8333.4125.5152.8191.7430.3
SSTSTF128.1155.1195.5294.6119.9143.2199.3366.897.6115.5171.3326.3134.2157.2188.5386.1
uncertainty *(%)13.010.910.69.512.210.413.110.110.89.211.39.513.112.214.011.3
* The uncertainty values are the lowest values across all fusion results.
Table 8. Spectral Consistency Evaluation on the LGC (LandSat-5) Dataset.
target3 April 20052 March 200513 January 200529 January 2005
reference2 May 200426 November 200428 December 200428 December 2004
metricSAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4
STARFM0.0640.1740.1710.2900.0680.2020.1900.2370.0630.1640.1480.1340.0650.2030.1860.152
Fit-FC0.0630.1690.1690.2710.0710.1950.1820.2230.0620.1430.1330.0960.0660.1940.1790.139
VIPSTF0.0680.1760.1800.3070.0790.1990.1960.2380.0680.1500.1440.1120.0730.1980.1930.149
FSDAF0.0610.1640.1590.2670.0690.1950.1870.2170.0590.1600.1520.1170.0640.1980.1860.144
SFSDAF0.0630.1690.1640.2760.0680.1950.1870.2180.0630.1670.1520.1270.0660.2030.1860.152
SPSTFM0.0980.2720.2560.5730.1050.2960.2800.5860.0730.2330.2260.2290.0840.2940.2680.326
EBSCDL0.0720.1920.1820.3200.0690.2170.2020.2670.0630.1640.1570.1200.0680.2100.1920.158
CSSF0.0740.1970.1880.3400.0720.2120.2010.2600.0610.1640.1600.1200.0690.2060.1900.154
BiaSTF0.0710.1890.1800.3110.0690.2030.1940.2220.0610.1590.1460.1120.0680.2000.1800.141
DMNet0.0670.1720.1720.3000.0770.2090.2030.2490.0670.1500.1440.1150.0710.2080.1990.164
EDCSTFN0.0570.1630.1530.3080.0650.1970.1750.2650.0770.1760.1620.1340.0800.2130.2010.182
GANSTFM0.0550.1470.1450.2160.0590.1710.1550.1770.0600.1620.1450.1030.0670.1960.1800.141
MOST0.0590.1520.1460.2190.0590.1670.1530.1650.0600.1620.1500.1020.0670.1960.1770.138
SSTSTF0.0570.1550.1520.2040.0540.1690.1560.1590.0550.1600.1480.0950.0620.1810.1710.114
Table 9. Average nSSIM/ndEdge/ndLBP evaluation on the LGC (LandSat-5) dataset.
target3 April 20052 March 200513 January 200529 January 2005
reference2 May 200426 November 200428 December 200428 December 2004
metricnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBP
STARFM0.3750.350−0.1160.3530.328−0.0880.2130.293−0.0780.2250.266−0.074
Fit−FC0.3660.276−0.1240.3560.250−0.1170.1880.210−0.1080.2260.255−0.101
VIPSTF0.4850.210−0.1170.4790.171−0.1150.2600.157−0.0970.2960.169−0.094
FSDAF0.3570.356−0.1360.3410.351−0.1250.1760.306−0.1140.2020.290−0.109
SFSDAF0.3620.385−0.1370.3460.304−0.1230.1890.300−0.1160.2060.273−0.110
SPSTFM0.5580.376−0.1060.5450.332−0.1100.2510.295−0.0950.2990.263−0.091
EBSCDL0.3900.372−0.1390.3670.310−0.1340.2040.281−0.1150.2170.245−0.111
CSSF0.4270.371−0.0900.3880.311−0.1000.1940.279−0.0790.2180.245−0.076
BiaSTF0.3870.372−0.1390.3540.311−0.1340.2030.278−0.1160.2130.242−0.112
DMNet0.4180.183−0.1670.3830.109−0.1510.2520.073−0.1380.2850.044−0.134
EDCSTFN0.3650.101−0.1750.3630.125−0.1580.2400.060−0.1430.2750.015−0.138
GANSTFM0.2920.106−0.1530.2730.167−0.1390.1810.198−0.1280.2130.170−0.123
MOST0.3170.107−0.1560.2860.139−0.1440.1900.170−0.1320.2200.142−0.127
SSTSTF0.2880.232−0.1520.2750.265−0.1380.1830.246−0.1240.2080.218−0.119
Table 10. RMSE evaluation on the L8JX (LandSat-8) dataset.
targetrow 41, 3 October 2018row 42, 3 October 2018row 43, 3 October 2018
referencerow 41, 1 November 2017row 42, 1 November 2017row 43, 1 November 2017
bandbluegreenredNIRbluegreenredNIRbluegreenredNIR
mean343.5570.3486.92820.9332.0557.6449.52773.4334.8558.1437.02900.7
CC0.7250.8050.7730.8330.5750.6790.7230.8290.4390.5600.6080.826
STARFM153.2163.8231.0388.3210.1217.0254.3380.4275.0277.6306.9414.1
Fit-FC148.4155.0215.8387.5208.6214.8247.2374.5273.7275.7302.0406.6
VIPSTF144.4152.2210.9374.5207.3213.2246.3363.9271.5272.8299.8398.3
FSDAF147.9159.2221.4376.9208.5214.6249.4380.2273.6276.3306.2419.1
SFSDAF149.5162.9226.6389.7209.2216.0251.5379.3274.2277.5306.8413.1
SPSTFM158.9167.0250.1367.6208.9214.4255.6358.2274.7277.7313.7405.7
EBSCDL153.5165.7234.9389.3210.0216.9254.4393.7274.2277.8308.1430.9
CSSF150.8162.2230.5378.1208.6215.2251.4386.1273.7277.3307.6429.2
BiaSTF150.3162.4224.8385.7210.4219.1254.0394.3274.2279.2309.3442.6
DMNet134.6145.1212.1925.7203.5218.4246.0638.5269.9275.5301.7672.2
EDCSTFN153.6144.3252.3829.3203.5214.4246.4619.3269.4271.1308.6625.6
GANSTFM150.1165.7237.7394.5207.6215.2257.2343.0269.7275.9310.4377.1
MOST144.5166.0231.5402.2204.6213.1252.6347.6267.4274.1308.4381.3
SSTSTF132.6151.1202.8389.6205.2215.6247.1387.4266.2270.7300.9420.1
uncertainty *18.3%10.9%18.4%10.5%14.3%10.8%15.7%9.1%17.3%10.9%17.6%10.0%
* The uncertainty values are the lowest values across all fusion results.
Table 11. Spectral consistency evaluation on the L8JX (LandSat-8) dataset.
targetrow 41, 3 October 2018row 42, 3 October 2018row 43, 3 October 2018
referencerow 41, 1 November 2017row 42, 1 November 2017row 43, 1 November 2017
metricSAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4
STARFM0.0580.2150.3300.2050.0430.2320.4040.2470.0430.2600.5040.286
Fit-FC0.0520.2100.3100.2240.0400.2280.3950.2800.0420.2560.4970.330
VIPSTF0.0500.2040.3040.2170.0360.2240.3930.2550.0380.2530.4930.305
FSDAF0.0530.2080.3180.2000.0390.2300.3980.2460.0440.2610.5020.295
SFSDAF0.0540.2140.3250.2170.0390.2310.4010.2560.0430.2600.5030.300
SPSTFM0.0670.2120.3500.1830.0400.2240.4030.2180.0450.2590.5110.262
EBSCDL0.0570.2160.3350.1980.0410.2370.4050.2420.0430.2660.5060.281
CSSF0.0550.2110.3280.1900.0380.2330.4010.2370.0410.2650.5050.280
BiaSTF0.0560.2120.3230.1890.0420.2370.4060.2430.0440.2700.5080.288
DMNet0.1140.4290.3470.2540.0560.3290.4110.2540.0610.3500.5080.291
EDCSTFN0.1210.3930.3740.2590.0570.3210.4080.2540.0610.3330.5100.298
GANSTFM0.0600.2190.3380.1980.0390.2200.4050.2170.0460.2490.5050.274
MOST0.0560.2200.3320.2040.0370.2200.3990.2210.0450.2500.5020.279
SSTSTF0.0480.2070.2960.2050.0380.2330.3960.2520.0460.2590.4930.288
Table 12. Average nSSIM/ndEdge/ndLBP evaluation on the L8JX (LandSat-8) dataset.
targetrow 41, 3 October 2018row 42, 3 October 2018row 43, 3 October 2018
referencerow 41, 1 November 2017row 42, 1 November 2017row 43, 1 November 2017
metricnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBP
STARFM0.1960.2040.0530.1510.1740.0450.1540.1830.050
Fit−FC0.1950.0760.0020.1500.055−0.0010.1530.055−0.000
VIPSTF0.1760.035−0.0010.1310.066−0.0030.1460.035−0.002
FSDAF0.1790.132−0.0030.1380.118−0.0040.1450.148−0.008
SFSDAF0.1970.153−0.0010.1480.110−0.0030.1520.122−0.002
SPSTFM0.2040.171−0.0020.1380.141−0.0040.1730.159−0.003
EBSCDL0.2060.168−0.0000.1600.138−0.0030.1610.154−0.003
CSSF0.1810.166−0.0010.1360.135−0.0030.1540.153−0.003
BiaSTF0.1630.169−0.0010.1390.139−0.0030.1600.155−0.003
DMNet0.203−0.000−0.0040.1430.052−0.0040.1680.113−0.004
EDCSTFN0.2290.029−0.0060.1380.032−0.0040.1700.106−0.006
GANSTFM0.2000.155−0.0040.1370.130−0.0030.1620.115−0.006
MOST0.1890.168−0.0030.1350.121−0.0040.1610.117−0.005
SSTSTF0.1220.103−0.0030.1200.098−0.0030.1330.102−0.005
Table 13. RMSE evaluation on the L7STARFM (LandSat-7) dataset.
target11 July 200112 August 200112 August 2001
reference24 May 200124 May 200111 July 2001
bandgreenredNIRgreenredNIRgreenredNIR
mean477.5354.42160.9400.0291.52030.9400.0291.52030.9
CC0.4870.5200.8230.8260.7740.8550.5280.6110.932
STARFM181.5199.3323.961.793.8276.9170.3174.2246.6
Fit-FC186.9196.7308.853.669.8261.473.881.9198.1
VIPSTF177.8187.7299.954.270.1254.176.686.1218.8
FSDAF180.6193.6317.558.086.5266.6157.2159.4236.9
SFSDAF186.4199.7333.059.485.4273.3161.3163.7258.2
SPSTFM164.7184.5345.964.7107.1307.4157.9161.3226.0
EBSCDL173.5190.6311.362.799.4264.5161.5163.7242.2
CSSF176.3193.9311.861.898.7262.3158.7160.7234.0
uncertainty *12.5%35.4%11.0%10.8%21.1%18.8%9.9%17.8%8.8%
* The uncertainty values are the lowest values across all fusion results.
Table 14. Spectral consistency evaluation on the L7STARFM (LandSat-7) dataset.
target11 July 200112 August 200112 August 2001
reference24 April 200124 April 200111 July 2001
metricSAMRASEERGASnQ4SAMRASEERGASnQ4SAMRASEERGASnQ4
STARFM0.0510.2440.4010.2650.0540.1900.2200.1520.0480.2210.4290.167
Fit-FC0.0480.2380.4010.2490.0460.1750.1750.1320.0370.1440.2020.078
VIPSTF0.0470.2290.3820.2290.0470.1710.1750.1190.0380.1570.2130.089
FSDAF0.0510.2390.3930.2580.0520.1820.2050.1430.0400.2070.3940.142
SFSDAF0.0520.2490.4060.2830.0530.1860.2050.1490.0440.2200.4060.166
SPSTFM0.0670.2460.3720.2950.0730.2110.2480.1970.0420.2030.3980.131
EBSCDL0.0540.2340.3840.2440.0580.1840.2290.1400.0440.2120.4050.143
CSSF0.0540.2360.3900.2500.0570.1830.2270.1400.0430.2070.3980.136
Table 15. Average nSSIM/ndEdge/ndLBP evaluation on the L7STARFM (LandSat-7) dataset.
target11 July 200112 August 200112 August 2001
reference24 April 200124 April 200111 July 2001
metricnSSIMndEdgendLBPnSSIMndEdgendLBPnSSIMndEdgendLBP
STARFM0.3130.2180.0030.3140.248−0.0100.2070.2350.008
Fit−FC0.2910.163−0.0540.2690.120−0.0760.1510.060−0.076
VIPSTF0.2660.165−0.0410.2420.162−0.0640.2450.006−0.062
FSDAF0.2850.185−0.0580.2750.201−0.0810.1080.178−0.083
SFSDAF0.3320.187−0.0610.2790.231−0.0820.1440.216−0.083
SPSTFM0.2870.1800.0100.2880.222−0.0460.0950.198−0.028
EBSCDL0.2840.180−0.0440.2860.222−0.0710.1210.197−0.065
CSSF0.2830.180−0.0240.2780.221−0.0480.1030.196−0.047
Table 16. Evaluation on the FY4ASRcolor dataset (reference is 5:30 and target is 6:30).
Metric
Band
RMSESAMRASEERGASnQ4nSSIMndEdgendLBP
BlueGreen–RedVNIR
STARFM2864.63065.12695.60.1570.3500.3560.0850.3650.4320.087
Fit−FC2750.32964.62578.20.1460.3360.3420.0800.3300.364−0.027
VIPSTF4435.24572.04125.80.1840.5320.5410.1870.6330.4280.001
FSDAF2909.63127.02714.70.1480.3550.3610.0890.3650.416−0.019
SFSDAF2735.62983.12519.50.1640.3340.3410.0780.2800.411−0.024
SPSTFM4434.94568.34109.40.1920.5320.5410.1870.6350.4280.008
EBSCDL4309.04456.24004.40.1770.5180.5260.1780.6010.428−0.022
CSSF3685.43610.63282.80.1820.4290.4370.1240.4520.446−0.004
BiaSTF2753.23009.72548.00.1630.3370.3430.0780.2060.420−0.011
DMNet2566.02775.42403.10.1540.3140.3200.0690.2250.272−0.039
EDCSTFN2039.02241.01894.00.1320.2510.2550.0440.1290.375−0.043
GANSTFM2043.32246.51901.20.1310.2510.2560.0450.1320.317−0.048
MOST1982.32177.01842.80.1290.2440.2480.0410.0980.046−0.044
SSTSTF2382.62565.22230.00.1370.2910.2960.0590.2180.173−0.043
Table 17. Evaluation on the FY4ASRcolor dataset (reference is 6:30 and target is 11:30).
Metric
Band
RMSESAMRASEERGASnQ4nSSIMndEdgendLBP
BlueGreen–RedVNIR
STARFM1913.62161.62159.10.3720.5810.5900.0680.7810.642−0.113
Fit−FC1230.11495.91623.70.2520.4070.4170.0330.4240.237−0.032
VIPSTF9217.29451.010261.00.5562.6962.7350.8870.9590.713−0.211
FSDAF1830.82116.41864.80.3580.5420.5470.0600.6310.508−0.075
SFSDAF1866.92013.12032.50.3620.5510.5580.0630.7590.645−0.103
SPSTFM8065.07849.57527.70.4272.1832.1920.8640.9740.7030.031
EBSCDL8793.39020.69797.10.5392.5732.6100.8510.9550.713−0.215
CSSF2485.92764.12588.60.2750.7300.7380.1090.8460.653−0.012
BiaSTF1838.52017.01710.10.2850.5190.5210.0550.6740.561−0.047
DMNet682.1693.3667.20.2950.1900.1910.0070.500−0.045−0.185
EDCSTFN653.1665.3621.30.2460.1810.1820.0060.265−0.007−0.170
GANSTFM613.7591.5548.30.2360.1630.1640.0050.097−0.006−0.218
MOST973.61166.91225.60.2530.3150.3210.0190.328−0.047−0.202
SSTSTF1188.11451.81580.00.3040.3950.4050.0310.5740.116−0.212
Table 18. Change detection for spatiotemporally fused images of the CIA dataset.
target9 November 200125 November 2001
reference18 April 200218 April 2002
metricIOUF1-scoreprecisionrecallOAIOUF1-scoreprecisionrecallOA
STARFM0.8000.8890.8800.8980.9450.6990.8230.9000.7580.926
Fit-FC0.2620.4160.6620.3030.7920.4020.5740.6800.4960.832
VIPSTF0.2770.4340.6120.3370.7850.5150.6800.8020.5900.873
FSDAF0.4420.6130.7930.5000.8460.6320.7750.8680.7000.907
SFSDAF0.5140.6790.7960.5920.8630.5810.7350.8080.6740.889
SPSTFM0.3060.4680.7310.3450.8090.4840.6520.8870.5160.875
EBSCDL0.4840.6520.8020.5500.8570.6060.7550.8470.6810.899
CSSF0.5300.6930.8580.5810.8740.6080.7560.8670.6700.901
BiaSTF0.4290.6000.7500.5010.8370.5510.7100.8250.6240.884
DMNet0.3300.4960.7260.3770.8130.4760.6450.8670.5140.871
EDCSTFN0.3600.5300.7430.4120.8210.5290.6920.7700.6280.872
GANSTFM0.1670.2860.4030.2210.7290.3820.5520.7460.4380.838
MOST0.2710.4260.6360.3200.7890.5340.6960.7840.6260.876
SSTSTF0.2110.3480.3740.3260.7020.3990.5710.6030.5420.814
Table 19. Change detection for spatiotemporally fused images of the LGC dataset.
target28 December 2004
reference13 January 2005
metricIOUF1-scoreprecisionrecallOA
STARFM0.8150.8980.9380.8610.938
Fit-FC0.8430.9150.9450.8870.948
VIPSTF0.8940.9440.9660.9230.966
FSDAF0.8880.9400.9640.9180.963
SFSDAF0.8270.9050.9540.8620.943
SPSTFM0.4250.5960.8120.4710.799
EBSCDL0.8330.9090.9530.8690.945
CSSF0.8120.8960.9550.8450.938
BiaSTF0.8870.9400.9440.9360.962
DMNet0.8750.9330.9320.9350.958
EDCSTFN0.8000.8890.9610.8260.935
GANSTFM0.8680.9290.9600.9000.957
MOST0.8070.8930.9170.8710.934
SSTSTF0.8210.9020.9070.8960.938
Table 20. Performance boundaries for tested Landsat datasets.
Dataset | Maximum | Minimum | Average | Band for Largest Errors | Band for Least Errors
CIA | 31.10% | 10.80% | 19.70% | red | green
LGC | 14.00% | 9.20% | 11.30% | red | green
L8JX | 18.40% | 9.10% | 13.70% | red | NIR
L7STARFM | 35.40% | 8.80% | 16.20% | red | NIR
Table 21. Relationship between best categories and training scales.
Dataset | Image Size | Rank First | Rank Second | Offline Training Pairs
CIA | 1408 × 1824 | weight | hybrid | 9
LGC | 3200 × 2720 | CNN | hybrid | 8
L8JX | 5792 × 5488 | CNN | hybrid | 0
L7STARFM | 1200 × 1200 | weight | dictionary | 0
FY4ASRcolor | 10,992 × 4368 | CNN | hybrid | 0
Table 22. Average scores of Fit-FC and FSDAF for the CIA (homogeneous) dataset.
Image | 1 | 2 | 3 | 4 | 5
Interval (days) | 32 | 103 | 176 | 160 | 144
Average CC | 0.829 | 0.418 | 0.237 | 0.341 | 0.365
Average RMSE (Fit-FC) | 156.7 | 187.7 | 206.7 | 176.6 | 206.6
Average RMSE (FSDAF) | 179.8 | 232.0 | 231.3 | 203.2 | 224.6
Note: Averages are from the values of the blue, green, and red bands.
Table 23. Average scores of Fit-FC and FSDAF for the LGC (heterogeneous) dataset.
Image | 1 | 2 | 3 | 4
Interval (days) | 336 | 96 | 16 | 32
Average CC | 0.550 | 0.465 | 0.780 | 0.747
Average RMSE (Fit-FC) | 187.6 | 188.2 | 125.1 | 164.8
Average RMSE (FSDAF) | 166.4 | 185.4 | 137.5 | 169.6
Note: Averages are from the values of the blue, green, and red bands.
