Comparison of Spatiotemporal Fusion Models: A Review

Simultaneously capturing spatial and temporal dynamics is always a challenge for the remote sensing community. Spatiotemporal fusion has gained wide interest in various applications for its superiority in integrating both fine spatial resolution and frequent temporal coverage. Though many advances have been made in spatiotemporal fusion model development and applications in the past decade, a unified comparison among existing fusion models is still limited. In this research, we classify the models into three categories: transformation-based, reconstruction-based, and learning-based models. The objective of this study is to (i) compare four fusion models (STARFM, ESTARFM, ISTAFM, and SPSTFM) under a one Landsat-MODIS (L-M) pair prediction mode and two L-M pair prediction mode using time-series datasets from the Coleambally irrigation area and Poyang Lake wetland; (ii) quantitatively assess prediction accuracy considering spatiotemporal comparability, landscape heterogeneity, and model parameter selection; and (iii) discuss the advantages and disadvantages of the three categories of spatiotemporal fusion models.


Introduction
The quantity of remotely sensed data acquired from satellite instruments has greatly increased, contributing to multi-source and multi-resolution image sequences at regional or global scales. However, given the tradeoff between spatial resolution and temporal revisiting cycles [1,2], there is so far no single satellite sensor that can produce images with both fine spatial and temporal resolutions. A comparison of existing sensors and their spatial and temporal resolutions is given in Table 1. For example, SPOT series and Landsat TM/ETM+ multispectral data with spatial resolutions from 6 to 30 m have proven useful in monitoring forest and ecosystem dynamics [3][4][5][6][7], land cover classifications [8,9], and land cover/use change detection [10,11]. However, long revisit frequencies (16 days for Landsat series and 26 days for SPOT), frequent cloud contamination [12], complex topographic effects, and equipment failure make it difficult to acquire sequenced remotely sensed data that target the same region, and have prevented us from applying it to rapid monitoring of disturbance and change detection [2]. In contrast, the NOAA Advanced Very High Resolution Radiometer (AVHRR), SPOT-Vegetation (SOPT-VGT), and Moderate Resolution Imaging Spectroradiometer (MODIS) provide global multispectral imagery of land surface at 1-2-day revisit frequencies [13]. However, their relatively low spatial resolutions, from 250 m to 1 km, are not sufficient for quantitative monitoring of landscape changes that occur at sub-pixel resolutions. A possible cost-effective approach is to generate synthetic data with both fine spatial resolution and temporal frequency by blending the multi-sensor spatial and temporal characteristics. It has aroused great interest within the remote sensing community [2,[14][15][16].
Image fusion can be generally divided into spatial and spectral fusion, and spatial and temporal fusion [17]. Spatiospectral fusion, or pan-sharpening, blends a lower spatial resolution multispectral image with a higher spatial resolution panchromatic image. Many spatiospectral fusion models have been developed and have matured during the past three decades. However, spatiospectral fusion methods are not efficient in enhancing the spatial resolution and temporal frequency simultaneously. Spatiotemporal fusion is a relatively new concept addressing this problem. Several spatiotemporal fusion models have been recently proposed. Based on the characteristics of the model framework and procedures for implementing the models, we classify them into three categories: (i) transformation-based; (ii) reconstruction-based; and (iii) learning-based models.
Transformation-based methods include wavelet and tasseled cap transformations [18,19]. Acerbi-Junior et al. [20] increased the spatial resolution of MODIS by integrating Landsat imagery using a three-level coupled wavelet decomposition scheme. Tasseled cap transformation [19] is widely used in detecting land cover change and phenology disturbances [6]. It has been regarded as a standard technique for spectral variation based on the three axes of brightness, greenness, and wetness. Hilker et al. [21] used a tasseled cap transformation of both Landsat TM/ETM+ and MODIS reflectance data to capture change information with a fine spatial resolution in a transformed space.
In reconstruction-based methods, synthetic fusion data are generated by a weighted sum of the spectrally similar neighboring information from fine spatial but coarse temporal resolution, and fine temporal but coarse spatial resolution data [2,15,21]. Gao et al. [2] proposed a spatial and temporal adaptive reflection fusion model (STARFM) to blend Landsat and MODIS imagery to generate daily synthetic Landsat-like data with 30 m spatial resolution. Several improved models have since been developed. Hilker et al. [21] presented a spatial and temporal adaptive algorithm for mapping reflectance change (STAARCH) to identify highly detailed spatiotemporal patterns in land cover change. STAARCH also produces synthetic Landsat-like images for each available date of MODIS data based on an extended STARFM [21]. However, the prediction accuracy of the STARFM and STARRCH models is sensitively correlated with landscape heterogeneity [15]. Based on STARFM, Zhu et al. [15] developed an enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), considering conversion coefficients, so that homogeneous and heterogeneous pixels have different conversion coefficients in the prediction [22]. Shen et al. [23] proposed an extended spatiotemporal method for reflectance blending prediction within the STARFM framework. It took sensor observation differences for varied land cover types into consideration. However, it requires a prior unsupervised classification for the fine spatial resolution data. A customized blending model was developed by Michishita et al. based on the ESTARFM. Reflectance of the moderate-resolution image pixels on the target dates can be predicted more accurately by the proposed customized model than the original ESTARFM [15]. In another branch of reconstruction-based methods, Hansen et al. [24] integrated Landsat and MODIS imagery on a 16-day repeat time to monitor forest cover change in the Congo Basin using a regression tree method. Roy et al. [25] proposed a semi-physical approach that integrates a bidirectional reflectance distribution function spectral model for Landsat gap-filling and relative normalization production. Zurita-Milla et al. [26] proposed an unmixing-based data fusion technique to reconstruct synthetic images with the spectral and temporal resolution provided by Medium Resolution Imaging Spectrometer (MERIS), but a Landsat-like spatial resolution. Zurita-Milla et al. [27] then applied a linear mixing model to a time series of MERIS images to produce synthetic fused images. Nonetheless, these proposed methods required a prior unsupervised classification for input high/medium spatial resolution remotely sensed images, or a high spatial resolution land use database as auxiliary materials for pixel unmixing.
In learning-based methods, compressive sensing and sparse representation have garnered wide interest in various fields in the last decade, especially in image processing. Yang et al. [28] presented a new super-resolution method to generate high-resolution images based on sparse representation. Huang and Song [29] developed a sparse representation based on a spatiotemporal reflectance fusion model (SPSTFM) to produce a synthetic prediction using both prior and posterior pairs of Landsat and MODIS images and one MODIS image on the prediction date. Song and Huang [30] further presented a spatiotemporal fusion model using one-pair image learning. This model was implemented in two stages: first, the spatial resolution of MODIS data on prior and posterior dates is improved through sparse representation; second, the observed Landsat and enhanced MODIS data were fused via a high-pass modulation [30].
Spatiotemporal fusion was originally designed for blending shortwave reflectance bandwidths from Landsat and MODIS data to produce daily Landsat-like surface reflectance [2,[31][32][33]. However, it appears to hold great utility and potential for interdisciplinary fields that require fine resolution data. Anderson et al. [33] used STARFM, infusing Landsat thermal infrared (TIR) with MODIS TIR, to get daily evapotranspiration mapping with the ALEXI, which demonstrated its reliable application in fine resolution evapotranspiration mapping. Watts et al. [34] used synthetic data derived from STARFM to improve the classification accuracy of conservation arable land and to produce high frequency data series compensating for degraded synthetic spectral values when classifying field-based tillage. Liu and Weng [35] applied STARFM to simulate a series of ASTER-like datasets to derive the urban variables of the normalized difference vegetation index (NDVI), normalized difference water index (NDWI), and land surface temperature (LST), and to quantitatively assess the effects of urban characteristics on West Nile Virus dissemination. Huang et al. [36], Weng et al. [31] and Wu et al. [32] also improved STARFM to accurately derive the LST. A complete summary of researches and their applications are provided in Appendix Table A1. Note: Spatial resolution: high *** (<5 m); medium ** (5-30 m); low * (>30 m); Temporal resolution: high *** (<3 days); medium ** (3-15 days); low * (>15 days).
Many advances have been made in the past decade in both algorithm development and practical application of spatiotemporal fusion. However, there have been limited studies making unified comparisons of the existing spatiotemporal fusion models. The commonly used statistical scores, such as the correlation coefficient (CC), root-mean-square error (RMSE), average absolute difference (AAD), and Quality Index [37], are affected and constrained by selection of the individual study site. If a study site has not been observed concurrently in the input Landsat-MODIS (L-M) image pairs, unresolved spatiotemporal variances will cause biases in the predictions. Emelyanova et al. [38] performed a definitive assessment of the prediction performance of STARFM and ESTARFM against spatial and temporal variances. Jarihani et al. [39] evaluated the accuracy of STARFM and ESTARFM through testing two "Index-then-Blend" and "Blend-then-Index" approaches, and provided an assessment of the order for data blending and index calculation. Landscape heterogeneity also influences the prediction performance. Although ESTARFM aims to predict surface reflectance accurately in heterogeneous regions, it has not yielded a standard criteria for landscape heterogeneity [40]. However, there is no unified comparison work that has been made. This research will compare existing spatiotemporal fusion models based on the performance of spatiotemporal comparability, landscape heterogeneity, and model parameter selection.
We compared three reconstruction-based models and one learning-based algorithm using two time-series datasets, the Coleambally irrigation area (CIA) in Australia, and the Poyang Lake wetland in China. The transformation-based approach cannot lend itself directly to spatiotemporal data fusion without combining it with another blending framework. Therefore, we did not include this category in the comparison work. The primary objectives of this study are to (i) compare the performance of the four spatiotemporal fusion models under two prediction modes; (ii) quantitatively assess the prediction accuracy based on spatiotemporal comparability, landscape heterogeneity, and model parameter selection; and (iii) summarize the advantages and weaknesses of the existing models.
The remainder of this review is organized as follows: Section 2 describes the materials and methods used in this study. The assessment is provided in Section 3. Section 4 discusses the results, and major findings are concluded in Section 5.

Study Site Description and Data Preparation
Two study sites were selected in this research ( Figure 1). The CIA was chosen as the first validation site. The CIA datasets were provided by the Commonwealth Scientific and Industrial Research Organization, Australia. The CIA is a rice irrigation field located in southern New South Wales, Australia (145°04′E, 34°00′S). The site has been extensively used for time-series remote sensing research [38][39][40][41][42]. We used 17 cloud-free L-M pairs over the CIA during the austral summer growing season in 2001-2002. They are the same time-series L-M datasets as used by Emelyanova et al. [38] and Jarihani et al. [39]. Due to the existing gaps with null values, we selected the main irrigation area of 625 km 2 (1000 rows by 1000 columns at 25 m resolution). The images were acquired by Landsat-7 ETM+ and atmospherically corrected using MODTRAN4 [43]. The CIA is located entirely in the east-west overlap of two adjacent paths in the Landsat World Reference-2 system (paths/rows 92/84 and 93/84), which allows for an 8-day repeat cycle [38,42]. The corresponding MODIS Terra MOD09GA Collection data were resampled to 25 m resolution using a nearest neighbor algorithm to match the Landsat data resolution after a geometrical transformation. Due to the dimensionality of remotely sensed data [44] and the computation cost of processing all bands, we selected the Landsat red wavelength band (B3), near infrared wavelength band (B4), and mid-infrared I wavelength band (B5). They comprised sufficiently rich information. The corresponding bands for the MODIS imagery were bands 1, 2, and 6.
Poyang Lake, the largest freshwater lake in China, was chosen as the second testing site. It has fluctuating water levels throughout the year. Between March and June, water flows into the lake from five surrounding rivers. It reaches its peak level between July and September, due to the high precipitation in the summer and backflow flood from the Yangtze River. Between October and November, the water subsides, and vast areas covered with wetland vegetation emerge. From December to February, the water level decreases significantly and several small disconnected lakes are visible. To ensure monitoring rapid and significant phenological changes in both the spatial and temporal domains, we specifically chose the southeastern part (116°37′E, 28°33′N) of the water body, because the surface reflectance of the area has been reported to vary significantly throughout the year [45]. Ten cloud-free L-M pairs were available in 2004. The Poyang Lake site covers 3600 km 2 (2000 rows by 2000 columns at 30 m resolution). All of the Landsat images were acquired by Landsat-5. The digital numbers from the Landsat level 1 product were calibrated and atmospherically corrected using fast line-of-sight atmospheric analysis of hypercubes (FLAASH) [46]. The acquired MODIS daily surface reflectance (MOD09GA) data were reprojected and resampled to the Landsat resolution and extent. As FLAASH uses a similar 6S (Second Simulation of the Satellite Signal in the Solar Spectrum) [47] atmospheric correction approach to the MODIS surface reflectance product, the two sensors' reflectance were consistent and comparable [7]. Dates of the acquired L-M pairs for the Poyang Lake and CIA sites are given in Table 2. For Poyang Lake, less cloud-cover MODIS data on the closest dates were substituted when uncontaminated MODIS images were unavailable on the targeted dates. For the CIA site, the acquired L-M pair dates were well correlated. Table 2. Dates of the acquired L-M pairs for the Poyang Lake wetland (PLW) and CIA sites. For the Poyang Lake site, less cloud-cover MODIS data on the closest dates were substituted when uncontaminated MODIS images were unavailable on the targeted dates. For the CIA site, the acquired L-M pair dates were well correlated.

STARFM
STARFM [2] needs at least one pair of fine-and coarse-resolution data on the prior or posterior date and one set of coarse-resolution data on the predicted date. It predicts the surface reflectance using a combined weight function, incorporating spectral information from both the fine-and coarse-resolution data. Its implementation is as follows.
(i). One fine-resolution image is used to select candidate similar neighbor pixels using a threshold method. The threshold is determined by the standard deviation of the fine-resolution images and the estimated number of land-cover types.
(ii). Sample filtering is applied to remove poor quality observations from the candidates, which introduces constraint functions to ensure the quality of the selected similar pixels. (iii). The weights corresponding to each similar pixel are computed with a combined function using the spectral difference, temporal difference, and distance difference. (iv). The final surface reflectance on the targeted date is predicted with the incorporation of the fine-and coarse-resolution data through the proposed weight function.

ESTARFM
ESTARFM [15] needs at least two pairs of fine-and coarse-resolution data on prior and posterior dates and one coarse-resolution data on the predicted date. It predicts the surface reflectance of the targeted date using a linear combination of spectral information from both the fine-and coarse-resolution data based on the concept of spectral unmixing, incorporating a conversion coefficient and a weight coefficient. Its implementation is as follows.
(i). Similar neighbor pixels are selected from the fine-resolution data on both the prior and posterior dates using the same threshold method as STARFM. The final set of similar pixels is determined by an intersection operation of the results derived from the individual selection in the initial step. (ii). The weights for all of the similar pixels are determined by the spectral similarity and geographic distance between the targeted pixel and similar pixels. (iii). The conversion coefficients are computed from the surface reflectance of the fine-and coarse-resolution data during the observation period using a linear regression. (iv). The two transition images on the targeted date are predicted using the combined function of the fine-and coarse-resolution data and the weight and conversion coefficients. (v). The final fine-resolution prediction is calculated by incorporating the two transition images in step (iv) through a weight function, which depends on the spectral difference of the coarse-resolution data on the base date and the predicted date.

ISTARFM
ISTARFM [48] provides two prediction modes, in which one or two pairs of base L-M images are used in the blending process. Its implementation is as follows.
(i). Adaptively choose blending modes. ISTARFM first performs a choice for prediction modes according to the number of input L-M pairs within a time-window. (ii). Similar neighboring pixels are selected from the fine-resolution data through local rules with a logistic constraint function. For one-pair prediction mode, the final similar pixels are retrieved from its individual selection; for multi-pair prediction mode, the final set of similar pixels is retrieved by an intersection operation on the results derived from the individual selection. (iii). The weights for all similar pixels are determined by four factors: fine-coarse resolution data difference, spectral similarity for fine-resolution data, selective temporal difference and spatial difference. (iv). The final fine-resolution prediction is calculated by incorporating observed fine-and coarse-resolution data through a weight function in step (iii).

SPSTFM
SPSTFM [29] also requires two pairs of fine-and coarse-resolution data. It predicts fine-resolution reflectance by establishing the correspondence of structures between the fine-and coarse-resolution images via sparse representation. Its implementation is as follows.
(i). High-frequency patches are extracted for dictionary learning. The difference images of the fine-and coarse-resolution data on the prior and posterior dates are extracted for jointly training two dictionaries of high-frequency feature patches. (ii). Dictionary-pair learning is conducted with the two input difference images using an optimization equation under the theoretical basis of sparse representation and sparse coding. The optimal solution to obtain the best dictionary sets Dl and Dm is K-SVD [49]. (iii). The fine-resolution patches are reconstructed using the enforced same sparse coefficient and the dictionary set Dl, after obtaining the sparse coefficient of the coarse-resolution patch with respect to the dictionary set Dm. (iv). The fine-resolution reflectance is predicted. Considering the heterogeneity of local changes in actual remote sensing images, the general reconstruction is converted from the image scale to the patch scale using different local weights. The NDVI and normalized difference built-up index (NDBI) are also taken into consideration to measure the change information.

Comparison Type Setting
As each algorithm needs a different number of base L-M pairs, we divided the prediction patterns into a one L-M pair prediction mode and a two L-M pair prediction mode. Under the given comparison type, we performed blending tests using STARFM (STARFM-One) and ISTARFM (ISTARFM-One) with one L-M pair. All of the models were also used with two L-M pairs (STARFM-Two, ESTARFM, ISTARFM-Two, and SPSTFM). The two prediction modes were conducted on both study sites by producing a Landsat-like image on the targeted date. Specifically, we used one prior closest L-M pair for the one L-M pair prediction, and used one prior and one posterior images that were nearest temporal neighbors to the targeted date for the two L-M pairs prediction. The corresponding actual Landsat observation was also required for validation.

Quantifying Spatiotemporal Comparability
Due to the compatibility of satellite transit and sensor bandwidth for both onboard Landsat and MODIS, they have similar orbital parameters, such as near-nadir viewing and solar geometries [2,15]. In previous blending applications, there is always an assumption that the Landsat and MODIS data acquired at a given site on the same date will be spatially and temporally comparable after radiometric calibration, geometric referencing, and atmospheric correction [2,7,15]. However, how much the spatiotemporal comparability between the input Landsat and MODIS images affects final prediction accuracy has not yet been addressed. Therefore, we calculated the correlation coefficient of each selected band between the Landsat and MODIS images to denote the spatiotemporal comparability of input L-M pairs.

Quantifying Landscape Heterogeneity
Study site heterogeneity greatly affects spatiotemporal blending results [2,15]. Characterizing the sensitivity between the landscape heterogeneity and prediction performance requires a robust quantitative metric. We used our newly proposed landscape heterogeneity index (LHI) [40], to quantify the landscape heterogeneity and time-series variances at the two study sites. The newly presented LHI considers the individual pattern of both horizontal and vertical textures of landscape, and employs two threshold strategies to detect whether the neighboring ground pixels differ from each other significantly in both horizontal and vertical directions [40].

Assessing Prediction Accuracy
The models' prediction performance was quantitatively evaluated with representative metrics and direct visual inspection. The CC was used to measure correlation between the predicted and actual reflectance. The AAD between the predicted and actual reflectance was calculated to verify the deviation between the simulations and observations. The RMSE and peak signal to noise ratio (PSNR), which are widely used in the quantitative assessment of image quality, were chosen to reflect the overall bias between the simulated and observed reflectance. The Kling-Gupta efficiency (KGE) [50] was used as a compound measure to evaluate the model performance. The KGE accounts for the correlation, variability, and bias, and incorporates these measures into a single multi-objective index. The formula is given below: where ED denotes the Euclidian distance from the ideal point, r is the CC between the predicted and observed reflectance, α and β denote a ratio of the relative deviation and mean variability of the predicted and observed reflectance, σp and σo denote the standard deviation of the predicted and observed reflectance, respectively, and μp and μo denote the mean value of the predicted and observed reflectance, respectively. Thus, the ideal KGE equals 1.
The performance for each prediction mode was compared. Selected comparisons between the prediction modes were also made to validate how the number of L-M pairs affected the prediction results.

Spatiotemporal Comparability
The spatiotemporal comparability of L-M pairs acquired on each available date for both the CIA and Poyang Lake site is shown in Figure 2.   Figure 2b shows that B4 and B5 were more comparable than B3. However, the position contrast of selected bands in Figure 3 shows that B3 has the highest level of overlapped bandwidth with respect to its corresponding L-M bands, B4 has the second highest, and B5 has the lowest. Besides the geometric difference mentioned in Section 2.4, the critical factor that impacts spatiotemporal comparability is the spectral response associated with various landscapes. Since the CIA is a rice irrigation field and B4 is most sensitive to green vegetation, the high variance of surface reflectance in B4 results in its lower comparability than the other two bands in the CIA site. Due to the vast water body that is predominant in the Poyang Lake site, the surface reflectance of water in the near-infrared and mid-infrared band is much lower than that in the visible bands. Moreover, the chlorophyll content in the aquaria environment results in more variability of reflectance in the visible bands. Figure 4 shows that the selected ETM+ scenes corresponded to significant changes in the landscape heterogeneity for both sites. For example, significant seasonal phenology changes at the CIA site occurred from 9 November 2001 to 13 February 2002, but the overall landscape distribution was stable. From 13 February to 27 April 2002, we can clearly see that the land cover changed and became more heterogeneous, being consistent with the computed LHI changes. The changes in the water area at the Poyang Lake site dominated the landscape heterogeneity variance.  in (a,b), for the CIA site (c) and the Poyang Lake site (d). the CIA and Poyang Lake sites, respectively. The NDVI difference images between the prior and predicted dates, and between the predicted and posterior dates, at the Landsat resolution were also shown in Figures 5 and 6. A visual comparison of each prediction and its corresponding observed Landsat data could be made. All of the predictions (Figures 5a-f and 6a-f) using the selected blending models captured the general changing information during the prediction period seen in the actual observations (Figures 5g and 6g). This demonstrates the possibility and utility of these spatiotemporal blending applications. (e,f) Blended images under the one L-M pair prediction mode using STARFM-One and ISTARFM-One; (g) Observed ETM+ image; The NDVI difference images (h) between the prior (T1) and predicted (T2) date, and (i) between the predicted (T2) and posterior (T3) dates, respectively, in which darker regions represent smaller changes and lighter regions denote larger changes. (j)The NDVI difference contrast of (h) and (i).     The scale factor of reflectance is 10,000, which was also used for the quantitative assessment.

Accuracy Assessment
Many previous blending studies have performed accuracy assessments using tables and scatter plots of actual observations versus predicted Landsat-like data, such as those in Figures 7 and 8. We made time-series predictions for the three selected bands at two sites under two prediction modes. This produces 426 possible sets of assessment: 276 for the two L-M pair prediction mode (180 for the CIA site and 96 for the Poyang Lake site) and 150 for the one L-M pair prediction mode (96 for the CIA site and 54 for the Poyang Lake site). We employed a "curve" visualization to compare the prediction accuracy of selected bands of different fusion models on time-series date sequences. The spatiotemporal fusion models under the two L-M pair prediction mode were depicted using solid lines with different colors, while the models under the one L-M pair prediction mode were denoted using dashed lines. Each assessment measure shown in separated rows, i.e., CC, AAD, RMSE, and PSNR, was displayed up to bottom, respectively. The larger CC and PSNR denote higher correlations between the observed and predicted reflectance. The smaller AAD and RMSE denote minor bias. Therefore, the four measures can be considered at the same time. Figure 9 depicts time-series changes of selected measures with respect to the four blending models for the CIA site. The CC curve chart (Figure 9a-c) shows the relative stability and superiority of ESTARFM and ISTARFM-Two over the other models. We can draw the same conclusion from the AAD curve chart (Figure 9d-f); however, larger biases can be seen for SPSTFM and ISTARFM-One, especially for the short-wave near-infrared band (B5). Correspondingly, Figure 10 depicts time-series changes of the selected measures with respect to the four blending models for the Poyang Lake site. The four types of curve charts ( Figure 10) show that ISTARFM-Two outperforms the other models, and ESTARFM ranks at the second place. However, larger biases can be seen for STARFM-One during the period from date #3 to #5 from the curve chart ( Figure 10). SPSTFM also produces large biases during the period from date #3 to #4. Figure 11a-c (the left column) depict the time-series variability of the KGE for B3, B4, and B5 at the CIA site, respectively, and Figure 11d-f (the right column) represent the KGE variability of the same bands at the Poyang Lake site, respectively. It has shown that one model may yield different accuracy for individual band. The ISTARFM-Two and ESTARFM perform more stably for blending.

Selected Blending Models Performance
The objective of this research was to evaluate spatiotemporal blending models under two prediction modes using L-M data at the CIA and Poyang Lake sites. The selected blending models had minimum input requirements and no ancillary data were used, such as land-cover classification result or phenology timetables.
More input L-M pairs did not always ensure higher prediction accuracy because the phenology predicted for the Poyang Lake site using only one prior L-M pair was closer to the actual observations than that using two L-M pairs, based on both visual and quantitative comparisons. We took the blending model using two L-M pairs as an example, assuming that Lpm and Lpn represent the transient Landsat-like data at the date tp, predicted from the input L-M pair at tm and tn, respectively. The synthetic prediction of the Landsat-like data at tp can be obtained by the compound weighting function: Consequently, the range of Lp should satisfy min , max , where and denote the weights of the transient predictions and , respectively min • and max • denote the minimizing and maximizing operations. However, when one transient prediction shows a large bias due to a large spectral contrast between the base and predicted dates, the final prediction deviation will be large, according to Equation (5). Therefore, we could use the idea proposed in [48] that L-M pair pre-selection based on the CC of coarse-resolution data between the base and predicted dates should be performed when more than two L-M pairs exist.
For this purpose, we tested STARFM-One and ESTARFM on the CIA site, with base dates preselected using fewer MODIS images for the predicted date.  Table 3 shows that the CC between the base date and predicted date image with the order from small to large is #1, #2, and #3 for the STARFM-One model test. For the ESTARFM model test, the corresponding order is #7, #8, #6, and #5. With a comparison between Table 3 and the quantitative assessment results (Tables 4 and 5), it reveals that the model performance is consistent with the CC value of MODIS images between the base and predicted dates. Taking STARFM-One with B3 for instance, the CC value increases from 0.42 to 0.58, and reaches 0.79, while the KGE criteria between the predicted and observed reflectance correspondingly increases from 0.50 to 0.65, and reaches 0.86. The same conclusion can be verified from another measure (i.e., CC). For the ESTARFM model test, with the CC decreased from 0.76 to 0.39 with B3, 0.75 to 0.10 with B4, and 0.89 to 0.73 with B5, its model performance saw a slight decrease rather than producing large bias.  Consequently, the CC value of coarse-resolution data between base and predicted dates should be an important reference for selecting input L-M pairs when more than two L-M pairs exist, especially for the STARFM model.

Model Parameter Selection
Spatiotemporal blending results are sensitive to prerequisites from both input data and parameter setting. Two major types of parameter differences were considered in our study: (i) the spatiotemporal comparability of the input L-M pairs, the landscape heterogeneity and the spatiotemporal variances of study sites; and (ii) preset parameters, such as the moving window size for reconstruction-based methods and the dictionary/patch size for learning-based methods.
The spatiotemporal comparability of the L-M pair on the same date may not be the critical factor impacting prediction accuracy. Figure 2b (Poyang Lake) shows overall higher spatiotemporal comparability than Figure 2a (CIA). However, the KGE criteria in Figure 11 show that overall prediction accuracy at the Poyang Lake site (the right column) is not obviously higher than that at the CIA site (the left column), even though the KGE value with B3 at the Poyang Lake site (Figure 11b) is lower than that at the CIA site. On the other hand, for the same study site, spatiotemporal comparability should also be regarded as an optional reference. As mentioned in Section 3.1, B4 and B5 are more comparable than B3 at the Poyang Lake site from Figure 2b. It is interesting to note that the KGE result with respect to the Poyang Lake site shows that B3 produces larger bias than the other two bands. Based on this aspect, spatiotemporal comparability seems to have the utility to account for prediction accuracy difference regarding band spectrums.
Landscape heterogeneity can also affect the predicted Landsat-like image produced. Figure 3c shows a significant increase in the LHI for the CIA site from 10 March 2002 (#11), which corresponded to a decrease in the prediction accuracy of all of the blending models in Figure 9a-c. However, by comparing the LHI and KGE measures, we could conclude that ESTARFM was superior to the other blending models, due to the use of the conversion coefficient. Quantitatively, ESTARFM produced acceptable predictions when the LHI was below 0.65.
Spatial and temporal variances for landscape are strongly associated with performance of the spatiotemporal fusion model [38,39]. Similar to the procedure performed by Emelyanova et al. [38], we partitioned the overall variance of the L-M datasets for the CIA and Poyang Lake sites into individual spatial and temporal variances using the approach proposed by Sun et al. [51], aiming to analyze all models' performance based on the spatial and temporal variances.
The curves in Figure 12 show that the spatial and temporal variances at both the Landsat and MODIS resolutions have consistent changing trends. Greater spatiotemporal variance at the Landsat resolution than at the MODIS resolution was witnessed. Obvious differences between the bands can be seen in Figure 12, mainly due to the different spectral response associated with different landscape. The spatial variance of B5 at the CIA site was largest throughout the study period, due to the sharp spectral contrast between the irrigated fields and the fallow fields, dry land pastures, and woodlands [38]. Between acquisition dates 2 and 6, the spatial variance of B5 was relatively high, whereas the spatial variance of B3 decreased. All three bands had low temporal variance at the CIA site, except for an obvious fluctuation between acquisition dates 3 and 7. The spatial variance of B4 at the Poyang Lake site was largest due to the spectral response of the water areas and the subsequent vegetation growth. The fluctuating water levels throughout the year resulted in a large spatial rather than temporal variance. Particularly between acquisition dates 2 and 5, the spatial variance of B4 and B5 stayed high, whereas the temporal variance stayed low.
The partitioned spatial variance was larger than the temporal variance at both the CIA and Poyang sites, especially for the Poyang Lake site that the spatial variance was highly dominant due to its fluctuating water level throughout the year. The quantitative measures showed that ESTARFM produced smaller errors than STARFM-Two at most dates for both sites. It supported the conclusion reached by Emelyanova et al. [38] that ESTARM was superior when spatial variance was dominant. As ISTARFM was developed based on the STAFRM-like framework, it worked better when temporal variances was dominant. However, ISTARFM could perform better than STARFM in predicting situation where significant spatial variance occurred, for its combination with a time-window and pre-selection of input L-M pairs [48]. SPSTFM does not seem to be sensitive to land cover spatiotemporal variance, since its prediction framework was based on dictionary learning. From the KGE assessment results in Figure 11, ESTARFM and ISTARFM-Two produced relatively more stable blending accuracies than the other models. One model could also produce different blending accuracies for each band, due to the different spectral responses to the ground surface.
A comparison of selected blending models should be performed under a unified framework, especially when setting autologous model parameters such as the moving window size of reconstruction-based methods. The larger the moving window size, the more spectral and texture information from neighboring pixels will be introduced into the estimated reflectance of the central pixel. The computation will also increase exponentially. We selected three L-M pairs (#7, #8, and #9 in Table 2) from the CIA site as a case study to validate these parameters. The effect of the moving window size on predictions was analyzed with respect to ESTARFM. The size was sampled from 5 × 5, 15 × 15, …, 55 × 55. The dictionary and patch size are also trade-off factors for learning-based methods. The patch size directly affects the spectral and texture information contained in each sampled patch. We tested the patch size using seven patch sizes (2 × 2, 3 × 3, …, 8 × 8), holding the dictionary size at 512. We then analyzed how the dictionary size affected the blending accuracy by testing five dictionary sizes (64, 128, 256, 512, and 1024), holding the patch size at 4 × 4. We evaluated the KGE and computation cost in both tests, and a Windows PC with 3.40-GHz Intel Core 5 CPU and 8 GB RAM was employed as processor in this study.   Figure 13a shows that ESTARFM produced a better KGE as the window size increased with an interval step of 10. The KGE then decreased after reaching its maximum. Figure 13b shows that the performance of SPSTFM with B3 and B4 was kept stable as the patch size increased while the SPSTFM performance with B5 improved as patch size increased. The model performance first improved as the dictionary size increased, then fluctuated. The increase in the parameter sizes led to an increase in the computation time, especially for the moving window size and dictionary size (Table 6). However, the larger computation costs with larger window size and dictionary size did not ensure a continuous increase in the prediction accuracy. The trade-off between accuracy and computation and optimal parameter setting for blending models are two key areas for any spatiotemporal fusion procedure.

Problems with Existing Blending Models
Transformation-based methods mainly focus on the integration of spatial and spectral information for image enhancement. However, they do not construct a distinct blending relationship between spatial and temporal information. Acerbi-Junior et al. [20] attempted to enhance the spatial resolution of MODIS with Landsat data using a wavelet transformation. They performed the spatiotemporal information enhancement of the L-M data on the base date and could not predict the synthetic Landsat-like data on a targeted date. However, when considering multi-information fusion including spectral details, the transformation-based method is recommended, either alone or combined with other blending frameworks.
Reconstruction-based methods have gained notice since the proposal of STARFM. The spatial and temporal adaptive fusion framework provides us with an excellent fusion approach for blending data that have a high spatial resolution but low temporal resolution with data that have a high temporal resolution but low spatial resolution. It has proven useful in dynamic monitoring and phenology disturbance detection over short periods or not significantly changeable landscapes. The biggest obstacle for the development of reconstruction-based methods is how to deal with the assumption that "the land cover type and sensor calibration do not change between the prior and predicted date" [2,15,48].
The learning-based method is a recent development. Since the rise of the concept of compressed sensing, sparse representation technology has been widely used in image processing, such as image compression, image restoration, and super-resolution image construction. Although the learning-based method is a good approach for implementing the production and servitization of data fusion, it has limitations. The selection of learning samples and the design of over-complete dictionaries needs more research attention to develop methods for improving the capturing of structural and textural information while preserving details. Further, learning-based methods can handle both temporal reflectance changes, phenology changes (e.g., seasonal changes in vegetation), and type changes (e.g., the conversion of farmland to built-up areas), but which prediction types are best suited has not yet been tested.

Conclusions
We compared four spatiotemporal blending models, ESTARFM, STARFM, ISTARFM, and SPSTFM, in two prediction modes using L-M datasets at the CIA and Poyang Lake sites. Four commonly used measures, CC, AAD, RMSE, and PNSR, and a compound assessment measure, KGE, were used to evaluate the models' performance. The results showed that the four selected models       Note: SPFMOL* denotes the spatiotemporal fusion model through one image pair learning in [30]; R 2 denotes R-Square; AD denotes absolute difference; AAD denotes average absolute difference; CC denotes correlation coefficient; STD denotes standard deviation; RMSE denotes root-mean-square-error; VOE denotes the variance of errors; ERGAS denotes erreur relative global adimensionnelle de synthèse; SSIM denotes structural similarity.