An Enhanced Linear Spatio-Temporal Fusion Method for Blending Landsat and MODIS Data to Synthesize Landsat-Like Imagery

Landsat and MODIS data have been widely utilized in many remote sensing applications, however, the trade-off between the spatial resolution and temporal frequency has limited their capacities in monitoring detailed spatio-temporal dynamics. Spatio-temporal fusion methods based on a linear model that considers the differences between fineand coarse-spatial-resolution images as linear can effectively solve this trade-off problem, yet the existing linear fusion methods either regard the coefficients of the linear model as constants or have adopted regression methods to calculate the coefficients, both of which may introduce some errors in the fusion process. In this paper, we proposed an enhanced linear spatio-temporal fusion method (ELSTFM) to improve the data fusion accuracy. In the ELSTFM, it is not necessary to calculate the slope of the linear model, and the intercept, which can be deemed as the residual caused by systematic biases, is calculated based on spectral unmixing theory. Additionally, spectrally similar pixels in a given fine-spatial-resolution pixel’s neighborhood and their corresponding weights were used in the proposed method to mitigate block effects. Landsat-7/ETM+ and 8-day composite MODIS reflectance data covering two study sites with heterogeneous and homogenous landscapes were selected to validate the proposed method. Compared to three other typical spatio-temporal fusion methods visually and quantitatively, the predicted images obtained from ELSTFM could acquire better results for the two selected study sites. Furthermore, the resampling methods used to resample MODIS to the same spatial resolution of Landsat could slightly, but did not significantly influence the fusion accuracy, and the distributions of slopes of different bands for the two study sites could all be deemed as normal distributions with a mean value close to 1. The performance of ELSTFM depends on the accuracy of residual calculation at fine-resolution and large landscape changes may influence the fusion accuracy.


Introduction
Landsat data at 30 m resolution have been widely applied for ecological system variation studies [1], environmental monitoring [2], and detailed land use and land cover mapping [3,4] because of their rich archives and free availability (more than four decades from the launch of Landsat-1 in 1972 to Landsat-8 with the OLI/TIRS imager launched on 11 February 2013).However, their 16-day revisit-cycles, even longer because of frequent cloud contamination and other inferior atmospheric conditions, limit their applications in dynamics monitoring.Leckie [5] found that for a given temporal coverage, the probability of obtaining Landsat images with less than 10% cloud cover could be as low the coefficients of linear model; (2) spatial filtering was used to deal with the blocky artifacts problem created in the first step; and (3) residual compensation was used to distribute the residuals generated in the first step.
Alternatively, the unmixing-based model is also a significant component in spatio-temporal fusion.Zurita-Milla et al. [24] used the unmixing-based method to blend TM and medium resolution imaging spectrometer (MERIS) data.Generally, the land-use databases on the study areas are necessary for the unmixing-based model, however, the land-use databases are not always available or have not been updated for specific regions, hence, some unmixing-based methods use unsupervised classification methods to generate classification maps.Based on land-use databases or classification maps, the abundances of all classes in a given coarse-spatial-resolution pixel and its neighboring pixels can be obtained.By using the least square method, the spectral values of endmembers (i.e., fine-spatial-resolution pixels) located in the central coarse-spatial-resolution pixels can be calculated.Some improved unmixing-based methods have been proposed to enhance the fusion accuracy.Instead of unmixing the coarse-spatial-resolution pixels, Gevaert and Garcia-Haro [25] estimated the changes of endmembers by directly unmixing the changes of coarse-spatial-resolution pixels and Bayesian theory was used to constrain the estimation to avoid the outliers.Xie et al. [26] first used unmixing-based methods to downscale the two coarse-spatial-resolution images at the reference and prediction dates, and then based on the downscaled coarse-spatial-resolution images, the STARFM was employed to synthesize the fine-spatial-resolution image.Wu et al. [27] proposed the spatial temporal data fusion approach (STDFA), which regarded the temporal variation properties of pixels belonging to the identical class as unchanged.Then, Wu et al. [28] proposed an improved version of STDFA, which took the spatial-variability adjustment and sensor-difference adjustment into account.Zhu et al. [29] first used the "purest" coarse-spatial-resolution pixels to calculate the temporal changes of endmembers by using the unmixing method.Then, the errors from homogeneous and heterogeneous regions were considered separately and through the homogeneity index, these two types of errors were combined to modify the temporal changes obtained in the first step.Finally, the neighboring spectrally similar pixels were used to mitigate block effects.We refer to this method as FSDAF in this paper.By introducing the temporal changes modification, FSDAF could better predict the areas where land cover type changes between the reference and prediction dates.The unmixing-based methods have also been used in many applications such as NDVI series generating [30,31] and crop monitoring [32].Additionally, some learning-based methods based on the mechanism of machine learning have nowadays been proposed including mainly sparse representation [33,34], extreme learning machine [35], and deep learning [36].
All of the above spatio-temporal fusion approaches can obtain accurate synthetic data, but some errors still inevitably exist in the fused process.Generally speaking, for the homogeneous fine-and coarse-spatial-resolution pixels at the same location, the relationship between them can be described using the linear model as: where L and M refer to the fine-and coarse-spatial-resolution pixels (in this paper, they refer to Landsat and MODIS data, respectively) and a and b are two coefficients of the linear model.How to accurately predict these two coefficients is essential for the fusion methods based on the linear model.For STARFM and ESTARFM, coefficient a is considered to be equal to either 1 or a constant, however, the coefficients a and b may vary at different locations.Hence, the assumption of STARFM and ESTARFM may introduce some errors in fusion.On the other hand, similar to the ideas of some linear models [12,23], the coefficients a and b can be calculated by using the least square method.Generally, the pixels obtained from coarse-spatial-resolution images at the reference and prediction dates in a moving window are usually used to get the coefficients of the central pixel.Even though the regression model can accurately predict the coefficients in some regions, some fitting errors inevitably exist in the data fitting process.Especially, when the differences of pixels used to calculate the coefficients in the moving window are relatively large, caused probably by the temporal changes of landscapes, the fitting coefficients may result in great errors between the fitting values and the actual values.Theoretically, the more coarse-spatial-resolution pixels are considered, the lower the fitting errors that will be generated.Hence, we proposed an enhanced linear spatio-temporal fusion model (ELSTFM).The ELSTFM is based on the linear model, but it is not necessary to calculate coefficient a and coefficient b is acquired based on the unmixing-based theory (discussed in Section 2).There are two main advances of the ELSTFM: (1) the two coefficients a and b are not invariable and can be determined locally; and (2) the fitting process is not necessary, so the fitting errors can be partly depressed, meanwhile, the subjective window size setting in the fitting process also becomes an unnecessary step.Therefore, the ELSTFM can enhance the accuracy of spatio-temporal fusion for different landscapes.In the remainder of the paper, the ELSTFM will be introduced in Section 2; then the experiments and results will be described in Section 3; the details of ELSTFM will be discussed in Section 4; finally, the paper will be concluded in the last section.

Theoretical Basis of the ELSTFM
The ELSTFM requires one fine-spatial-resolution image at the reference date and two coarse-spatial-resolution images at the reference and prediction dates.Additionally, assume that the two coarse-spatial-resolution images have been resampled and clipped to the same spatial resolution and extent of the fine-spatial-resolution image.The difference of reflectance between a given fine-spatial-resolution pixel obtained from homogeneous landscapes and its corresponding resampled coarse-spatial-resolution pixel only depends on the systematic biases [11].Hence, the relationship between the fine-and coarse-spatial-resolution reflectance can be described as: where L, M denote the fine-and coarse-spatial-resolution reflectance; a is the slope of the linear model; b can be regarded as the residual caused by the systematic biases; m and n are the coordinates of a given pixel for both coarse-and fine-spatial-resolution images; t 0 denotes the reference date; and B is a given band.The coefficients a and b may be not unique at different locations and need to be calculated locally due to the differences of atmospheric condition, solar angle, and altitude [11].These two coefficients can be regarded as invariable in a short period, hence, for the prediction date, the linear model can be expressed as: where t k is the prediction date.Based on Equation ( 2), the coefficient a can be computed as: Thus, substitute coefficient a in Equation (3) with Equation ( 4) and the final prediction can be calculated as: According to Equation ( 5), the reflectance of a given fine-spatial-resolution pixel at the prediction date is the sum of the reflectance of fine-spatial-resolution pixel at the reference date and the product of that reflectance and a coefficient that only depends on the resampled coarse-spatial-resolution images at the reference and prediction dates as well as the residual of the corresponding pixel.
In reality, the coarse-spatial-resolution pixels are often not homogeneous in most regions, i.e., there may be more than one land-cover type that exists inside a given coarse-spatial-resolution pixel.To obtain a more accurate prediction, like most linear model methods such as STARFM, ESTARFM, Fit_FC, the neighboring spectrally similar pixels are introduced in the ELSTFM.Thus, the final prediction of a fine-spatial-resolution pixel can be calculated as: where ω denotes the searching window size, and was set to 1500 m; and W i,j refers to the weight of the similar pixel (i, j).

The Similar Pixels Determination and Weights Computation
The similar pixels determined by spectral differences in the neighborhood can offer additional information for central pixel prediction.The spectral difference between a given pixel (m, n) and its neighboring pixel at (x, y) is calculated as: where num_band is the total number of bands.Similar to previous studies [12,29], in the local window with ω by ω fine pixels, the first k pixels with the smallest D were identified as spectrally similar neighbors.In this paper, the value of k was set to 30 and the value of k for the other methods that were used to compare with the proposed method, such as FSDAF, was also set to 30.The contributions of similar pixels for the central pixel prediction are weighted by the spatial distances.Therefore, the spatial distances between the similar pixels and the central pixel were calculated first, then the normalized distances were used as the weights and assigned to the corresponding similar pixels: where d i,j refers to the spatial distance between the central pixel (m, n) and the similar pixel (i, j), and the constants 1 and ω/2 are used to ensure that the distances are constrained from 1 to 1 + √ 2; ns is the number of similar pixels in the neighborhood; and S is the set of similar pixels in the neighborhood.

The Residuals Computation
The residuals are essential for the ELSTFM and can influence the fusion accuracy.It is difficult to directly obtain residuals based on the linear model as the coefficient a is unknown.In this study, we used the unmixing-based theory to get the residuals at fine-resolution.
Based on the unmixing theory, for a given original coarse-spatial-resolution pixel (i.e., coarse-spatial-resolution pixel without resample), its value is equal to the sum of the values of all fine-spatial-resolution pixels inside it and the corresponding residual of that coarse-spatial-resolution pixel [13].Therefore, the relationship can be written as: L(p, q, t 0 , B) + ξ (10) where Mo(M, N, t 0 , B) is the origin coarse-spatial-resolution pixel at (M, N); nf and L(p, q, t 0 , B) are the number of fine-spatial-resolution pixels and the reflectance of fine-spatial-resolution pixel at (p, q) inside the pixel (M, N), respectively; and ξ is the corresponding residual of pixel (M, N).First, the residual of a given coarse-spatial-resolution pixel can be considered unchanged in a short period [29].Hence, the residuals obtained at the reference and prediction dates can be deemed as equal.Additionally, the residuals that can be deemed as the system difference between two sensors only results from the differences in solar geometry and bandwidth [13], hence, we can assume that there is no significant change in the system difference within one coarse-spatial-resolution pixel.Therefore, the residuals of fine-spatial-resolution pixels inside a given coarse-spatial-resolution pixel should be identical.The coefficient b can be calculated as: where b (p, q, t 0 , B) and b (p, q, t k , B) denote the residuals of the fine-spatial-resolution pixel (p, q) located in the corresponding coarse-spatial-resolution pixel (M, N) at the reference and prediction dates.
It is notable that the residuals of fine-spatial-resolution pixels are only identical in the corresponding coarse-spatial-resolution pixels.

Implementation Process of ELSTFM
In practice, two coarse-spatial-resolution images obtained at the reference date (t 0 ) and prediction date (t k ) as well as one fine-spatial-resolution image acquired at t 0 are necessary for fine-spatial-resolution image prediction.ELSTFM implementation mainly contains three steps: (1) the fine-spatial-resolution image at t 0 is first resampled to the same spatial resolution of coarse-spatial-resolution image by utilizing the aggregate resampling approach, then the residuals of fine-spatial-resolution pixels can be computed based on Equations ( 10) and ( 11); (2) for a given fine-spatial-resolution pixel, the neighboring similar pixels and their corresponding weights are calculated, and the coarse-spatial-resolution images at t 0 and t k are resampled to the same spatial resolution of the fine-spatial-resolution image; and (3) the ELSTFM is utilized to synthesize the final fine-spatial-resolution image at t k based on Equation (6).The flowchart of the ELSTFM is shown in Figure 1.
Remote Sens. 2018, 10, x FOR PEER REVIEW 6 of 18 where Mo(M, N, t0, B) is the origin coarse-spatial-resolution pixel at (M, N); nf and L(p, q, t0, B) are the number of fine-spatial-resolution pixels and the reflectance of fine-spatial-resolution pixel at (p, q) inside the pixel (M, N), respectively; and ξ is the corresponding residual of pixel (M, N).First, the residual of a given coarse-spatial-resolution pixel can be considered unchanged in a short period [29].Hence, the residuals obtained at the reference and prediction dates can be deemed as equal.Additionally, the residuals that can be deemed as the system difference between two sensors only results from the differences in solar geometry and bandwidth [13], hence, we can assume that there is no significant change in the system difference within one coarse-spatial-resolution pixel.Therefore, the residuals of fine-spatial-resolution pixels inside a given coarse-spatial-resolution pixel should be identical.The coefficient b can be calculated as: where b (p, q, t0, B) and b (p, q, tk, B) denote the residuals of the fine-spatial-resolution pixel (p, q) located in the corresponding coarse-spatial-resolution pixel (M, N) at the reference and prediction dates.It is notable that the residuals of fine-spatial-resolution pixels are only identical in the corresponding coarse-spatial-resolution pixels.

Implementation Process of ELSTFM
In practice, two coarse-spatial-resolution images obtained at the reference date (t0) and prediction date (tk) as well as one fine-spatial-resolution image acquired at t0 are necessary for finespatial-resolution image prediction.ELSTFM implementation mainly contains three steps: (1) the fine-spatial-resolution image at t0 is first resampled to the same spatial resolution of coarse-spatialresolution image by utilizing the aggregate resampling approach, then the residuals of fine-spatialresolution pixels can be computed based on Equations ( 10) and ( 11); (2) for a given fine-spatialresolution pixel, the neighboring similar pixels and their corresponding weights are calculated, and the coarse-spatial-resolution images at t0 and tk are resampled to the same spatial resolution of the fine-spatial-resolution image; and (3) the ELSTFM is utilized to synthesize the final fine-spatialresolution image at tk based on Equation (6).The flowchart of the ELSTFM is shown in Figure 1.

Experimental Data and Pre-Processing
To better compare with the existing fusion methods, similar to previous studies, two typical study sites of Landsat-7/ETM+ and 8-day composite 500-m resolution MODIS surface reflectance products (MOD09A1) data which have been used to validate STARFM, ESTARFM, and FSDAF algorithms [29,37] were selected.Landsat-7/ETM+ and MODIS sensors can acquire images of a roughly 185 km × 185 km area and 2330 km (cross track) × 10 km (along track at nadir) area, respectively.The 8-day composite MODIS data instead of daily MODIS data were used because of their relatively high data quality.These two kinds of data can be downloaded from the United States Geological Survey (USGS) EarthExplorer (http://glovis.usgs.gov/)and the National Aeronautics and Space Administration (NASA) (https://modis.gsfc.nasa.gov/),respectively.
The NIR-red-green as the RGB composite of Landsat-7/ETM+ images and their corresponding MODIS images of the first site with heterogeneous landscape is shown in Figure 2. We used two cloud-free Landsat-7/ETM+ images (Path/Row 93/84) acquired on 11 January (Figure 2a, used as input fine-spatial-resolution image) and 12 February 2002 (Figure 2b, used to validate the experimental accuracy) with six bands (the 1st-5th bands and the 7th band) covering an area of approximately 24 × 15 km 2 (800 × 496 fine-spatial-resolution pixels) in southern New South Wales, Australia (145 • E, 34 • S).This heterogonous site can be deemed as a benchmark for comparing or testing spatio-temporal fusion methods and has been widely used in spatio-temporal fusion studies.The major land cover types in this area are irrigated rice cropland, dryland agriculture, and woodlands.Meanwhile, two MOD09A1 images acquired on 9 January (Figure 2c, used as the coarse-spatial-resolution image at the reference date) and 10 February 2002 (Figure 2d, used as the coarse-spatial-resolution image at the prediction date) with six bands (the 1st-4th bands and 6th and 7th bands) covering the same area (h29v12) were selected.
Remote Sens. 2018, 10, x FOR PEER REVIEW 7 of 18 To better compare with the existing fusion methods, similar to previous studies, two typical study sites of Landsat-7/ETM+ and 8-day composite 500-m resolution MODIS surface reflectance products (MOD09A1) data which have been used to validate STARFM, ESTARFM, and FSDAF algorithms [29,37] were selected.Landsat-7/ETM+ and MODIS sensors can acquire images of a roughly 185 km × 185 km area and 2330 km (cross track) × 10 km (along track at nadir) area, respectively.The 8-day composite MODIS data instead of daily MODIS data were used because of their relatively high data quality.These two kinds of data can be downloaded from the United States Geological Survey (USGS) EarthExplorer (http://glovis.usgs.gov/)and the National Aeronautics and Space Administration (NASA) (https://modis.gsfc.nasa.gov/),respectively.
The NIR-red-green as the RGB composite of Landsat-7/ETM+ images and their corresponding MODIS images of the first site with heterogeneous landscape is shown in Figure 2. We used two cloud-free Landsat-7/ETM+ images (Path/Row 93/84) acquired on 11 January (Figure 2a, used as input fine-spatial-resolution image) and 12 February 2002 (Figure 2b, used to validate the experimental accuracy) with six bands (the 1st-5th bands and the 7th band) covering an area of approximately 24 × 15 km 2 (800 × 496 fine-spatial-resolution pixels) in southern New South Wales, Australia (145°E, 34°S).This heterogonous site can be deemed as a benchmark for comparing or testing spatio-temporal fusion methods and has been widely used in spatio-temporal fusion studies.The major land cover types in this area are irrigated rice cropland, dryland agriculture, and woodlands.Meanwhile, two MOD09A1 images acquired on 9 January (Figure 2c, used as the coarse-spatial-resolution image at the reference date) and 10 February 2002 (Figure 2d, used as the coarse-spatial-resolution image at the prediction date) with six bands (the 1st-4th bands and 6th and 7th bands) covering the same area (h29v12) were selected.The second site was located in Canada (54 • N, 104 • W), covering an area of approximate 11.5 × 17.7 km 2 (384 × 592 fine-spatial-resolution pixels), where the growing season is short and phenology changes are extreme [13].The major land-cover in this region is forest, with subsidiary fen and patches of sparse vegetation (soil, rock, burn scars).This site has been used to validate the STARFM method and can be regarded as a homogenous area.The NIR-red-green as the RGB composite of the Landsat-7/ETM+ images and their corresponding MODIS images are shown in Figure 3. Two Landsat-7/ETM+ images (Path/Row 36/22) acquired on 17 May (Figure 3a, used as input fine-spatial-resolution image) and 2 June 2001 (Figure 3b, used to validate the experimental accuracy) with six bands (the 1st-5th bands and the 7th band) utilized.Additionally, the two input MOD09A1 images (h11v03) with six bands (the 1st-4th bands and the 6th and 7th bands) at the reference and prediction dates were acquired on 17 May (Figure 3c) and 2 June 2001 (Figure 3d), respectively.The second site was located in Canada (54°N, 104°W), covering an area of approximate 11.5 × 17.7 km 2 (384 × 592 fine-spatial-resolution pixels), where the growing season is short and phenology changes are extreme [13].The major land-cover in this region is forest, with subsidiary fen and patches of sparse vegetation (soil, rock, burn scars).This site has been used to validate the STARFM method and can be regarded as a homogenous area.The NIR-red-green as the RGB composite of the Landsat-7/ETM+ images and their corresponding MODIS images are shown in Figure 3. Two Landsat-7/ETM+ images (Path/Row 36/22) acquired on 17 May (Figure 3a, used as input fine-spatial-resolution image) and 2 June 2001 (Figure 3b, used to validate the experimental accuracy) with six bands (the 1st-5th bands and the 7th band) utilized.Additionally, the two input MOD09A1 images (h11v03) with six bands (the 1st-4th bands and the 6th and 7th bands) at the reference and prediction dates were acquired on 17 May (Figure 3c) and 2 June 2001 (Figure 3d), respectively.All experimental fine-and coarse-spatial-resolution images were preprocessed before predicting the synthetic image.The Landsat ecosystem disturbance adaptive processing system (LEDAPS) was employed to atmospherically correct and calibrate the digital numbers from the Landsat level 1 product.The Landsat-7/ETM+ and MODIS surface reflectance data can be deemed as consistent and comparable because of the same atmospheric correction approach used for Landsat and MODIS data [13,29].Additionally, the ISODATA method was used in this study to generate the classification maps of the two study sites to meet the requirements of the FSDAF algorithm.
By utilizing the MODIS reprojection tool (MRT), the MOD09A1 images were first reprojected to the universal transverse mercator (UTM) projection, then resampled to a 30 m spatial resolution based on the nearest neighbor method, and finally clipped to the same extent of the Landsat-7/ETM+ All experimental fine-and coarse-spatial-resolution images were preprocessed before predicting the synthetic image.The Landsat ecosystem disturbance adaptive processing system (LEDAPS) was employed to atmospherically correct and calibrate the digital numbers from the Landsat level 1 product.The Landsat-7/ETM+ and MODIS surface reflectance data can be deemed as consistent and comparable because of the same atmospheric correction approach used for Landsat and MODIS data [13,29].Additionally, the ISODATA method was used in this study to generate the classification maps of the two study sites to meet the requirements of the FSDAF algorithm.
By utilizing the MODIS reprojection tool (MRT), the MOD09A1 images were first reprojected to the universal transverse mercator (UTM) projection, then resampled to a 30 m spatial resolution based on the nearest neighbor method, and finally clipped to the same extent of the Landsat-7/ETM+ images for the two study sites.For convenience, to match 16 times the spatial resolution of the Landsat images, the MODIS at 500 m resolution were resampled into 480 m by using the nearest neighbor approach.The bandwidths of MODIS and Landsat-7/ETM+ data were similar even though the bandwidths of MODIS data were narrower than those of Landsat-7/ETM+ data (Table 1).

Quantitative Comparison
The performance of ELSTFM was quantitatively and visually compared with the Fit_FC algorithm [12], FSDAF algorithm [29], and STARFM algorithm [13].These algorithms are based on either a linear model or the unmixing model and have been widely used.Five indices including the correlation coefficient (γ), root mean square error (RMSE), structural similarity index (SSIM), average absolute difference (AAD), and ERGAS were used to evaluate these four fusion methods.The higher 7 indicates better linear correlativity between the predicted and actual data; SSIM can assess the similarity of the overall structure (the best value is 1); lower RMSE and AAD values represent better fusion performance of the corresponding method; and a lower ERGAS value means better image fusion quality.These metrics can be calculated as follows: where A denotes the total pixels number of fine-spatial-resolution image; k a and l a are the values of the ath synthetic and actual fine-spatial-resolution pixels; k and l are the mean values of the synthetic and actual images; r h and r l represent the resolutions of high and low spatial resolution images (in this paper, the spatial resolutions of Landsat-7/ETM+ data and MODIS data are set to 30 m and 480 m); RMSE B is the RMSE value of the Bth band between the synthetic and actual images; l B is the average value of the Bth band of the actual image; σ k and σ l are variances of the synthetic and actual images; σ kl denotes the covariance of the synthetic and actual images; C 1 , C 2 are two small constants to avoid outliers, and in this study, they are both equal to 0.001.

Results for Experimental Site 1
Figure 4b-e shows the predicted images on 12 February 2002 generated by the four data fusion methods and Figure 4a is the actual Landsat-7/ETM+ image acquired on 12 February 2002.A zoom-in area marked by the yellow square in Figure 4a was also used to highlight the difference between the predicted images and the actual image (Figure 4f-j).As a whole, from the visual comparison, the predicted images were all similar to the actual image in most regions.However, when compared with the other three methods, the predicted image obtained by ELSTFM was closer to the actual image with regard to the spatial details for some special regions such as the zoom-in images in Figure 4f-j.We can see that the images obtained by the FSDAF and STARFM methods could capture the main spatial details, however, some pixels were not appropriately predicted; the image obtained by the Fit_FC method seemed to be a little over-smooth such that some spatial details were missing, which may have resulted from the window size setting for coefficients fitting in Fit_FC.

Results for Experimental Site 1
Figure 4b-e shows the predicted images on 12 February 2002 generated by the four data fusion methods and Figure 4a is the actual Landsat-7/ETM+ image acquired on 12 February 2002.A zoomin area marked by the yellow square in Figure 4a was also used to highlight the difference between the predicted images and the actual image (Figure 4f-j).As a whole, from the visual comparison, the predicted images were all similar to the actual image in most regions.However, when compared with the other three methods, the predicted image obtained by ELSTFM was closer to the actual image with regard to the spatial details for some special regions such as the zoom-in images in Figure 4f-j.We can see that the images obtained by the FSDAF and STARFM methods could capture the main spatial details, however, some pixels were not appropriately predicted; the image obtained by the Fit_FC method seemed to be a little over-smooth such that some spatial details were missing, which may have resulted from the window size setting for coefficients fitting in Fit_FC.We used the first band (blue band) of the Landsat-7/ETM+ data as an example to analyze the differences between the predicted values and actual values pixel-by-pixel.As shown in Figure 5, the scatter plots by the four methods confirmed that the values predicted by ELSTFM and Fit-FC seemed to be more accurate than the other two methods, which corresponded to the statistical results (Table 2) that ELSTFM and Fit-FC were the better fusion methods for the objective image.However, when compared with the other three methods, the ELSTFM could best predict the lower reflectance values, which can partly explain the best fusion accuracy obtained by ELSTFM method.
Table 2 shows the five quantitative indices calculated by comparing the predicted results based on the ELSTFM, Fit_FC, FSDAF, and STARFM methods with the actual Landsat-7/ETM+ image acquired on 12 February 2002.We can see that with the exception of band 2, the results of ELSTFM had the highest γ and smallest RMSE; for all six bands, the results of ELSTFM had the highest SSIM and smallest AAD; ERGASs obtained by ELSTFM, Fit_FC, FSDAF, and STARFM methods were 1.4943, 1.7561, 1.9920, and 1.9630, respectively.All statistical results suggested that ELSTFM could acquire the best fusion accuracy among these methods for site 1.Similar to previous studies [29], the NIR band had the largest difference in accuracy between the ELSTFM and the other three methods (γ 0.5786 vs. 0.3787, 0.4248 and 0.5262, SSIM 0.5931 vs. 0.4019, 0.4664, and 0.5598, RMSE 0.0655 vs.We used the first band (blue band) of the Landsat-7/ETM+ data as an example to analyze the differences between the predicted values and actual values pixel-by-pixel.As shown in Figure 5, the scatter plots by the four methods confirmed that the values predicted by ELSTFM and Fit-FC seemed to be more accurate than the other two methods, which corresponded to the statistical results (Table 2) that ELSTFM and Fit-FC were the better fusion methods for the objective image.However, when compared with the other three methods, the ELSTFM could best predict the lower reflectance values, which can partly explain the best fusion accuracy obtained by ELSTFM method.
Table 2 shows the five quantitative indices calculated by comparing the predicted results based on the ELSTFM, Fit_FC, FSDAF, and STARFM methods with the actual Landsat-7/ETM+ image acquired on 12 February 2002.We can see that with the exception of band 2, the results of ELSTFM had the highest γ and smallest RMSE; for all six bands, the results of ELSTFM had the highest SSIM and smallest AAD; ERGASs obtained by ELSTFM, Fit_FC, FSDAF, and STARFM methods were 1.4943, 1.7561, 1.9920, and 1.9630, respectively.All statistical results suggested that ELSTFM could acquire the best fusion accuracy among these methods for site 1.Similar to previous studies [29], the NIR band had the largest difference in accuracy between the ELSTFM and the other three methods (γ 0.5786 vs. 0.3787, 0.4248 and 0.5262, SSIM 0.5931 vs. 0.4019, 0.4664, and 0.5598, RMSE 0.0655 vs. 0.0750, 0.0774, and 0.0719, AAD 0.0487 vs. 0.0569, 0.0591, and 0.0535).For all six bands, by using the ELSTFM method, the average enhancements of γ as well as SSIM were 0.0991, 0.0456, and 0.0462 as well as 0.0899, 0.0648, and 0.0496; and meanwhile, the mean decreases of RMSE as well as AAD were 0.0072, 0.0089, and 0.0077 as well as 0.0070, 0.0079, and 0.0063.Additionally, because actual MODIS data were utilized in this paper, the radiometric and geometric inconsistencies may lead to some differences between the results obtained in this study and a previous study which used simulated MODIS data [29].
Remote Sens. 2018, 10, x FOR PEER REVIEW 11 of 18 0.0750, 0.0774, and 0.0719, AAD 0.0487 vs. 0.0569, 0.0591, and 0.0535).For all six bands, by using the ELSTFM method, the average enhancements of γ as well as SSIM were 0.0991, 0.0456, and 0.0462 as well as 0.0899, 0.0648, and 0.0496; and meanwhile, the mean decreases of RMSE as well as AAD were 0.0072, 0.0089, and 0.0077 as well as 0.0070, 0.0079, and 0.0063.Additionally, because actual MODIS data were utilized in this paper, the radiometric and geometric inconsistencies may lead to some differences between the results obtained in this study and a previous study which used simulated MODIS data [29].Table 2.The fusion accuracies obtained from four data fusion methods for study site 1 (Figure 4).The units are reflectance.Table 2.The fusion accuracies obtained from four data fusion methods for study site 1 (Figure 4).The units are reflectance.A zoom-in area marked by the yellow square in Figure 6a was used to highlight the difference between the predicted images and the actual image (Figure 6f-j).Compared with site 1, site 2 seems to be more homogeneous.From the visual comparison, all these methods could acquire similar fusion results, but like the conclusions about site 1, the fused image obtained by the Fit_FC method seemed to be a little over-smooth and some details were missing; the differences between the results obtained by ELSTFM, FSDAF, and STARFM were not obvious (Figure 6f-j), suggesting that for the homogeneous region, all these methods could generate visually similar results.

Results for Experimental Site 2
Figure 6b-e shows the predicted images for experimental site 2 on 2 June 2001 generated by the four data fusion methods and Figure 6a is the actual Landsat-7/ETM+ image acquired on 2 June 2001.A zoom-in area marked by the yellow square in Figure 6a was used to highlight the difference between the predicted images and the actual image (Figure 6f-j).Compared with site 1, site 2 seems to be more homogeneous.From the visual comparison, all these methods could acquire similar fusion results, but like the conclusions about site 1, the fused image obtained by the Fit_FC method seemed to be a little over-smooth and some details were missing; the differences between the results obtained by ELSTFM, FSDAF, and STARFM were not obvious (Figure 6f-j), suggesting that for the homogeneous region, all these methods could generate visually similar results.Figure 7 shows the scatter plots of the actual vs. predicted blue band of the Landsat-7/ETM+ data by the four methods.As shown in Figure 7, the ELSTFM, FSDAF, and STARFM methods could obtain similar distributions, however, some predicted values from Fit_FC seemed to be a little lower than the actual values.Table 3 gives the accuracies obtained by the four fusion methods.For all six bands, the results of the ELSTFM had the highest and SSIM, suggesting that the ELSTFM method acquired the most accurate predicted images.Even though for bands 2, 3, and 7, the results of Fit_FC had the smallest RMSE and AAD, the differences between those values obtained by ELSTFM and Fit_FC were not obvious.Additionally, the ERGASs obtained by ELSTFM, Fit_FC, FSDAF, and STARFM were 0.9636, 1.0053, 1.0278, and 1.0638, respectively.In addition, by using the ELSTFM, the average enhancements of as well as SSIM were 0.0769, 0.0319, and 0.0222 as well as 0.0128, 0.0111, and 0.0092; the mean decreases of RMSE as well as AAD were 0.0002, 0.0007, and 0.0010 as well as 0.0005, 0.0010, and 0.0008.Figure 7 shows the scatter plots of the actual vs. predicted blue band of the Landsat-7/ETM+ data by the four methods.As shown in Figure 7, the ELSTFM, FSDAF, and STARFM methods could obtain similar distributions, however, some predicted values from Fit_FC seemed to be a little lower than the actual values.Table 3 gives the accuracies obtained by the four fusion methods.For all six bands, the results of the ELSTFM had the highest 7 and SSIM, suggesting that the ELSTFM method acquired the most accurate predicted images.Even though for bands 2, 3, and 7, the results of Fit_FC had the smallest RMSE and AAD, the differences between those values obtained by ELSTFM and Fit_FC were not obvious.Additionally, the ERGASs obtained by ELSTFM, Fit_FC, FSDAF, and STARFM were 0.9636, 1.0053, 1.0278, and 1.0638, respectively.In addition, by using the ELSTFM, the average enhancements of 7 as well as SSIM were 0.0769, 0.0319, and 0.0222 as well as 0.0128, 0.0111, and 0.0092; the mean decreases of RMSE as well as AAD were 0.0002, 0.0007, and 0.0010 as well as 0.0005, 0.0010, and 0.0008.Table 3.The fusion accuracies obtained from four data fusion methods for study site 2 (Figure 6).The units are reflectance.

The Influences of the Resampling Algorithm
It is necessary to resample the original MODIS data to the same spatial resolution extent of the fine-spatial-resolution image in the ELSTFM method.Hence, the resampling method has a direct influence on the fusion accuracy of ELSTFM.In this part, we used the resampled 480 m spatial resolution MODIS images at the reference and prediction dates as the original input coarse-spatialresolution images.Then, the thin plate spline (TPS) method, which has been used in a previous data fusion study [29], and the nearest neighbor interpolation (NNI) method were used to downscale the

3.
The fusion accuracies obtained from four data fusion methods for study site 2 (Figure 6).The units are reflectance.

The Influences of the Resampling Algorithm
It is necessary to resample the original MODIS data to the same spatial resolution extent of the fine-spatial-resolution image in the ELSTFM method.Hence, the resampling method has a direct influence on the fusion accuracy of ELSTFM.In this part, we used the resampled Remote Sens. 2018, 10, 881 14 of 18 480 m spatial resolution MODIS images at the reference and prediction dates as the original input coarse-spatial-resolution images.Then, the thin plate spline (TPS) method, which has been used in a previous data fusion study [29], and the nearest neighbor interpolation (NNI) method were used to downscale the coarse-spatial-resolution images.Finally, the fusion accuracies based on different resampling methods were compared.
Table 4 gives the correlation coefficients between the resampled 30 m spatial resolution MODIS images obtained by the two resampling methods and the corresponding actual Landsat-7/ETM+ images for site 1 and site 2. We can see that for the two study sites, the correlation coefficients of all six bands obtained by the TPS method were all higher than those obtained by the NNI method.Additionally, the correlation coefficients of all six bands for site 2 were relatively higher than for site 1, suggesting that these two resampling methods could obtain better downscaled images for the homogeneous area.Table 5 gives the accuracies of the ELSTFM applied to the resampled MODIS data obtained by the TPS and NNI methods for site 1 and site 2. The ERGASs using the TPS and NNI resampling methods were 1.5960 and 1.6132 for site 1, and 0.9883 and 0.9990 for site 2. We can see that for the two study sites, the results of all six bands based on the TPS resampling method were slightly better than the NNI method, suggesting that the resampling accuracy had positive effects on data fusion; on the other hand, the average differences between the results obtained by the two resampling methods were relatively small (γ: 0.0033 and 0.0032, SSIM: 0.0026 and 0.0015, RMSE: 0.0003 and 0.0002, AAD: 0.0002 and 0.0001), suggesting that the resampling method could slightly, but not significantly influence fusion accuracy.

The Distributions of Slopes in ELSTFM
The slope of the linear model formed using Landsat and MODIS data should be close to 1 because these two datasets are consistent and comparable after the same atmospheric correction process [13,29].Hence, in this part, we checked the distributions of the slopes of all six bands for two study sites to validate the ELSTFM indirectly.According to Equation (2), the slope of a pixel (m, n) can be calculated as: The mean values of the slopes of all six bands for site 1 were 1.1243, 1.1683, 1.4271, 1.0572, 1.2441, and 1.3384, respectively; and for site 2 were 0.6697, 0.9752, 0.9020, 1.1212, 1.1378, and 0.9735, respectively.The distributions of the slopes of all six bands for the two sites are shown in Figure 8.As shown in Figure 8a, all distributions of the slopes seemed to be normal distributions with the mean value of 1; similarly, for site 2, except for band 1, the distributions of the slopes of the other five bands were also normal distributions with a mean value close to 1.Both the mean values and the distributions of the slopes could confirm that the slope of line model was close to 1 because of the consistency between the Landsat and MODIS data; meanwhile, as the ELSTFM could get better fusion accuracies at different study sites, the changeable slope may enhance the fusion methods.Furthermore, we could see that the slopes of all six bands for the heterogeneous region (site 1) were higher than for the homogenous region, suggesting that the differences between Landsat-7/ETM+ and MODIS data may be relatively large in heterogeneous landscapes.

The Distributions of Slopes in ELSTFM
The slope of the linear model formed using Landsat and MODIS data should be close to 1 because these two datasets are consistent and comparable after the same atmospheric correction process [13,29].Hence, in this part, we checked the distributions of the slopes of all six bands for two study sites to validate the ELSTFM indirectly.According to Equation (2), the slope of a pixel (m, n) can be calculated as: The mean values of the slopes of all six bands for site 1 were 1.1243, 1.1683, 1.4271, 1.0572, 1.2441, and 1.3384, respectively; and for site 2 were 0.6697, 0.9752, 0.9020, 1.1212, 1.1378, and 0.9735, respectively.The distributions of the slopes of all six bands for the two sites are shown in Figure 8.As shown in Figure 8a, all distributions of the slopes seemed to be normal distributions with the mean value of 1; similarly, for site 2, except for band 1, the distributions of the slopes of the other five bands were also normal distributions with a mean value close to 1.Both the mean values and the distributions of the slopes could confirm that the slope of line model was close to 1 because of the consistency between the Landsat and MODIS data; meanwhile, as the ELSTFM could get better fusion accuracies at different study sites, the changeable slope may enhance the fusion methods.Furthermore, we could see that the slopes of all six bands for the heterogeneous region (site 1) were higher than for the homogenous region, suggesting that the differences between Landsat-7/ETM+ and MODIS data may be relatively large in heterogeneous landscapes.

The Applicability of ELSTFM
Landsat-7/ETM+ data are seriously influenced by the scan line corrector problem after 2003 and currently, the eighth satellite in the series of Landsat satellite, Landsat-8, will continue and advance the collection of Landsat data.Hence, even though Landsat-7/ETM+ and MODIS data were used in this paper to validate the ELSTFM method, the ELSTFM can also be used for other satellite data such as Landsat-8/OLI and Sentinel series data.Additionally, because ELSTFM is a pixel-based method, even though the reflectance data were used to validate the proposed method, we believe it can also accurately predict any products derived from reflectance data, such as the leaf area index (LAI), normalized difference vegetation index (NDVI), and the normalized difference water index (NDWI).
The ELSTFM is based on a linear model, yet, different from the STARFM and ESTARFM, which deem the slope as either 1 or a constant, the slope of the linear model in the ELSTFM is changeable.However, if we assume the slope is equal to 1 and invariable, by subtraction of Equations ( 2) and (3), the intercept of the linear model can be eliminated, hence the proposed method can be transformed to the STARFM method.Through the experiments in the two study sites, the ELSTFM method could

The Applicability of ELSTFM
Landsat-7/ETM+ data are seriously influenced by the scan line corrector problem after 2003 and currently, the eighth satellite in the series of Landsat satellite, Landsat-8, will continue and advance the collection of Landsat data.Hence, even though Landsat-7/ETM+ and MODIS data were used in this paper to validate the ELSTFM method, the ELSTFM can also be used for other satellite data such as Landsat-8/OLI and Sentinel series data.Additionally, because ELSTFM is a pixel-based method, even though the reflectance data were used to validate the proposed method, we believe it can also accurately predict any products derived from reflectance data, such as the leaf area index (LAI), normalized difference vegetation index (NDVI), and the normalized difference water index (NDWI).
The ELSTFM is based on a linear model, yet, different from the STARFM and ESTARFM, which deem the slope as either 1 or a constant, the slope of the linear model in the ELSTFM is changeable.However, if we assume the slope is equal to 1 and invariable, by subtraction of Equations ( 2) and (3), the intercept of the linear model can be eliminated, hence the proposed method can be transformed to the STARFM method.Through the experiments in the two study sites, the ELSTFM method could acquire better fusion results than the STARFM method.Hence, the ELSTFM can be deemed as an enhanced version of STARFM.

Conclusions
An enhanced linear spatio-temporal fusion model (ELSTFM) was proposed in this paper to generate data with both high spatial resolution and temporal frequency.The residuals of the fine-spatial-resolution pixels were first calculated based on spectral unmixing theory; the similar pixels and their corresponding weights were then computed; after that, the coarse-spatial-resolution images at the reference and prediction dates were resampled to the same spatial resolution and extent of the fine-spatial-resolution images; finally, the ELSTFM was employed to synthesize the fine-spatial-resolution image at the prediction date.Through the comparisons between ELSTFM and three other typical spatio-temporal fusion methods (Fit_FC, FSDAF, and STARFM), for the two study sites, the ELSTFM could acquire better fusion accuracies.
Additionally, the performances of the ELSTFM based on different resampling methods for the two study sites were examined.The resampling methods can slightly, but not significantly influence the ELSTFM method.Moreover, for most bands, the distributions of the slopes of the linear model for the two study sites were normal distributions with a mean value of 1.In addition, the ELSTFM was designed to synthesize not only the reflectance data, but also the satellite products derived from the reflectance data.
Some factors that can limit the performance of ELSTFM still exist and need to be overcome.The fine-spatial-resolution pixels are predicted pixel-by-pixel in the ELSTFM, and even though the neighboring similar pixels are used to decrease the influences of heterogeneity, the radiometric and atmospheric errors can seriously affect fusion accuracy.In addition, the residuals of the fine-spatial-resolution pixels are calculated by using the spectral unmixing theory in the ELSTFM, yet some errors still exist.Some other methods that can enhance the accuracies of the residuals can definitely improve the performance of the ELSTFM method.
Additionally, this research aimed to enhance the spatio-temporal fusion based on the linear model and the coefficients a and b were determined locally in this paper, which means they were regarded as temporally invariable in a short period.However, when the temporal gap between the acquisitions of input images is relatively large, which may result in large landscapes changes, these two coefficients should be determined locally and temporally.

Figure 4 .
Figure 4. Results obtained from the (b) ELSTFM algorithm; (c) Fit_FC algorithm; (d) FSDAF algorithm; and (e) STARFM algorithm for site 1 (NIR, red, green bands as RGB) on 12 February 2002.(a,f) are the actual Landsat-7/ETM+ image and the subarea marked in yellow in (a) on 12 February 2002.(g-j) are the corresponding results on 12 February 2002 for the subarea marked in yellow in (a).

Figure 4 .
Figure 4. Results obtained from the (b) ELSTFM algorithm; (c) Fit_FC algorithm; (d) FSDAF algorithm; and (e) STARFM algorithm for site 1 (NIR, red, green bands as RGB) on 12 February 2002.(a,f) are the actual Landsat-7/ETM+ image and the subarea marked in yellow in (a) on 12 February 2002.(g-j) are the corresponding results on 12 February 2002 for the subarea marked in yellow in (a).

Figure 5 .
Figure 5. Scatter plots of actual and predicted values obtained by using (a) ELSTFM; (b) Fit_FC; (c) FSDAF; and (d) STARFM for the blue band of Figure 4a, and the line is the 1:1 line.

Figure 5 .
Figure 5. Scatter plots of actual and predicted values obtained by using (a) ELSTFM; (b) Fit_FC; (c) FSDAF; and (d) STARFM for the blue band of Figure 4a, and the line is the 1:1 line.

Figure 6b -
Figure 6b-e shows the predicted images for experimental site 2 on 2 June 2001 generated by the four data fusion methods and Figure 6a is the actual Landsat-7/ETM+ image acquired on 2 June 2001.A zoom-in area marked by the yellow square in Figure6awas used to highlight the difference between the predicted images and the actual image (Figure6f-j).Compared with site 1, site 2 seems to be more homogeneous.From the visual comparison, all these methods could acquire similar fusion results, but like the conclusions about site 1, the fused image obtained by the Fit_FC method seemed to be a little over-smooth and some details were missing; the differences between the results obtained by ELSTFM, FSDAF, and STARFM were not obvious (Figure6f-j), suggesting that for the homogeneous region, all these methods could generate visually similar results.

Figure 6 .
Figure 6.Results obtained from the (b) ELSTFM algorithm, (c) Fit_FC algorithm, (d) FSDAF algorithm, and (e) STARFM algorithm for site 2 (NIR, red, green bands as RGB) on 2 June 2001.(a,f) are the actual Landsat-7/ETM+ image and the subarea marked in yellow on 2 June 2001.(g-j) are the corresponding results on 2 June 2001 for the subarea marked in yellow in (a).

Figure 6 .
Figure 6.Results obtained from the (b) ELSTFM algorithm, (c) Fit_FC algorithm, (d) FSDAF algorithm, and (e) STARFM algorithm for site 2 (NIR, red, green bands as RGB) on 2 June 2001.(a,f) are the actual Landsat-7/ETM+ image and the subarea marked in yellow on 2 June 2001.(g-j) are the corresponding results on 2 June 2001 for the subarea marked in yellow in (a).

Figure 7 .
Figure 7. Scatter plots of actual and predicted values obtained by using (a) ELSTFM; (b) Fit_FC; (c) FSDAF; and (d) STARFM for the blue band of Figure 6a, and the line is the 1:1 line.

Figure 7 .
Figure 7. Scatter plots of actual and predicted values obtained by using (a) ELSTFM; (b) Fit_FC; (c) FSDAF; and (d) STARFM for the blue band of Figure 6a, and the line is the 1:1 line.

Figure 8 .
Figure 8.The distributions of the slopes of all six bands for site 1 (a) and site 2 (b).

Figure 8 .
Figure 8.The distributions of the slopes of all six bands for site 1 (a) and site 2 (b).

Table 4 .
The correlation coefficients between the resampled 30 m spatial resolution MODIS images obtained by two resampling methods and the corresponding actual Landsat-7/ETM+ images for site 1 and site 2. The units are reflectance.ref and pre represent the Landsat-7/ETM+ images at the reference and prediction dates.

Table 5 .
Accuracy assessment of ELSTFM applied to the resampled MODIS data obtained by thin plate spline (TPS) and nearest neighbor interpolation (NNI) methods for site 1 and site 2. The units are reflectance.