A Comprehensive and Automated Fusion Method: The Enhanced Flexible Spatiotemporal DAta Fusion Model for Monitoring Dynamic Changes of Land Surface

Abstract: Spatiotemporal fusion methods provide an effective way to generate data with both high temporal and high spatial resolution for monitoring dynamic changes of land surface. However, existing fusion methods face two main challenges: monitoring abrupt change events and accurately preserving the spatial details of objects. The Flexible Spatiotemporal DAta Fusion method (FSDAF) can monitor abrupt change events, but its predicted images lack intra-class variability and spatial details. To overcome these limitations, this study proposed a comprehensive and automated fusion method, the Enhanced FSDAF (EFSDAF), and tested it for Landsat-MODIS image fusion. Compared with FSDAF, EFSDAF has the following strengths: (1) it considers the mixed-pixel phenomenon of the Landsat image, so the images predicted by EFSDAF have more intra-class variability and spatial details; (2) it adjusts the differences between Landsat images and MODIS images; and (3) it improves the fusion accuracy in abrupt change areas by introducing a new residual index (RI). Vegetation phenology and flood events were selected to evaluate the performance of EFSDAF. Its performance was compared with the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM), the Spatial and Temporal Reflectance Unmixing Model (STRUM), and FSDAF. Results show that EFSDAF can monitor the changes of vegetation (gradual change) and flood (abrupt change), and that the images fused by EFSDAF are the best under both visual and quantitative evaluation. More importantly, EFSDAF can accurately generate the spatial details of objects and has strong robustness. Owing to these advantages, EFSDAF has great potential for monitoring long-term dynamic changes of land surface.


Introduction
High spatial resolution remote sensing images with highly frequent observations are significant for monitoring dynamic changes of land surface [1][2][3][4], especially for monitoring short-term and abrupt change events such as floods and forest fires [5,6]. However, due to technical and budget limitations, existing single satellite sensors cannot simultaneously provide high temporal and high spatial resolution images [2,7,8]. For example, the remote sensing images acquired by the Landsat, SPOT, and Sentinel 2 satellites have a high spatial resolution (10-30 m) but a long revisit period of 5-30 days (hereinafter called "fine resolution" images). Other satellite images (e.g., MODIS and VIIRS) have a daily sampling frequency, but their spatial resolution ranges between 250 and 1000 m (hereinafter called "coarse resolution" images). To overcome this constraint, spatiotemporal fusion methods have been developed to synthesize high spatial and temporal resolution images for monitoring the dynamic changes of land surface by fusing coarse resolution images with fine resolution images [9,10]. In the past decade, synthetic high spatiotemporal resolution images have been widely used in vegetation phenology monitoring [9,11], urban surface temperature studies [12][13][14], urbanization [15], crop yield estimation [16,17], and the monitoring of sudden and short-term change events (e.g., floods) [18].
According to the literature, existing spatiotemporal fusion methods can be classified into three categories [19][20][21]: weight function-based, unmixing-based, and learning-based. Weight function-based methods estimate the fine resolution images by weighting the information of similar neighboring pixels in all input images. The Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) [7] was the earliest and most widely used spatiotemporal fusion method; it blends Landsat and MODIS imagery to synthesize Landsat-like images with a weighted function, but it cannot predict heterogeneous landscapes. The Spatial Temporal Adaptive Algorithm for mapping Reflectance Change (STAARCH) [2] and the Enhanced STARFM (ESTARFM) [8] were later developed to improve STARFM's performance in disturbance areas and heterogeneous areas, respectively. However, both STAARCH and ESTARFM require two pairs of MODIS and Landsat images as input data, which makes them less suitable in cloudy places. Unmixing-based methods predict the fine resolution images by unmixing the coarse resolution images based on spectral mixing theory. The Multi-sensor Multi-resolution Technique (MMT) [22] was the original unmixing-based method, and later unmixing-based fusion methods are improvements of MMT; the Spatial Temporal Data Fusion Approach (STDFA) [23] and the Spatial and Temporal Reflectance Unmixing Model (STRUM) [24] are typical examples. Unmixing-based methods have the advantage of high computational efficiency, which allows them to generate time series of images with high spatiotemporal resolution. However, they only unmix the coarse resolution images and do not consider the mixed-pixel phenomenon of the fine resolution images, so their predicted fine resolution images lack intra-class variability and spatial details [19,25].
In addition, most weight function-based and unmixing-based methods assume that the land cover does not change between the base date and the prediction date, so they cannot monitor abrupt change events.
Learning-based methods, proposed in recent years, employ machine learning algorithms to learn the relationship between coarse resolution and fine resolution images [26] and simulate the fine resolution images based on the structural similarity of the input images. The SParse-representation-based SpatioTemporal reflectance Fusion Model (SPSTFM) [27] was the first learning-based algorithm; it established a correspondence between the fine resolution and coarse resolution images through dictionary pair learning to generate the predicted fine resolution images. Following SPSTFM, Song [28] developed another dictionary-based algorithm that requires only one pair of coarse resolution and fine resolution images. An important strength of the learning-based methods is that they can predict land cover change events. Although many learning-based fusion methods have recently been proposed, such as Error-Bound-regularized SParse coding (EBSPTM) [29], the Extreme Learning Machine (ELM) [30], and Hybrid Color Mapping (HCM) [31], learning-based fusion is relatively novel and has not been widely used [32]. In addition, learning-based methods only consider the statistical relationships between the fine resolution and coarse resolution images rather than any physical properties of the remote sensing signals [28], so their predicted fine resolution images cannot exactly preserve the spatial details and shapes of objects.
The literature above shows that existing fusion methods need two main improvements: monitoring abrupt change events and accurately capturing the spatial details of the land surface. To address these problems, hybrid methods have been proposed in recent years, of which the Flexible Spatiotemporal DAta Fusion method (FSDAF) [32] is a typical example. FSDAF is based on spectral unmixing analysis and a thin plate spline (TPS) interpolator, combined with a weighted function, to predict fine resolution images. Because the information of land cover change can be captured by applying a TPS interpolator to the MODIS image at the prediction date, FSDAF can monitor abrupt change events and has been widely used recently. However, FSDAF does not consider mixed pixels in the Landsat images, and its predicted fine resolution images lack intra-class variability and spatial details. Specifically, FSDAF has the following three limitations: (1) it directly performs a hard classification of the Landsat image, so its predicted Landsat-like images lack intra-class variability and spatial details; (2) it directly assigns the temporal changes of the coarse resolution images to the fine resolution images without considering the differences between the Landsat and MODIS images; and (3) it introduces a homogeneity index (HI), derived from the classification map at the base date, to guide the residual distribution; the HI is not suitable for guiding the residual distribution when there are land cover changes or misclassifications [33]. In addition, FSDAF requires more input parameters to be set before it is run, which increases its operational complexity.
To solve the above limitations of the FSDAF method, this study developed a comprehensive and automated fusion method, an enhanced FSDAF (EFSDAF), and tested it by fusing Landsat images and MODIS images. Compared with FSDAF, EFSDAF has the following strengths: (1) it considers the mixed pixels in Landsat images and uses a globally representative spectral linear mixture model (the SVD model) of Landsat pixels for spectral unmixing; (2) it adjusts the differences between Landsat images and MODIS images using a linear model; and (3) it proposes a new residual index (RI) to guide the residual distribution based on the interpolation results of the MODIS images at the base date and the prediction date. In addition, EFSDAF is an automated spatiotemporal fusion method that does not require additional input parameters compared to FSDAF. In this study, we tested EFSDAF on vegetation phenology and flood events and compared it with three other popular fusion methods: STARFM, STRUM, and FSDAF.

Definitions and Notations
Before introducing the details of EFSDAF, some important definitions and notations are given here for convenience and clarity.
T_B and T_P denote the base date and the prediction date, respectively. s² is the number of fine resolution pixels within one coarse resolution pixel. m indexes the m-th endmember within one coarse resolution pixel or one fine resolution pixel. (x_i, y_i) and (x_ij, y_ij) denote the coordinates of the i-th coarse resolution pixel and of the j-th fine resolution pixel within the i-th coarse resolution pixel, respectively, with j = 1, ..., s². The coarse resolution images observed at T_B and T_P are stored in C_B(x_i, y_i, b) and C_P(x_i, y_i, b), and the fine resolution images observed at T_B and T_P are stored in F_B(x_ij, y_ij, b) and F_P(x_ij, y_ij, b), where b indexes the bands of the coarse resolution and fine resolution images. A_F^B(x_ij, y_ij, m) and A_F^P(x_ij, y_ij, m) denote the abundances of the fine resolution pixels at T_B and T_P, and A_C^B(x_i, y_i, m) and A_C^P(x_i, y_i, m) denote the abundances of the coarse resolution pixels at T_B and T_P, respectively.

Theoretical Basis of EFSDAF
Like FSDAF, the input images of EFSDAF include a MODIS image at T_B, a Landsat image at T_B, and a MODIS image at T_P. The output of EFSDAF is a predicted fine resolution image (e.g., Landsat-like) at T_P. Unlike FSDAF, EFSDAF considers more than one land cover type in both the Landsat image (30 m) and the MODIS image (500 m). Therefore, endmember determination and abundance extraction for the Landsat image at T_B are carried out first in EFSDAF. EFSDAF includes five main steps: (1) endmember determination and spectral unmixing for the Landsat image at T_B; (2) temporal prediction (F_TP) assuming no land cover change from T_B to T_P; (3) spatial prediction (F_SP) for land cover change at T_P; (4) residual distribution using a new residual index (RI); and (5) final prediction of the Landsat-like image using neighborhood information in a sliding window. The workflow of EFSDAF is outlined by the flowchart in Figure 1. Detailed descriptions of each step are given below. The code of EFSDAF can be found at https://github.com/max19951001/EFSDAF.git.

Endmember Determination and Spectral Unmixing for Landsat Images at T_B
Endmember extraction and spectral unmixing of the Landsat image at T_B form the first and key step of EFSDAF, because the land surface is heterogeneous and Landsat pixels contain more than one land cover type. In this study, a globally representative spectral linear mixture model (the SVD model) [34][35][36] shared by Small was adopted, whose endmembers comprise three types: substrate (S), vegetation (V), and dark surface (D) [20]. The fully constrained least squares (FCLS) method [37], a linear method, was applied for the spectral unmixing. Each abundance A_F^B(x_ij, y_ij, m) calculated by FCLS varies from 0 to 1, and the abundances of one pixel sum to 1.

Temporal Prediction (F_TP) for No Land Cover Change from T_B to T_P

The temporal prediction of EFSDAF assumes no land cover change from T_B to T_P; in other words, the endmembers and abundances of the Landsat pixels do not change from T_B to T_P. According to the linear mixture model [38], the value of a Landsat pixel is a linear mixture of the endmember values weighted by their abundances. Hence, the values of the Landsat pixels at T_B and T_P can be expressed as:

F_B(x_ij, y_ij, b) = Σ_(m=1)^(n_m) A_F^B(x_ij, y_ij, m) × R_F^B(m, b) + ϕ  (1)

F_P(x_ij, y_ij, b) = Σ_(m=1)^(n_m) A_F^P(x_ij, y_ij, m) × R_F^P(m, b) + ϕ  (2)

where n_m is the number of endmembers and m indexes the m-th endmember of one Landsat pixel; R_F^B(m, b) and R_F^P(m, b) denote the reflectance of each endmember in every Landsat band at T_B and T_P, respectively; and ϕ is the system error. Under the above assumption, the endmembers and abundances of the Landsat pixels do not change from T_B to T_P, i.e., A_F^B = A_F^P, and ϕ is constant. Subtracting Equation (1) from Equation (2) gives:

∆F(x_ij, y_ij, b) = Σ_(m=1)^(n_m) A_F^B(x_ij, y_ij, m) × ∆R_F(m, b)  (3)

where ∆F denotes the temporal change of the Landsat image and ∆R_F the reflectance change of each endmember in every Landsat band. If ∆R_F is known, ∆F can be calculated by the linear mixture model, so the key is to solve ∆R_F. Similarly, the temporal change of the MODIS images from T_B to T_P can be expressed as:

∆C(x_i, y_i, b) = Σ_(m=1)^(n_m) A_C^B(x_i, y_i, m) × ∆R_C(m, b)  (4)

where ∆C denotes the temporal change of the MODIS images, calculated as ∆C = C_P − C_B, and ∆R_C is the reflectance change of each endmember in every MODIS band. Under the assumption of no land cover change from T_B to T_P, the types and spatial distributions of the endmembers are the same for the Landsat and MODIS images of the same region. Hence, A_C^B can be aggregated from A_F^B with:

A_C^B(x_i, y_i, m) = (1/s²) Σ_(j=1)^(s²) A_F^B(x_ij, y_ij, m)  (5)

that is, A_C^B(x_i, y_i, m) is calculated by averaging the endmember abundances of all Landsat pixels within one MODIS pixel, whereas the corresponding fraction in FSDAF is the ratio of the number of Landsat pixels of class m to s². In this study, assuming that the land cover types are the same within a small area, it follows from the first law of geography [39], which states that near things are more related than distant things, that ∆R_C(m, b) is the same within a small area [20]. FSDAF, by contrast, assumes that the temporal change of each class is the same across all MODIS pixels, which does not match the actual situation.
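As an illustration, the FCLS unmixing used in step (1) can be sketched in Python. The three-endmember spectra below are hypothetical stand-ins for the SVD endmembers (substrate, vegetation, dark), and the sum-to-one constraint is approximated with a heavily weighted extra row, a common way to implement FCLS on top of a nonnegative least squares solver:

```python
import numpy as np
from scipy.optimize import nnls

def fcls_unmix(E, x, w=1e3):
    """Fully constrained least squares: min ||E a - x|| s.t. a >= 0, sum(a) = 1.
    The sum-to-one constraint is enforced through a heavily weighted extra row."""
    bands, n_end = E.shape
    E_aug = np.vstack([E, w * np.ones(n_end)])
    x_aug = np.append(x, w)
    a, _ = nnls(E_aug, x_aug)
    return a

# Hypothetical 6-band endmember spectra (columns: substrate, vegetation, dark).
E = np.array([[0.12, 0.04, 0.01],
              [0.15, 0.06, 0.01],
              [0.20, 0.05, 0.02],
              [0.25, 0.45, 0.02],
              [0.30, 0.25, 0.01],
              [0.28, 0.12, 0.01]])

# A mixed Landsat pixel: 60% substrate, 30% vegetation, 10% dark surface.
x = E @ np.array([0.6, 0.3, 0.1])
a = fcls_unmix(E, x)
print(a)          # abundances close to [0.6, 0.3, 0.1]
print(a.sum())    # ~1.0
```

In EFSDAF this unmixing runs once per Landsat pixel at T_B to produce the abundance maps A_F^B.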
To solve ∆R_C(m, b), a sliding window of MODIS pixels is introduced to perform the spatial unmixing, and the size of the window should be larger than the number of endmembers [40]. For the MODIS pixels selected in the window, Equation (4) yields the linear system [32]:

∆C(x_1, y_1, b) = Σ_(m=1)^(n_m) A_C^B(x_1, y_1, m) × ∆R_C(m, b)
...
∆C(x_k, y_k, b) = Σ_(m=1)^(n_m) A_C^B(x_k, y_k, m) × ∆R_C(m, b)  (6)

where ∆R_C(m, b) is obtained as the least squares solution of this overdetermined system. Selecting the k purest MODIS pixels of each endmember in the sliding window for the least squares fit reduces the errors caused by collinearity [32]. The number of MODIS pixels finally used for the fit should be smaller than the size of the sliding window and larger than the number of endmembers.
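The sliding-window least squares solution for ∆R_C can be sketched as follows. The window contents and the per-endmember temporal changes are synthetic, and the purest-pixel selection is simply the k highest-abundance pixels per endmember:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 7x7 sliding window of MODIS pixels, 3 endmembers, one band.
n_pix, n_end = 49, 3
A = rng.dirichlet(np.ones(n_end), size=n_pix)   # abundances A_C^B; rows sum to 1
dRc_true = np.array([0.02, 0.15, -0.01])        # per-endmember temporal change
dC = A @ dRc_true                                # observed coarse-pixel changes (Eq. 6)

# Keep the k purest pixels per endmember to limit collinearity, then solve
# the overdetermined system for dRc by least squares.
k = 5
purest = np.unique(np.concatenate(
    [np.argsort(A[:, m])[-k:] for m in range(n_end)]))
dRc, *_ = np.linalg.lstsq(A[purest], dC[purest], rcond=None)
print(dRc)   # close to [0.02, 0.15, -0.01]
```

With noise-free synthetic data the recovery is exact; on real imagery the purest-pixel screening is what keeps the normal equations well conditioned.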

Adjustment of the Differences between the Landsat Images and MODIS Images
Theoretically, the reflectance of a MODIS pixel and that of the corresponding Landsat pixels should be the same. However, due to the physical differences of the sensors and the effects of bandwidth, weather conditions, and atmospheric correction, differences between MODIS images and Landsat images are inevitable. As in previous studies [20,41], a linear model is introduced in this study to adjust the differences between the Landsat images and the MODIS images:

F̄_B(x_i, y_i, b) = a_1(b) × C_B(x_i, y_i, b) + b_1(b)  (7)

F̄_P(x_i, y_i, b) = a_1(b) × C_P(x_i, y_i, b) + b_1(b)  (8)

where F̄_B and F̄_P are the Landsat reflectances aggregated to the MODIS resolution, and a_1(b) and b_1(b) are the slope and intercept of the linear model, respectively. From Equations (7) and (8), the relationship between ∆R_C(m, b) and ∆R_F(m, b) follows as:

∆R_F(m, b) = a_1(b) × ∆R_C(m, b)  (9)

∆R_F(m, b) can thus be obtained from Equations (5)-(9). In FSDAF, by contrast, ∆R_F(m, b) was obtained by directly assigning the temporal changes of the MODIS images to the Landsat images without considering the differences between the Landsat and MODIS images.
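A minimal sketch of this linear adjustment, assuming the model is fitted by regressing Landsat reflectance aggregated to the MODIS scale against the MODIS reflectance at the base date (all reflectances and coefficients below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical band-b reflectances: MODIS pixels, and Landsat aggregated to 500 m.
C_B = rng.uniform(0.05, 0.4, size=200)                         # MODIS at T_B
a1_true, b1_true = 1.08, -0.012                                 # sensor differences
F_bar_B = a1_true * C_B + b1_true + rng.normal(0, 1e-4, 200)    # aggregated Landsat

# Fit the linear model F_bar = a1 * C + b1 at the base date.
a1, b1 = np.polyfit(C_B, F_bar_B, 1)

# Transfer the endmember changes from the MODIS scale to the Landsat scale.
dRc = np.array([0.02, 0.15, -0.01])
dRf = a1 * dRc
print(a1, b1)   # ~1.08, ~-0.012
```

The same slope and intercept are reused at T_P, which is what lets the per-endmember changes be rescaled with the slope alone.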

Temporal Prediction for the Fine Resolution Image at T_P

∆F is linearly mixed from A_F^B and ∆R_F through Equation (3), and the final temporal prediction can be calculated by Equation (10):

F_TP(x_ij, y_ij, b) = F_B(x_ij, y_ij, b) + Σ_(m=1)^(n_m) A_F^B(x_ij, y_ij, m) × ∆R_F(m, b)  (10)

Although the final temporal prediction can be calculated by Equation (10), all of the above calculations rest on the assumption of no land cover change from T_B to T_P. Hence, the temporal prediction cannot accurately predict abrupt change events between T_B and T_P. In addition, because this prediction only uses the information of the temporal changes between T_B and T_P, we call Section 2.2.2 the temporal prediction.
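Per band, the temporal prediction reduces to one matrix-vector product over the abundance maps. A toy example with hypothetical abundances and endmember changes:

```python
import numpy as np

# Hypothetical tiny scene: 4 Landsat pixels, 3 endmembers, one band.
F_B = np.array([0.10, 0.22, 0.30, 0.18])        # Landsat reflectance at T_B
A_F = np.array([[0.7, 0.2, 0.1],                 # per-pixel endmember abundances
                [0.1, 0.8, 0.1],
                [0.5, 0.5, 0.0],
                [0.2, 0.3, 0.5]])
dRf = np.array([0.02, 0.15, -0.01])              # per-endmember change, Landsat scale

# Temporal prediction: add the abundance-weighted endmember changes to T_B.
F_TP = F_B + A_F @ dRf
print(F_TP)   # [0.143, 0.341, 0.385, 0.224]
```

Because each pixel mixes the endmember changes with its own abundances, neighboring pixels of the same class still get different predictions, which is the source of the intra-class variability EFSDAF claims over a hard classification.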

Spatial Prediction (F_SP) for Land Cover Change at T_P

Analysis and Calculation of the Residual

The temporal prediction from Section 2.2.2 is not accurate when there are land cover changes: there is a deviation between the true values F_P and the temporal prediction values F_TP. This study therefore introduces a residual R between F_P and F_TP:

F_P(x_ij, y_ij, b) = F_TP(x_ij, y_ij, b) + R(x_ij, y_ij, b)  (11)

Furthermore, the value of one MODIS pixel equals the average of all Landsat pixels within that MODIS pixel plus a system deviation ϕ [7]. Hence, the values of the MODIS pixels at T_B and T_P can be expressed as:

C_B(x_i, y_i, b) = (1/s²) Σ_(j=1)^(s²) F_B(x_ij, y_ij, b) + ϕ  (12)

C_P(x_i, y_i, b) = (1/s²) Σ_(j=1)^(s²) F_P(x_ij, y_ij, b) + ϕ  (13)

From Equations (11)-(13), the residual aggregated to one MODIS pixel is:

R(x_i, y_i, b) = ∆C(x_i, y_i, b) − (1/s²) Σ_(j=1)^(s²) ∆F(x_ij, y_ij, b)  (14)

As can be seen from Equations (11)-(14), the calculation and distribution of the residual R is a key step for obtaining the final prediction at T_P.
Spatial Prediction of the MODIS Image at T_P

As described above, the residual R originates in areas of land cover change. However, the information of land cover change can only be obtained from the MODIS image at T_P. Therefore, downscaling the MODIS image to the Landsat level is a vital step for obtaining the information of land cover change at T_P. In EFSDAF, a thin plate spline (TPS) interpolation method is applied to downscale the MODIS image at T_P. TPS is a spatial interpolation technique for point data based on spatial dependence [42], and it has been used in spatiotemporal fusion because of its high interpolation accuracy [32,43]. The interpolation result of the MODIS image at T_P, denoted F_P^SP(x_ij, y_ij, b), is another prediction of the Landsat image at T_P. Unlike FSDAF, EFSDAF also interpolates the MODIS image at T_B with the TPS method, denoted F_B^SP(x_ij, y_ij, b), to calculate the residual index (RI) in Section 2.2.4. Because the spatial information is obtained from the MODIS images with the TPS interpolator, we call Section 2.2.3 the spatial prediction.
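The TPS downscaling step can be sketched with SciPy's radial basis function interpolator and its thin plate spline kernel. The coarse band below is synthetic and the 4x upsampling factor is only illustrative:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Hypothetical 10x10 coarse (MODIS-like) band with a smooth spatial gradient.
cy, cx = np.mgrid[0:10, 0:10]
coarse = 0.1 + 0.02 * cx + 0.01 * cy
centers = np.column_stack([cy.ravel(), cx.ravel()]).astype(float)

# Fit a thin plate spline at the coarse-pixel centers.
tps = RBFInterpolator(centers, coarse.ravel(), kernel='thin_plate_spline')

# Evaluate on a 4x finer grid to obtain the spatial prediction F_SP.
fy, fx = np.mgrid[0:9.75:40j, 0:9.75:40j]
fine = tps(np.column_stack([fy.ravel(), fx.ravel()])).reshape(40, 40)
print(fine.shape)   # (40, 40)
```

In EFSDAF this interpolation is run twice, once on the MODIS image at T_B and once at T_P, so that their difference can feed the RI of Section 2.2.4.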

Residual Distribution by Using a New Residual Index (RI)
From Equation (11), the predicted Landsat-like image is obtained by adding a reasonable residual to the temporal prediction, so the residual distribution is a key step for the final prediction. In FSDAF, a homogeneity index (HI) derived from the classification map of the Landsat image at T_B was introduced to guide the residual distribution. Because the HI depends only on the classification map at T_B, it becomes unreasonable when the land cover type has changed between T_B and T_P or when the classification map is wrong [33]. Hence, a new residual index (RI) is proposed in this study to guide the residual distribution, assigning more residual to areas of land cover change. According to Section 2.2.3, the spatial variation ∆F_BP(x_ij, y_ij, b) can be expressed as:

∆F_BP(x_ij, y_ij, b) = |F_P^SP(x_ij, y_ij, b) − F_B^SP(x_ij, y_ij, b)|  (15)

where ∆F_BP(x_ij, y_ij, b) measures the change of the interpolation results of the MODIS images from T_B to T_P; the larger ∆F_BP(x_ij, y_ij, b), the larger the change of land cover type. In EFSDAF, we assume that the changes of land cover type occur in a small area. If ∆F_BP(x_ij, y_ij, b) of one Landsat pixel is greater than the average of ∆F_BP, the Landsat pixel (x_ij, y_ij) is regarded as an area of land cover change.
The RI can be calculated as follows:

RI(x_ij, y_ij) = (Σ_(k=1)^(n_w) I_k) / n_w  (16)

where I_k equals 1 when the k-th Landsat pixel within a sliding window is determined to be a pixel of land cover change, and 0 otherwise, and n_w is the number of pixels in the window. The RI is thus the ratio of the number of pixels with I_k = 1 in the window to n_w. The range of RI is 0-1, and a larger RI indicates that more residual is assigned to the area of land cover change.
The final residual is then distributed with the same weighting function as FSDAF, defined as follows:

CW(x_ij, y_ij, b) = E_he(x_ij, y_ij, b) × RI(x_ij, y_ij) + E_ho(x_ij, y_ij, b) × [1 − RI(x_ij, y_ij)]  (17)

where E_he(x_ij, y_ij, b) = R(x_i, y_i, b) is the residual for areas of land cover change, and E_ho(x_ij, y_ij, b) = F_P^SP(x_ij, y_ij, b) − F_TP(x_ij, y_ij, b) is the residual for areas of no land cover change. The final residual r(x_ij, y_ij, b) can be calculated as follows:

r(x_ij, y_ij, b) = s² × R(x_i, y_i, b) × W(x_ij, y_ij, b)  (18)

where W(x_ij, y_ij, b) is the normalized CW(x_ij, y_ij, b). According to Equations (10) and (11), the final change of the Landsat images between T_B and T_P can be obtained from Equation (19):

∆F_f(x_ij, y_ij, b) = ∆F(x_ij, y_ij, b) + r(x_ij, y_ij, b)  (19)

where ∆F_f(x_ij, y_ij, b) and ∆F(x_ij, y_ij, b) are the final change of the Landsat images and the change of the temporal prediction from T_B to T_P, respectively.
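The residual distribution for a single coarse pixel can be sketched as follows. E_he, E_ho, and RI are filled with synthetic values; the key property checked is that normalizing the combined weights within the coarse pixel conserves the coarse-scale residual when it is spread over the fine pixels:

```python
import numpy as np

rng = np.random.default_rng(3)

# One coarse pixel containing s^2 = 20*20 Landsat pixels, one band.
s2 = 400
R_coarse = 0.008                               # coarse-scale residual for this pixel
E_he = np.full(s2, R_coarse)                   # residual guess for change areas
E_ho = rng.normal(0.0, 0.002, size=s2)         # spatial-minus-temporal guess elsewhere
RI = rng.uniform(0.0, 1.0, size=s2)            # residual index per fine pixel

# Blend the two guesses with RI, then normalize so the weights sum to 1
# within the coarse pixel and the distributed residual is conserved.
CW = np.abs(E_he * RI + E_ho * (1.0 - RI))
W = CW / CW.sum()
r = s2 * R_coarse * W                           # fine-scale residual per Landsat pixel
print(r.sum() / s2)                             # equals R_coarse
```

Pixels flagged as change (large RI) lean on the coarse residual, while unchanged pixels lean on the TPS-versus-temporal difference; the normalization guarantees that aggregating the fine residuals reproduces the coarse one.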

Final Prediction of the Landsat-Like Image Using Neighborhood in a Sliding Window
Theoretically, the final prediction of the Landsat-like image at T_P can be obtained by:

F_P(x_ij, y_ij, b) = F_B(x_ij, y_ij, b) + ∆F_f(x_ij, y_ij, b)  (20)

However, uncertain errors are inevitable in EFSDAF because of the complexity of the calculation process and the pixel-by-pixel processing. In addition, the final prediction can show block effects because the residual is distributed per MODIS pixel. EFSDAF therefore adopts the same solution as STARFM, using neighborhood information within a sliding window to reduce the uncertainties and smooth the final prediction. The final predicted Landsat-like image F_fp(x_ij, y_ij, b) is calculated by Equation (21), and detailed information can be found in STARFM [7]:

F_fp(x_ij, y_ij, b) = F_B(x_ij, y_ij, b) + Σ_(k=1)^(n) w_k × ∆F_f(x_k, y_k, b)  (21)

where n is the number of similar pixels of the Landsat pixel in a sliding window, and w_k is the weight of each similar pixel.
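The neighborhood smoothing can be sketched for one center pixel. The similarity threshold and the weight form below are illustrative STARFM-style choices, not the exact formulas of [7]:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical Landsat band at T_B and the final change image dF_f.
F_B = rng.uniform(0.1, 0.3, size=(21, 21))
dF_f = rng.normal(0.05, 0.01, size=(21, 21))
ci, cj = 10, 10                                 # center pixel of the sliding window

# Similar pixels: spectrally close to the center pixel at T_B.
sim = np.abs(F_B - F_B[ci, cj]) < 0.02
ii, jj = np.nonzero(sim)

# STARFM-style weights: nearer and more spectrally similar pixels count more.
dist = 1.0 + np.hypot(ii - ci, jj - cj) / 10.0
diff = 1.0 + np.abs(F_B[ii, jj] - F_B[ci, cj])
w = 1.0 / (dist * diff)
w /= w.sum()

# Final prediction: smooth the change with neighborhood information, add F_B.
F_final = F_B[ci, cj] + np.sum(w * dF_f[ii, jj])
print(F_final)
```

Averaging the change term over similar neighbors is what suppresses the per-pixel noise and the coarse-pixel block effects in the final Landsat-like image.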

Study Area and Data
The EFSDAF is proposed to accurately monitor dynamic changes of land surface including gradual change (e.g., vegetation phenology) and abrupt change events (e.g., flood). In this study, vegetation phenology and flood events are selected to evaluate the performance of EFSDAF.
The two selected datasets were from the datasets shared by Emelyanova [10], which were collected in the Lower Gwydir Catchment in northern New South Wales, Australia (149.2815°E, 29.0855°S, hereinafter called "Gwydir") and have been widely applied in the evaluation of spatiotemporal fusion methods [10,20,21,32,44]. All images were atmospherically corrected, geographically co-registered, and resampled to 25 m with the nearest neighbor method by Emelyanova. In EFSDAF, we aggregated the MODIS pixels to 500 m resolution in the process of residual distribution. The two pairs of Landsat and MODIS images of the first dataset (vegetation phenology) were acquired on 5 July 2004 and 6 August 2004 in Gwydir, and the size of the selected images is 1400 × 1400 pixels. The first dataset covers a typical heterogeneous region, and its temporal dynamic changes were mainly caused by phenological change from 5 July to 6 August 2004.

Comparison and Evaluation of EFSDAF with STARFM, STRUM, and FSDAF
In this study, we compared EFSDAF with three other fusion methods: STARFM, STRUM, and FSDAF. These three methods were chosen because they have been widely used and because each of them needs only one pair of Landsat and MODIS images at T_B and one MODIS image at T_P as input data.
All fusion images are qualitatively and quantitatively evaluated by comparing the predicted Landsat-like images with the true images at T_P. The qualitative evaluation visually compares the predicted images with the true images, and the differences between the predicted images of the four fusion methods are magnified and highlighted with yellow elliptical circles and black squares. For the quantitative evaluation, representative assessment indexes are used to objectively compare the predicted and true images pixel by pixel: Average Difference (AD), Root Mean Square Error (RMSE), Correlation Coefficient (CC), and Structural Similarity (SSIM) [45]. The evaluation criteria are as follows: the closer AD and RMSE are to 0, the closer the predicted values are to the true values; the closer CC and SSIM are to 1, the more similar the predicted images are to the true images.
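The four assessment indexes can be computed directly with NumPy. The images below are synthetic, and the SSIM here is a simplified single-window version of the locally windowed index of [45]:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical true band and a fused band that adds small zero-mean noise.
true = rng.uniform(0.1, 0.4, size=(100, 100))
pred = true + rng.normal(0.0, 0.01, size=(100, 100))

AD = np.mean(pred - true)                               # average difference
RMSE = np.sqrt(np.mean((pred - true) ** 2))             # root mean square error
CC = np.corrcoef(pred.ravel(), true.ravel())[0, 1]      # correlation coefficient

# Global (single-window) SSIM with the usual stabilizing constants;
# L = 0.4 is the assumed reflectance dynamic range.
mu_x, mu_y = pred.mean(), true.mean()
vx, vy = pred.var(), true.var()
cov = np.mean((pred - mu_x) * (true - mu_y))
C1, C2 = (0.01 * 0.4) ** 2, (0.03 * 0.4) ** 2
SSIM = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
       ((mu_x ** 2 + mu_y ** 2 + C1) * (vx + vy + C2))

print(AD, RMSE, CC, SSIM)
```

With near-perfect synthetic predictions, AD stays near 0 while CC and SSIM stay near 1, matching the evaluation criteria stated above.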

Experiment in Gradual Change Area (Vegetation Phenology)
From Figure 4, the Landsat-like images predicted by the four methods can all monitor the changes of vegetation growth, and the results of the four fusion methods are similar to the true image on the whole. However, the yellow elliptical circles in Figure 4 show the differences between the four methods: STARFM blurs the boundaries of different objects, STRUM produces block effects, FSDAF cannot retain the spatial details, and EFSDAF is closest to the true image. In particular, Figure 5 shows that EFSDAF can accurately retain the boundaries of objects and capture more spatial details of the objects than FSDAF. The quantitative indices in Table 1 show that all four methods achieve high accuracy. For all six bands and the mean, the accuracy of EFSDAF is the highest of the four methods, with lower AD and RMSE as well as higher CC and SSIM. Taking the mean of all bands as an example, the CC values of the four methods are 0.807, 0.826, 0.858, and 0.883 from left to right, respectively. In addition, because the vegetation phenology change happened in dataset 1, the prediction accuracy of band 4 (NIR) best illustrates the performance of the four methods. The scatter plots in Figure 6 confirm that the image values synthesized by EFSDAF are closer to the true values than those of the other three methods.

Experiment in Abrupt Change Area (Flood)
From the visual comparison in Figure 7, only FSDAF and EFSDAF can capture the information of land cover change, which can be seen more clearly and accurately in the zoomed-in black area of Figure 8. In addition, the spatial details restored by EFSDAF are the most complete and most similar to the true image, as shown by the yellow elliptical circles in Figure 7. The yellow elliptical circles in Figure 8 show that the small flood change information is accurately captured by EFSDAF but missed by FSDAF. Comparing the quantitative indices in Table 2, STARFM and STRUM present similar results, and EFSDAF is better than FSDAF (CC 0.808 vs. 0.827, SSIM 0.796 vs. 0.815 for band 4). Furthermore, since water bodies show low reflectivity in band 4 (NIR), we compare the scatter plots of band 4 for the four methods in Figure 9. It can be concluded that FSDAF and EFSDAF are far superior to STARFM and STRUM, and that EFSDAF is better than FSDAF.

Discussion
The proposed EFSDAF method in this paper was an improvement on FSDAF. EFSDAF considered the mixed pixels of both Landsat images and MODIS images, and it predicted the Landsat-like images by combining spectral unmixing analysis and TPS interpolator. The experiment results showed that the fusion results by EFSDAF were better than FSDAF.

Improvements of EFSDAF Compared with FSDAF
The fusion results of EFSDAF were better than those of FSDAF for the following reasons. First, EFSDAF considered the mixed-pixel phenomenon of the fine resolution image (e.g., Landsat), which was its most significant improvement. In this study, the SVD model was introduced to perform the spectral unmixing by using a fully constrained least squares method. Moreover, the abundances of the MODIS pixels were calculated by averaging the abundances of all Landsat pixels in one MODIS pixel, and the final changes of the Landsat images were solved by a linear mixture combining the abundances of the endmembers and the variations of the endmembers for the Landsat images from T_B to T_P. Hence, EFSDAF considers intra-class variability and can preserve more spatial details than FSDAF, as can be seen in the yellow elliptical circles of Figures 4 and 7. Second, the spatial unmixing of the coarse resolution image (e.g., MODIS) was performed in a sliding window, unlike the spatial unmixing of FSDAF, which uses the whole MODIS image. Using sliding windows for spatial unmixing is more in line with the actual situation because of the heterogeneity of the land surface. The recommended size of the sliding window for spatial unmixing is 7 × 7 MODIS pixels, for which the fusion accuracy is best. Furthermore, selecting the k purest MODIS pixels of a sliding window for spatial unmixing minimized the impact of collinearity [32]. Third, the differences between the Landsat and MODIS images, including the sensor differences and the pre-processing errors, were considered. In this study, we used a linear model to adjust the differences between the Landsat and MODIS images, whereas FSDAF did not consider these differences. Finally, a new residual index (RI) was introduced to guide the residual distribution instead of the homogeneity index (HI) of FSDAF.
The HI is not suitable for guiding the residual distribution when there are land cover changes or misclassifications [33] because it is calculated from the classification map at T_B. The proposed RI adequately considers the source of the errors and is based on the differences between the interpolation results of the MODIS images at T_B and T_P. Greater differences represent more significant land cover change and yield a larger RI value. The RI is calculated in a sliding window, and the window should not be too large because the result becomes inaccurate at the edges between different land cover types. The recommended size is 51 × 51 Landsat pixels according to Section 2.2.5 of this study. More information of land cover change can be monitored by EFSDAF, as shown by the yellow elliptical circles in Figure 8e,f. In addition, EFSDAF is an automated fusion method compared to FSDAF because no additional input parameters are required.

Influence of Endmember Variability on EFSDAF
The endmember variability directly affects the abundances of the Landsat image, which may further affect the final fusion precision of EFSDAF. Hence, we evaluated the performance of EFSDAF using different endmember numbers. In this study, the SVD model shared by Small and 3-5 endmembers extracted from the Landsat image at T_B were selected to test the influence of endmember variability on EFSDAF, and the final results for the two datasets are shown in Table 3 (three bands for testing: NIR, red, green). As can be seen from Table 3, the predicted results of EFSDAF using the four different endmember sets are close for the two datasets. Therefore, the influence of endmember variability on the final fusion results is negligible, and EFSDAF is robust to endmember variability. For the automated operation of EFSDAF, the SVD model of the Landsat image was used to perform the spectral unmixing in this study.

The Effect of Input Images on the Predicted Values of EFSDAF
Like FSDAF, EFSDAF uses the minimum number of input images to predict the final Landsat-like image. Hence, the fusion accuracy of EFSDAF is directly influenced by the input images [10,19]; in particular, the spatial and temporal information of the input images is fully utilized in EFSDAF. Figure 10 shows the differences between the Landsat-like image predicted by EFSDAF and the input images in dataset 2. From Figure 10b-d, it can be seen that the land cover type has changed, and the information of land cover change can be captured by comparing Figure 10b with Figure 10c. However, there is an obvious difference between the predicted image in the yellow ellipse area of Figure 10b and the true image (Figure 10a). After comparing and analyzing the input images and the predicted image in Figure 10a,d, it can be concluded that the input MODIS image at T_P lacks the information of land cover change in the yellow ellipse of Figure 10c; therefore, the final predicted image differs from the true image at T_P in this area. However, comparing the abrupt change area of the MODIS images from T_B to T_P in Figure 10c,d, EFSDAF completely captures the corresponding spatial details in Figure 10b. In other words, EFSDAF can only accurately monitor the abrupt change information that is present in the MODIS images. Hence, the quality of the input images is crucial to the fusion accuracy of EFSDAF.

Applications of EFSDAF to Other Remote Sensing Products and Sensors
EFSDAF was tested in this study using surface reflectance values of MODIS and Landsat images, but its purpose is to study various complex surface dynamics. Therefore, EFSDAF can also be applied to vegetation monitoring (e.g., NDVI), crop growth, gradual processes such as urbanization, and sudden events such as forest fires and landslides. In addition, with the frequent launches of Earth observation satellites in recent years, sensors such as Sentinel-2 MSI (10/20/60 m) can provide the fine resolution images in place of Landsat, while Sentinel-3 OLCI (300 m) and NPP-VIIRS (375/750 m) can provide the coarse resolution images in place of MODIS. We look forward to testing EFSDAF with these applications and sensors.

Limitations of EFSDAF
EFSDAF can predict both gradual change and abrupt change events, but it cannot accurately monitor tiny changes (e.g., the small flood patches in this study) because such changes are too small to register within a single MODIS pixel; in this study, one MODIS pixel covers 20 × 20 Landsat pixels. The prediction accuracy for tiny changes could therefore be improved by blending Landsat images (30 m) with Sentinel-3 OLCI images (300 m), although the predicted images still could not capture all actual changes. In addition, EFSDAF is computationally the slowest of the four methods compared here because of the complexity of its calculation process. However, the computational cost of fusion methods will become less of a concern with the rapid development of high-performance computing platforms such as Google Earth Engine (GEE) [46].
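The scale argument above can be made concrete. Taking the stated 20 × 20 ratio at face value, one coarse pixel aggregates 400 fine pixels, whereas a 300 m OLCI pixel would aggregate only 100 Landsat pixels. A minimal sketch (the resolutions are nominal values, assuming aligned grids and an integer resolution ratio):

```python
def fine_pixels_per_coarse(fine_res_m: float, coarse_res_m: float) -> int:
    """Number of fine-resolution pixels covered by one coarse pixel,
    assuming aligned grids and an integer resolution ratio."""
    ratio = round(coarse_res_m / fine_res_m)
    return ratio * ratio

# 600 m coarse pixels (the 20 x 20 ratio stated above) vs. 300 m OLCI pixels
n_modis = fine_pixels_per_coarse(30, 600)
n_olci = fine_pixels_per_coarse(30, 300)
```

A change must occupy a meaningful fraction of those 400 (or 100) fine pixels before it perturbs the coarse-pixel average enough to be recovered, which is why a finer coarse sensor helps with tiny changes.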

Conclusions
To overcome the shortcomings of FSDAF, this study developed the Enhanced FSDAF (EFSDAF) to accurately monitor dynamic changes of land surface. EFSDAF improves on FSDAF in the following aspects: (1) unlike existing unmixing-based spatiotemporal fusion methods (e.g., STRUM, FSDAF), EFSDAF considers the mixed pixels of the fine resolution (Landsat) images, so its fusion results retain intra-class variability and accurately preserve the spatial details of the land surface; (2) the differences between the coarse resolution images and the fine resolution images are adjusted; and (3) a new residual index (RI) is proposed to guide the residual distribution, which improves the fusion accuracy in abrupt change areas. Experimental results demonstrate that: (1) EFSDAF can accurately monitor both gradual change and abrupt change events; more importantly, it preserves more spatial details of the land surface and is more robust than FSDAF; (2) EFSDAF captures more land cover change information than FSDAF because the proposed RI, which guides the residual distribution, considers the actual source of the residual; and (3) EFSDAF is an automated fusion method that requires no additional input parameters, and it therefore has great potential for monitoring long-term dynamic changes of land surface with high spatiotemporal resolution images.
We also expect that EFSDAF can be applied to other remote sensing products and other satellite sensors.