An E ﬀ ective High Spatiotemporal Resolution NDVI Fusion Model Based on Histogram Clustering

: The normalized di ﬀ erence vegetation index (NDVI) is a powerful tool for understanding past vegetation, monitoring the current state, and predicting its future. Due to technological and budget limitations, the existing global NDVI time-series data cannot simultaneously meet the needs of high spatial and temporal resolution. This study proposes a high spatiotemporal resolution NDVI fusion model based on histogram clustering (NDVI_FMHC), which uses a new spatiotemporal fusion framework to predict phenological and shape changes. Meanwhile, this model also uses four strategies to reduce error, including the construction of an overdetermined linear mixed model, multiscale prediction, residual distribution, and Gaussian ﬁltering. Five groups of real MODIS_NDVI and Landsat_NDVI datasets were used to verify the predictive performance of the NDVI_FMHC. The results indicate that NDVI_FMHC has higher accuracy and robustness in forest areas (r = 0.9488 and ADD = 0.0229) and cultivated land areas (r = 0.9493 and ADD = 0.0605), while the prediction e ﬀ ect is relatively weak in areas subject to shape changes, such as ﬂooded areas (r = 0.8450 and ADD = 0.0968), urban areas (r = 0.8855 and ADD = 0.0756), and ﬁre areas (r = 0.8417 and ADD = 0.0749). Compared with ESTARFM, NDVI_LMGM, and FSDAF, NDVI_FMHC has the highest prediction accuracy, the best spatial detail retention, and the strongest ability to capture shape changes. Therefore, the NDVI_FMHC can obtain NDVI time-series data with high spatiotemporal resolution, which can be used to realize long-term land surface dynamic process research in a complex environment. This study proposes an e ﬀ ective high spatiotemporal NDVI fusion model based on histogram feature clustering to obtain NDVI data with high spatiotemporal resolution. Compared with the three fusion models of ESTARFM, NDVI-LMGM, and FSDAF, we also have found that except for ﬁre areas, the NDVI_FMHC has the highest prediction accuracy, the best spatial detail retention, and the strongest ability to capture shape changes. The new spatiotemporal fusion framework proposed by the NDVI_FMHC can better predict phenological changes and shape changes because classiﬁcation map 2 is obtained by combining HS t 0 and HS tp 1 , which can distinguish shape changes that occurred from those that did not occur during the prediction period. Meanwhile, the NDVI_FMHC also uses four strategies to reduce the error, including the construction of the overdetermined linear mixed model, Multiscale prediction, residual distribution, and Gaussian ﬁltering.


Introduction
The normalized difference vegetation index (NDVI) is one of the commonly used indicators to detect and indicate the status and dynamics of vegetation cover. NDVI time-series data are derived from a wide range of sources, such as MODIS, AVHRR, SeaWiFS, ASTER, Landsat (TM (the Thematic Mapper), ETM + (the Enhanced Thematic Mapper Plus), OLI (the Operational Land Imager)), and sentinel-2 MSI, which have been widely investigated and applied from regional to global scale [1][2][3][4][5][6][7][8][9]. Due to technological and budget limitations, the current NDVI datasets cannot methods all adopt the window strategy to realize the prediction. Due to the grid effect of the window and the residual distribution error, the final prediction result will contain block effects, such as the ESTARFM [11]. (3) Two or more pairs of cloudless basic NDVI data are needed, such as NDVI-LMGM and STFMF. Although NDVI-LMGM can complete NDVI data fusion by reusing one pair of images, the prediction accuracy is low [31].
The majority of remote sensing data pixels are composed of several ground objects, and the spectral characteristics of most pixels are also composed of spectral values of multiple features. By consequence, the reflectance of pixels can be expressed as a function of the spectral characteristics of the endmembers and their area percentage (abundance). Assuming that the same ground object has the same spectral characteristics and linear additivity, the pixel between low-resolution and high-resolution data can be linked together with linear mixing theory [10,11]. In recent years, the vegetation index has also been used to predict the time change by linear mixing theory [18,34]. This study proposes a high spatiotemporal resolution NDVI fusion model based on histogram clustering (NDVI_FMHC) to generate high spatiotemporal resolution NDVI by using a linear mixing theory. In summary, the main contributions of this study are the following: (1) The NDVI_FMHC constructs an effective fusion framework using two kinds of classification maps which are automatically generated by histogram clustering method. Due to the classification map containing coarse-resolution information of the surface changes on the prediction date, the NDVI_FMHC forms an important contribution to improve the prediction accuracy of phenological and shape changes. (2) To reduce error, the NDVI_FMHC approach uses four strategies, namely, the construction of an overdetermined linear mixed model, multiscale prediction, residual distribution, and Gaussian filtering. (3) We have designed a friendly and concise software interface for the NDVI_FMHC, which will be shared with interested users. The NDVI_FMHC software can directly generate NDVI time-series data with a high spatial resolution by using one pair of cloudless basic NDVI data but with less computational costs and prior knowledge.

Method
Before introducing the method, a brief description of the symbols used in the model is provided to facilitate the reader's understanding (Table 1). Landsat_NDVI data at t0 LS t0 MODIS_NDVI data at t0 LS tp MODIS_NDVI data at tp HS u0 Landsat resolution data generated from MODIS_NDVI data sampling at t0 HS up Landsat resolution data generated from MODIS_NDVI data sampling at tp HS tp1 Predicted Landsat_NDVI data based on classification map 1 HS tp2 Predicted Landsat_NDVI data based on classification map 2

Spatiotemporal Fusion Framework
As the input data of the spatiotemporal fusion model, the quality of the classification map is crucial to the accuracy and robustness of the fusion results. Most models assume that the type of land cover cannot change during spatiotemporal fusion; however, this assumption is unreasonable in some situations, such as floods and fires. This study uses hierarchical clustering to provide land cover for NDVI data fusion through local histogram features and generates two kinds of classification Remote Sens. 2020, 12, 3774 4 of 21 maps, in which classification map 1 is used for predicting HS tp1 , which is directly generated from HS t0 , and classification map 2 is used for predicting HS tp2 , which is generated from HS tp1 and HS t0 . An effective high spatiotemporal NDVI data fusion framework based on histogram feature clustering is constructed using the above two classification maps.
Based on the HS t0 , LS t0 and LS tp , the NDVI_FMHC approach constructs an effective fusion framework using two kinds of classification maps to predicting high-resolution NDVI data (HS tp2 ). HS t0 , LS t0 , and LS tp all need to be preprocessed with the same geometric correction and atmospheric correction. The spatiotemporal fusion framework consists of five steps: 1 Classification map 1 is generated by hierarchical clustering according to the local histogram features based on HS t0 ; 2 Combined with classification map 1, LS t0 and LS tp are up-sampled using a linear mixing model to obtain the HS u0 and HS up with the Landsat resolution as HS t0 ; 3 Based on HS t0 , HS u0 , and HS up , using classification map 1, the first high spatial resolution NDVI data at tp (HS tp1 ) are predicted by the linear mixing model; 4 Combined with HS t0 and HS tp1 , classification map 2 is generated by hierarchical clustering according to the local histogram features; 5 Based on HS t0 , HS u0 , and HS up , using classification map 2, the second high spatial resolution NDVI data at tp (HS tp2 ) are predicted through the linear mixing model. In the above five steps, the HS t0 , HS u0 , and HS up , HS tp1 , and HS tp2 are all processed by Gaussian filtering.

Feature Description of Land Cover Classification
The NDVI_FMHC automatically generated a classification map by the hierarchical clustering method, in which a local histogram is used as the feature to classify pixels. In the first three steps of the NDVI_FMHC, due to the lack of high spatial resolution information at tp, a 3 × 3 moving window based on only HS t0 is used to build classification map 1 produced by a 16-dimensional local histogram feature. After obtaining HS tp1 , classification map 2 is generated based on HS t0 and HS tp1 , in which each moving window at the corresponding position generates a 16-dimensional local histogram, and then expands to a 32-dimensional local histogram. Therefore, classification map 2 contains both phenological information without shape change and land cover change information with the shape change at t0 and tp, which helps improve the accuracy of the fusion result.

Estimation of NDVI Change with a Linear Mixture Model
Although the variation in NDVI over time is complex, it satisfies the linear change hypothesis in the short term [18]. The NDVI increment of the coarse pixel (x 1 , y 1 ) is defined as According to linear mixing theory, the NDVI of the coarse pixel is a linear superposition of the fine pixel, from which the relationship between the short-term change of the coarse pixel and the fine pixel NDVI can be obtained, as follows: In Equation (2), l is the count of classes and ∆NDVI C (x i , y i ) is the increment of the coarse pixel NDVI from t0 to tp. It should be noted that the coarse pixel here is not the original low-resolution data pixel, but a new coarse pixel composed of a certain number of fine pixels based on HS u0 and HS up (detailed in Section 2.5). f l (x i , y i ) is the abundance of the l category in the coarse pixels (x i , y i ). When obtaining HS u0 , HS up , and HS tp1 , the abundance is calculated from classification map 1, Remote Sens. 2020, 12, 3774 5 of 21 and when predicting HS tp2 , the abundance is calculated from classification map 2. ∆NDVI F,C is the increment of the fine pixel NDVI of category l from t0 to tp. The least-squares method is used to solve Equation (2) to obtain the optimal solution of ∆NDVI F,C .

Residuals and Its Allocation Formed by Predicting Time Changes
Previous studies [24] have shown that if there is no land cover change from t0 to tp, the value of NDVI fine pixels can be accurately estimated at tp, while the prediction is less accurate where land cover change has occurred and large within-class variability exists. In addition, it is difficult to construct a positive definite linear mixing model due to the non-equilibrium of ground feature distribution. To eliminate the influence of multiple solutions under positive definite equations, a large window is usually used to construct the equations of the overdetermined linear mixing model. After the optimal solution is obtained by the least-squares method, although the residual error is very small, it is still not zero. The residual estimation method of the NDVI_FMHC for the prediction and true value of fine pixels NDVI is consistent with that of FSDAF [24], and the residual distribution method is consistent with that of IFSDAF [31,32].

Gaussian Filtering
First, the remote sensing data inevitably contains noise due to the influence of sensor material properties and the working environment. Second, while solving the linear mixed model using the least squares method, additional noise is generated. Therefore, The NDVI_FMHC uses Gaussian filtering to remove the noise of HS t0 , HS u0 , HS up , HS tp1 , and HS tp2 . The formula is as follows: Gaussian filtering is mainly affected by two parameters when removing noise, which are Gaussian kernel size and standard deviation (σ). Gaussian kernel size is a 3 × 3 fine pixel window, and σ is determined through multiple experiments. In addition, G(x i , y i ) is the NDVI(x i , y i ) value of the fine pixels at (x i , y i ) after Gaussian filtering.

Block Effect Elimination and Final Prediction
The spatiotemporal fusion methods all adopt the window strategy to realize the prediction. Due to the grid effect of the window and the residual distribution error, the final prediction result will contain block effects, similar to the ESTARFM algorithm [11]. To eliminate the block effect, the NDVI_FMHC uses multiple different numbers of fine pixels to form new coarse pixels based on HS u0 and HS up , and uses the arithmetic average of several prediction results as the final prediction result (hereafter called "multiscale prediction"). Specifically, the NDVI_FMHC introduces a new parameter (coarse pixel width) to determine the size of a new coarse pixel. If the input value of the coarse pixel width is k (k ≥ 3), the size of the new coarse pixel has three values, and the resolutions are k × 30 m, (k + 2) × 30 m, and (k + 4) × 30 m (fine pixel size is 30 m). Values above the three coarse pixel width are used to make three predictions respectively, obtaining three parallel HS tp2 , which can be marked as HS 1 tp2 , HS 2 tp2 , and HS 3 tp2 . For instance, when the value of k is 4 and the fine pixel size is 30 m, HS 1 tp2 is predicted using the new coarse pixel of 120 × 120 m according to the framework of the NDVI_FMHC (Figure 1), then HS 2 tp2 and HS 3 tp2 are predicted using the new coarse pixel of 180 × 180 m and 240 × 240 m, respectively. The final prediction result of Landsat_NDVI is as follows: Landsat_NDVI P is the final prediction result.

Test Dataset Preprocessing
To evaluate the performance of the NDVI_FMHC, five group datasets of which the highresolution (Landsat TM\ETM+\OLI) and low-resolution (MOD09GQ) datasets are used to test the prediction ability with different land cover characteristics. All the data were downloaded from the USGS Earth Explorer website (https://earthexplorer.usgs.gov/). After FLAASH atmospheric correction for Landsat reflectance data, the Landsat_NDVI is calculated based on the near-infrared and red bands with 30 m spatial resolution. After pre-processing by the MODIS Reprojection Tool Swath (MRT Swath) software for MODIS reflectance data, MODIS_NDVI is calculated based on the near-infrared band and red band with 250 m spatial resolution and one-day temporal resolution and resampled to 30 m using bilinear interpolation. Both testing dates adopted the Universal Transverse Mercator (UTM) coordinate system under the WGS-84 datum, in which the coverage area is576 km 2 from the first to fourth group dataset, while the fifth dataset is 4444.64 km 2 .

Study Area and Data
The first site is located in Saskatoon, Canada (54°N, 106°W), which has been used to test many spatiotemporal fusion methods [10,11,25,35,36]. The type of land cover in this area is mainly forest, such as spruce, pine, and aspen, supplemented by swamps and sparse vegetation patches. The test datasets were obtained on 11 July ( ) and 12 August 2001 ( 0), using MODIS_NDVI on 11 July 2001 for prediction ( Figure 2). During the prediction period, land cover change is very small, but the growing season is short, and the phenology changes significantly [10]. Therefore, this site can be considered a forest area with phenological changes.

Test Dataset Preprocessing
To evaluate the performance of the NDVI_FMHC, five group datasets of which the high-resolution (Landsat TM\ETM+\OLI) and low-resolution (MOD09GQ) datasets are used to test the prediction ability with different land cover characteristics. All the data were downloaded from the USGS Earth Explorer website (https://earthexplorer.usgs.gov/). After FLAASH atmospheric correction for Landsat reflectance data, the Landsat_NDVI is calculated based on the near-infrared and red bands with 30 m spatial resolution. After pre-processing by the MODIS Reprojection Tool Swath (MRT Swath) software for MODIS reflectance data, MODIS_NDVI is calculated based on the near-infrared band and red band with 250 m spatial resolution and one-day temporal resolution and resampled to 30 m using bilinear interpolation. Both testing dates adopted the Universal Transverse Mercator (UTM) coordinate system under the WGS-84 datum, in which the coverage area is576 km 2 from the first to fourth group dataset, while the fifth dataset is 4444.64 km 2 .

Study Area and Data
The first site is located in Saskatoon, Canada (54 • N, 106 • W), which has been used to test many spatiotemporal fusion methods [10,11,25,35,36]. The type of land cover in this area is mainly forest, such as spruce, pine, and aspen, supplemented by swamps and sparse vegetation patches. The test datasets were obtained on 11 July (tp) and 12 August 2001 (t0), using MODIS_NDVI on 11 July 2001 for prediction ( Figure 2). During the prediction period, land cover change is very small, but the growing season is short, and the phenology changes significantly [10]. Therefore, this site can be considered a forest area with phenological changes. The second site is located in southern New South Wales, Australia (29°S, 150°E), and the main type of land cover is irrigation fields, and a small number of drylands and forests are present [37]. The test datasets were obtained on 5 July ( 0) and 22 August 2004 ( ), using MODIS_NDVI on 22 August 2004 for prediction ( Figure 3). In this area, crop planting or harvesting is the main activity, leading to drastic short-term changes in non-shape land cover. Owing to the crop type differences in each patch, this site has the characteristics of high spatial heterogeneity. Therefore, this site is an example of a cultivated land area with high spatial heterogeneity and mainly occurs as phenological change and land cover change without shape changes (hereafter called "non-shape change"). The second site is located in southern New South Wales, Australia (29 • S, 150 • E), and the main type of land cover is irrigation fields, and a small number of drylands and forests are present [37]. The test datasets were obtained on 5 July (t0) and 22 August 2004 (tp), using MODIS_NDVI on 22 August 2004 for prediction ( Figure 3). In this area, crop planting or harvesting is the main activity, leading to drastic short-term changes in non-shape land cover. Owing to the crop type differences in each patch, this site has the characteristics of high spatial heterogeneity. Therefore, this site is an example of a cultivated land area with high spatial heterogeneity and mainly occurs as phenological change and land cover change without shape changes (hereafter called "non-shape change").
The third site is located in northern New South Wales, Australia (29 • S, 149 • E), which is used in many spatiotemporal fusion methods [12,27], and the main types of land covers are cultivated land, water, and forest. The test datasets were obtained on 26 November (t0) and 12 December 2004 (tp), using MODIS_NDVI on 12 December 2004 for prediction ( Figure 4). A large flood event occurred in December 2004, causing a large amount of land to be inundated, and the type of land cover changed drastically. Therefore, this site is an example of a flood area with shape changes caused by floods.

of 21
The test datasets were obtained on 5 July ( 0) and 22 August 2004 ( ), using MODIS_NDVI on 22 August 2004 for prediction ( Figure 3). In this area, crop planting or harvesting is the main activity, leading to drastic short-term changes in non-shape land cover. Owing to the crop type differences in each patch, this site has the characteristics of high spatial heterogeneity. Therefore, this site is an example of a cultivated land area with high spatial heterogeneity and mainly occurs as phenological change and land cover change without shape changes (hereafter called "non-shape change").  The third site is located in northern New South Wales, Australia (29°S, 149°E), which is used in many spatiotemporal fusion methods [12,27], and the main types of land covers are cultivated land, water, and forest. The test datasets were obtained on 26 November ( 0) and 12 December 2004 ( ), using MODIS_NDVI on 12 December 2004 for prediction ( Figure 4). A large flood event occurred in December 2004, causing a large amount of land to be inundated, and the type of land cover changed drastically. Therefore, this site is an example of a flood area with shape changes caused by floods. The fourth site is located in Shenzhen, China (22°N, 114°E), and has been used to test many fusion methods [28,38], and the types of land covers are mainly urban land, water, and forest. The test datasets were obtained on 1 November 2000 ( 0) and 7 November 2002 ( ), using MODIS_NDVI on 7 November 2002 for prediction ( Figure 5). Because Shenzhen developed rapidly, the city has expanded vigorously from 2000 to 2002, causing a large area of vegetation to be transformed into urban areas. Therefore, this site is an example of an urban area with shape changes caused by urban expansion. The fourth site is located in Shenzhen, China (22 • N, 114 • E), and has been used to test many fusion methods [28,38], and the types of land covers are mainly urban land, water, and forest. The test datasets were obtained on 1 November 2000 (t0) and 7 November 2002 (tp), using MODIS_NDVI on Remote Sens. 2020, 12, 3774 9 of 21 7 November 2002 for prediction ( Figure 5). Because Shenzhen developed rapidly, the city has expanded vigorously from 2000 to 2002, causing a large area of vegetation to be transformed into urban areas. Therefore, this site is an example of an urban area with shape changes caused by urban expansion.
Remote Sens. 2020, 12, x FOR PEER REVIEW 9 of 21 The fifth site is located in Kangaroo Island, Australia (36°S, 137°E), in which the main types of land cover are cultivated land, forest, grasslands, and water. The test datasets were acquired on 22 January 2019 ( 0) and 10 February 2020 ( ), using MODIS_NDVI on 10 February 2020 for prediction ( Figure 6). A major forest fire broke out in Kangaroo Island during the New Year of 2020, causing a large area of forest, grassland, and cultivated land to become bare soil. Therefore, this site is an example of a fire area with land cover from multi-to single type.

Quality Evaluation Index
For these five sites, ESTARFM [11], NDVI-LMGM [18], and FSDAF [24] are also applied to the same datasets for comparison, to verify that the NDVI_FMHC has a good ability to predict non-shape and shape changes. To ensure fairness, these algorithms all use the default parameters given by the authors. In this study, the Pearson correlation coefficient (r), root mean square error (RMSE), average absolute difference (AAD), average error (AE), and structure similarity (SSIM) [39] are used as quantitative measures to evaluate the spectral and structural similarity between the predicted and original NDVI. The fifth site is located in Kangaroo Island, Australia (36 • S, 137 • E), in which the main types of land cover are cultivated land, forest, grasslands, and water. The test datasets were acquired on 22 January 2019 (t0) and 10 February 2020 (tp), using MODIS_NDVI on 10 February 2020 for prediction ( Figure 6). A major forest fire broke out in Kangaroo Island during the New Year of 2020, causing a large area of forest, grassland, and cultivated land to become bare soil. Therefore, this site is an example of a fire area with land cover from multi-to single type.

Quality Evaluation Index
For these five sites, ESTARFM [11], NDVI-LMGM [18], and FSDAF [24] are also applied to the same datasets for comparison, to verify that the NDVI_FMHC has a good ability to predict non-shape and shape changes. To ensure fairness, these algorithms all use the default parameters given by the authors. In this study, the Pearson correlation coefficient (r), root mean square error (RMSE), average absolute difference (AAD), average error (AE), and structure similarity (SSIM) [39] are used as quantitative measures to evaluate the spectral and structural similarity between the predicted and original NDVI.

Phenological Changes in Forest Area
The original and predicted Landsat_NDVI of the forest study area on 11 July 2001 by four spatiotemporal fusion models are very similar, indicating that all four models can capture the phenology change and success prediction of Landsat_NDVI (Figures 7 and 8). The main reason is that the correlation and spatial structure between the above original and Landsat_NDVI on 12 August 2001 are very good with lower prediction difficulty (r = 0.9377 and SSIM = 0.9436, Table 2). By analyzing the quantitative indicators, we find that r and SSIM of the NDVI_FMHC are higher than those of ESTARFM, NDVI-LMGM, and FSDAF, and AAD (0.0229), AE (0.0000), and RMSE (0.0373) of the NDVI_FMHC are the smallest (Table 2). In addition, the scatter of all four models is close to the 1:1 line; however, the scatterplots of FSDAF and NDVI_FMHC are more concentrated than those of ESTARFM and NDVI-LMGM ( Figure S1). Therefore, in the case of only phenological changes in

Phenological Changes in Forest Area
The original and predicted Landsat_NDVI of the forest study area on 11 July 2001 by four spatiotemporal fusion models are very similar, indicating that all four models can capture the phenology change and success prediction of Landsat_NDVI (Figures 7 and 8). The main reason is that the correlation and spatial structure between the above original and Landsat_NDVI on 12 August 2001 are very good with lower prediction difficulty (r = 0.9377 and SSIM = 0.9436, Table 2). By analyzing the quantitative indicators, we find that r and SSIM of the NDVI_FMHC are higher than those of ESTARFM, NDVI-LMGM, and FSDAF, and AAD (0.0229), AE (0.0000), and RMSE (0.0373) of the NDVI_FMHC are the smallest (Table 2). In addition, the scatter of all four models is close to the 1:1 line; however, the scatterplots of FSDAF and NDVI_FMHC are more concentrated than those of ESTARFM and NDVI-LMGM ( Figure S1). Therefore, in the case of only phenological changes in the forest area, the FSDAF and NDVI_FMHC have an effective prediction, in which the NDVI_FMHC has the least difference and the best correlation between original and predicted Landsat_NDVI in the forest area.
Remote Sens. 2020, 12, x FOR PEER REVIEW 11 of 21 the forest area, the FSDAF and NDVI_FMHC have an effective prediction, in which the NDVI_FMHC has the least difference and the best correlation between original and predicted Landsat_NDVI in the forest area.

Non-Shape Change in Cultivated Land Area
For the non-shape change, the prediction effect of the NDVI_FMHC is better than that of the other three models (Figure 9). Four models can capture and successfully predict the change in land cover over a large area (elliptical region), however, the prediction effect is poor in small areas (rectangular region), especially ESTARFM ( Figure 10). Due to the low correlation and spatial structural similarity between original Landsat_NDVI on 5 July 2004 and Landsat_NDVI on 22 August 2004 (r = 0.8240 and SSIM = 0.8123, Table 3), it is difficult to predict the change in cultivated land areas, especially for the small area. Compared with the NDVI_FMHC, the values of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.14% to 11.61%, 4.41% to 22.18%, and 4.26% to 17.91%, while r and SSIM are lower, 0.42% to 2.08% and 0.36% to 2.05%, respectively (Table  3). In addition, all scatter of FSDAF and NDVI_FMHC is closer to the 1:1 line than those of ESTARFM and NDVI-LMGM, indicating that the predicted values of FSDAF and NDVI_FMHC are closer to the actual value ( Figure S2). Thus, compared with the ESTARFM, NDVI-LMGM, and FSDAF, the NDVI_FMHC has the best spatial correlation and structural similarity with the original Landsat_NDVI and the smallest prediction error in cultivated land area. the forest area, the FSDAF and NDVI_FMHC have an effective prediction, in which the NDVI_FMHC has the least difference and the best correlation between original and predicted Landsat_NDVI in the forest area.

Non-Shape Change in Cultivated Land Area
For the non-shape change, the prediction effect of the NDVI_FMHC is better than that of the other three models (Figure 9). Four models can capture and successfully predict the change in land cover over a large area (elliptical region), however, the prediction effect is poor in small areas (rectangular region), especially ESTARFM ( Figure 10). Due to the low correlation and spatial structural similarity between original Landsat_NDVI on 5 July 2004 and Landsat_NDVI on 22 August 2004 (r = 0.8240 and SSIM = 0.8123, Table 3), it is difficult to predict the change in cultivated land areas, especially for the small area. Compared with the NDVI_FMHC, the values of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.14% to 11.61%, 4.41% to 22.18%, and 4.26% to 17.91%, while r and SSIM are lower, 0.42% to 2.08% and 0.36% to 2.05%, respectively (Table  3). In addition, all scatter of FSDAF and NDVI_FMHC is closer to the 1:1 line than those of ESTARFM and NDVI-LMGM, indicating that the predicted values of FSDAF and NDVI_FMHC are closer to the actual value ( Figure S2). Thus, compared with the ESTARFM, NDVI-LMGM, and FSDAF, the NDVI_FMHC has the best spatial correlation and structural similarity with the original Landsat_NDVI and the smallest prediction error in cultivated land area.

Non-Shape Change in Cultivated Land Area
For the non-shape change, the prediction effect of the NDVI_FMHC is better than that of the other three models (Figure 9). Four models can capture and successfully predict the change in land cover over a large area (elliptical region), however, the prediction effect is poor in small areas (rectangular region), especially ESTARFM ( Figure 10). Due to the low correlation and spatial structural similarity between original Landsat_NDVI on 5 July 2004 and Landsat_NDVI on 22 August 2004 (r = 0.8240 and SSIM = 0.8123, Table 3), it is difficult to predict the change in cultivated land areas, especially for the small area. Compared with the NDVI_FMHC, the values of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.14% to 11.61%, 4.41% to 22.18%, and 4.26% to 17.91%, while r and SSIM are lower, 0.42% to 2.08% and 0.36% to 2.05%, respectively (Table 3). In addition, all scatter of FSDAF and NDVI_FMHC is closer to the 1:1 line than those of ESTARFM and NDVI-LMGM, indicating that the predicted values of FSDAF and NDVI_FMHC are closer to the actual value ( Figure S2). Thus, compared with the ESTARFM, NDVI-LMGM, and FSDAF, the NDVI_FMHC has the best spatial correlation and structural similarity with the original Landsat_NDVI and the smallest prediction error in cultivated land area.

Shape Change in the Flood Area
The prediction results of the four models are not ideal, and it is difficult to accurately capture the shape changes of flood action (Figures 11 and 12). The main reasons are as follows: first, the correlation and spatial structural similarity between Landsat_NDVI on 26 November and 12 December 2004 are very low with a large shape change range formed by the flood (r = 0.6625 and SSIM = 0.5643, Table 4); second, MODIS and Landsat images were collected at an interval of approximately 0.5 h, and the flood occurred very quickly, leading to different inundation ranges between MODIS_NDVI and Landsat_NDVI on 12 December 2004 (Figure 4b,d); third, NDVI is very sensitive to surface vegetation coverage and less sensitive to the depth of water and turbidity. Overall, the NDVI_FMHC is superior to the other three models because ESTARFM and NDVI-LMGM have obvious block effects, FSDAF has poor continuity, and ESTARFM has the largest difference with the actual flood range (Figure 12). Compared with NDVI_FMHC, AAD, and RMSE of NDVI_FMHC are lower than those for ESTARFM, NDVI-LMGM, and FSDAF, while r and SSIM are higher (Table 4). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, and ESTARFM has the largest dispersion and the lowest prediction accuracy ( Figure S3). Thus, although the four models are not ideal in the flood area, the Landsat_NDVI predicted by the NDVI_FMHC is still better than that by ESTARFM and FSDAF, while better than NDVI_LMGM by some but not all measures.

Shape Change in the Flood Area
The prediction results of the four models are not ideal, and it is difficult to accurately capture the shape changes of flood action (Figures 11 and 12). The main reasons are as follows: first, the correlation and spatial structural similarity between Landsat_NDVI on 26 November and 12 December 2004 are very low with a large shape change range formed by the flood (r = 0.6625 and SSIM = 0.5643, Table 4); second, MODIS and Landsat images were collected at an interval of approximately 0.5 h, and the flood occurred very quickly, leading to different inundation ranges between MODIS_NDVI and Landsat_NDVI on 12 December 2004 (Figure 4b,d); third, NDVI is very sensitive to surface vegetation coverage and less sensitive to the depth of water and turbidity. Overall, the NDVI_FMHC is superior to the other three models because ESTARFM and NDVI-LMGM have obvious block effects, FSDAF has poor continuity, and ESTARFM has the largest difference with the actual flood range (Figure 12). Compared with NDVI_FMHC, AAD, and RMSE of NDVI_FMHC are lower than those for ESTARFM, NDVI-LMGM, and FSDAF, while r and SSIM are higher (Table 4). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, and ESTARFM has the largest dispersion and the lowest prediction accuracy ( Figure S3). Thus, although the four models are not ideal in the flood area, the Landsat_NDVI predicted by the NDVI_FMHC is still better than that by ESTARFM and FSDAF, while better than NDVI_LMGM by some but not all measures.

Shape Change in the Flood Area
The prediction results of the four models are not ideal, and it is difficult to accurately capture the shape changes of flood action (Figures 11 and 12). The main reasons are as follows: first, the correlation and spatial structural similarity between Landsat_NDVI on 26 November and 12 December 2004 are very low with a large shape change range formed by the flood (r = 0.6625 and SSIM = 0.5643, Table 4); second, MODIS and Landsat images were collected at an interval of approximately 0.5 h, and the flood occurred very quickly, leading to different inundation ranges between MODIS_NDVI and Landsat_NDVI on 12 December 2004 (Figure 4b,d); third, NDVI is very sensitive to surface vegetation coverage and less sensitive to the depth of water and turbidity. Overall, the NDVI_FMHC is superior to the other three models because ESTARFM and NDVI-LMGM have obvious block effects, FSDAF has poor continuity, and ESTARFM has the largest difference with the actual flood range ( Figure 12). Compared with NDVI_FMHC, AAD, and RMSE of NDVI_FMHC are lower than those for ESTARFM, NDVI-LMGM, and FSDAF, while r and SSIM are higher (Table 4). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, and ESTARFM has the largest dispersion and the lowest prediction accuracy ( Figure S3). Thus, although the four models are not ideal in the flood area, the Landsat_NDVI predicted by the NDVI_FMHC is still better than that by ESTARFM and FSDAF, while better than NDVI_LMGM by some but not all measures.

Shape Change in Urban Area
The prediction result of the NDVI_FMHC is the closest to the original Landsat_NDVI, followed by NDVI-LMGM and FSDAF, and ESTARFM is the worst (Figure 13). Meanwhile, the four models can only capture large-scale shape changes (small rectangular region) but have difficulty predicting small-scale shape changes (elliptical region) because the partial area produced by urbanization is too small and not recorded in the predicted MODIS_NDVI ( Figure 14). Compared with the NDVI_FMHC, the value of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.50% to 7.38%, 0.56% to 13.84%, and 5.35% to 10.19%, and r and SSIM are lower 1.53% to 2.99% and 1.37% to 2.89%, respectively (Table 5). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, indicating that the predicted value of this model is closest to the actual value ( Figure S4).

Shape Change in Urban Area
The prediction result of the NDVI_FMHC is the closest to the original Landsat_NDVI, followed by NDVI-LMGM and FSDAF, and ESTARFM is the worst (Figure 13). Meanwhile, the four models can only capture large-scale shape changes (small rectangular region) but have difficulty predicting small-scale shape changes (elliptical region) because the partial area produced by urbanization is too small and not recorded in the predicted MODIS_NDVI ( Figure 14). Compared with the NDVI_FMHC, the value of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.50% to 7.38%, 0.56% to 13.84%, and 5.35% to 10.19%, and r and SSIM are lower 1.53% to 2.99% and 1.37% to 2.89%, respectively (Table 5). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, indicating that the predicted value of this model is closest to the actual value ( Figure S4).

Shape Change in Urban Area
The prediction result of the NDVI_FMHC is the closest to the original Landsat_NDVI, followed by NDVI-LMGM and FSDAF, and ESTARFM is the worst (Figure 13). Meanwhile, the four models can only capture large-scale shape changes (small rectangular region) but have difficulty predicting small-scale shape changes (elliptical region) because the partial area produced by urbanization is too small and not recorded in the predicted MODIS_NDVI ( Figure 14). Compared with the NDVI_FMHC, the value of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.50% to 7.38%, 0.56% to 13.84%, and 5.35% to 10.19%, and r and SSIM are lower 1.53% to 2.99% and 1.37% to 2.89%, respectively (Table 5). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, indicating that the predicted value of this model is closest to the actual value ( Figure S4).

Shape Change in Urban Area
The prediction result of the NDVI_FMHC is the closest to the original Landsat_NDVI, followed by NDVI-LMGM and FSDAF, and ESTARFM is the worst (Figure 13). Meanwhile, the four models can only capture large-scale shape changes (small rectangular region) but have difficulty predicting small-scale shape changes (elliptical region) because the partial area produced by urbanization is too small and not recorded in the predicted MODIS_NDVI ( Figure 14). Compared with the NDVI_FMHC, the value of AAD, AE, and RMSE of ESTARFM, NDVI-LMGM, and FSDAF are higher 3.50% to 7.38%, 0.56% to 13.84%, and 5.35% to 10.19%, and r and SSIM are lower 1.53% to 2.99% and 1.37% to 2.89%, respectively (Table 5). In addition, all scatter of the NDVI_FMHC is closest to the 1:1 line, indicating that the predicted value of this model is closest to the actual value ( Figure S4).

Shape Change in the Fire Area
Four models can capture shape changes formed by the fire and successfully predict them, while the texture of predicted and original Landsat_NDVI are not consistent from the inset map, and the prediction effect is not good (Figure 15). First, the correlation and spatial structural similarity between original Landsat_NDVI on 19 January 2019 and 10 February 2020 are extremely low (r = 0.2792 and SSIM = 0.2812), indicating that a large area of forest, grassland, and cultivated land become bare soil due to the fire. Second, the predicted Landsat_NDVI still retains the texture information of basic Landsat_NDVI, leading to poor local effect. The five indicators of ESTARFM, FSDAF, and NDVI_FMHC are basically consistent; however, NDVI-LMGM has lower r (0.8333) and SSIM (0.8274) and larger AAD (0.0780) and RMSE (0.1092), indicating that ESTARFM, FSDAF, and NDVI_FMHC have basically similar effects in predicting shape change of the fire area (Table 6). In addition, the scatter plots of ESTARFM, FSDAF, and NDVI_FMHC are basically consistent, and the prediction results are similar (Supplemental Figure S5). Therefore, for the multiple types of land cover converted into a single type due to the fire, although four models can capture the overall shape change, the local prediction effect is not ideal because the prediction results still retain the texture information of the original Landsat_NDVI.

Shape Change in the Fire Area
Four models can capture shape changes formed by the fire and successfully predict them, while the texture of predicted and original Landsat_NDVI are not consistent from the inset map, and the prediction effect is not good (Figure 15). First, the correlation and spatial structural similarity between original Landsat_NDVI on 19 January 2019 and 10 February 2020 are extremely low (r = 0.2792 and SSIM = 0.2812), indicating that a large area of forest, grassland, and cultivated land become bare soil due to the fire. Second, the predicted Landsat_NDVI still retains the texture information of basic Landsat_NDVI, leading to poor local effect. The five indicators of ESTARFM, FSDAF, and NDVI_FMHC are basically consistent; however, NDVI-LMGM has lower r (0.8333) and SSIM (0.8274) and larger AAD (0.0780) and RMSE (0.1092), indicating that ESTARFM, FSDAF, and NDVI_FMHC have basically similar effects in predicting shape change of the fire area (Table 6). In addition, the scatter plots of ESTARFM, FSDAF, and NDVI_FMHC are basically consistent, and the prediction results are similar (Supplemental Figure S5). Therefore, for the multiple types of land cover converted into a single type due to the fire, although four models can capture the overall shape change, the local prediction effect is not ideal because the prediction results still retain the texture information of the original Landsat_NDVI.

Discussion
The study integrates multi-source remote sensing data (MODIS_NDVI and Landsat_NDVI) to obtain NDVI time-series data with high spatiotemporal resolution and study the long-term dynamic processes of the surface environment. In general, the most common natural changes on the land surface are phenological changes, simultaneously accompanied by dramatic land cover changes. Due to water levels rising or falling, fire hazards, floods, urban expansion, and deforestation, shape changes of land cover also frequently occur. During spatiotemporal fusion, the prediction of phenological changes is relatively easy, while the prediction of these shape changes is extremely difficult. Therefore, it is of great significance to understand the accurate, automatic, and robust prediction of complex shape changes in various landscapes for comprehensive monitoring of the surface dynamics.

Prediction Advantages of the NDVI_FMHC
The changing pattern of NDVI is determined by biotic factors (e.g., vegetation type) and environmental factors (soil, temperature, and precipitation). However, environmental factors play a more significant role than biotic factors, and in a small area, NDVI change can be assumed to be similar within the same type of land cover. Therefore, NDVI-LMGM assumes that within a short period, NDVI values of adjacent pixels of the same land cover exhibit the same linear changes [18]. If only phenological changes occurred, this assumption would be reasonable, but the prediction results will have a large error when shape changes and existing variations occur within the class. The FSDAF [24] has designed a new weighting function reallocating residuals to areas with shape changes reducing errors. The NDVI_FMHC proposes a new spatiotemporal fusion framework to solve this problem based on classification map 1 and classification map 2, in which classification map 1 is obtained by HS t0 and classification map 2 is obtained by combining HS t0 and HS tp1 . Compared with classification map 1, classification map 2 further refines the land cover types and its changing characteristics, which can distinguish shape changes that did and did not occur during the prediction period. Therefore, the abundance calculated by classification map 2 is more reasonable, and the time variation of the terminal element solved by the linear mixture model has better robustness and higher accuracy.
It is necessary to select a suitable classification method to classify and obtain an accurate classification map, which is the basis of the spatiotemporal fusion model. Currently, some spatiotemporal fusion models select similar pixels based on the threshold of spectral distance to obtain classification maps, such as STARFM and ESTARFM [11,24]. Because the threshold is empirical, it is difficult to accurately capture the boundaries of different land cover types owing to different objects with the same spectrum, especially NDVI. There are also some spatiotemporal fusion models using ISODATA [18,24] or the K-Means method [2] to perform unsupervised classification to obtain classification maps. However, the ISODATA method needs many parameters, and the input value is not easy to determine, and K-Means is easily affected by the initial value. Hierarchical clustering is also widely used in image segmentation and classification owing to ease of use and defining similarity rules [40,41], and the NDVI_FMHC selects this method. There are many clustering methods for image classification, each of which has its own advantages and disadvantages. We have designed two features for applying clustering classification methods, including a 16-dimensional local histogram based on HS t0 , and a 32-dimensional local histogram expanded by two 16-dimensional local histograms based on HS t0 and HS tp1 , which not only contains the information of land cover but also reflects the change at two moments.
Due to the grid effect of the window and error of residual distribution, the regional errors of shape changes and variations within the class are large, leading to the block effect of the prediction image. Therefore, it is very important to reduce the error between the true and predicted value for various spatiotemporal fusion models. The NDVI_FMHC uses four strategies to reduce error. First, the overdetermined equations of the linear mixed model are constructed with a large window and the optimal solution is obtained by the least-squares method which can reduce errors. Second, three times parallel HS tp2 prediction results are obtained to calculate the arithmetic average. It should be noted that the multiscale prediction is to fuse high and low frequency information to reduce the residual between the prediction and true value. Third, local variability of temporal change caused by land cover conversions or within-class differences is modelled well through the distribution of residuals. Fourth, Gaussian filter is applied to suppress the noise of HS t0 , HS u0 , HS up , HS tp1 , and HS tp2 . Although the filtering makes the prediction image much blurrier and thus reduces the actual spatial resolution, a smaller standard deviation of Gaussian filter has little influence on the blurriness and can reduce the singular points of the predicted result.
Low computational efficiency is a key factor restricting the widespread application of spatiotemporal fusion models [12]. In terms of the tested fire area (4444.64 km 2 ), NDVI_FMHC and NDVI-LMGM only require 4.10 min and 2.58 min, while ESTARFM and FSDAF require 105.37 min and 111.52 min. Due to multiple parallel processing of NDVI_FMHC, the computational efficiency is lower than that of NDVI-LMGM. However, the calculation efficiency of NDVI_FMHC is much higher than that of ESTARFM and FSDAF for two main reasons, including the use of advanced programming strategies, such as parallel computing, and no need to use ENVI and ArcMap software, reducing the function module calling time in the prediction process.

Optimal Parameters of the NDVI_FMHC
To obtain the best prediction effects and spend the least time, a large number of trial and error experiments were performed for the first four groups of test datasets, and the optimal parameters of the NDVI_FMHC were determined involving four main parameters: count of classes, distance calculation method, coarse pixel width, and Gaussian filter parameters. The optimal parameters of the NDVI_FMHC are based on MODIS and Landsat data while might need to be adjusted by future users to other satellite data spatiotemporal fusion.
The count of classes of hierarchical clustering based on histogram features is determined by users according to experience, and it is valuable to test the influence of count of classes on prediction accuracy. Supplemental Table S1 is the quantitative index evaluation table from 2 to 10 count of classes. In terms of trends, AAD, AE, and RMSE of cultivated land and flood areas decrease rapidly first with the increase in count of classes, and then tend to stabilise, while the changes in r and SSIM are opposite; AAD, AE, and RMSE of forests and urban areas increase slowly at first with the increase in count of classes, and then increase rapidly, while the changes in r and SSIM are opposite. In comparison, when the count of classes is four, the prediction effects of the NDVI_FMHC are relatively satisfactory.
When NDVI is classified by hierarchical clustering based on histogram features, the relationship between feature points in feature space is measured by pairwise distances. Therefore, the definition of distance measurement is of great significance to the study of the structure of feature space. In this study, we have analyzed the performance of six different distance measures including KL, L 1 , L 2 , L ∞ , ∩, and χ 2 [42]. By analyzing the quantitative index evaluation tables of six different distance measures (Table S2), for different distance measures, the difference in prediction results of the NDVI_FMHC is very small, indicating that the different distance measures have little effect on the prediction performance of the new spatiotemporal fusion framework. In addition, because L 1 and χ 2 have the smallest AAD, AE, and RMSE, and the largest r and SSIM, it is recommended that L 1 and χ 2 are the formulas of distance measurement for the NDVI_FMHC.
Coarse pixel width is a parameter to determine the size and change in coarse pixel width, which is also one of the important parameters of the NDVI_FMHC. Supplemental Table S3 is the quantitative index evaluation table under the width from 3 to 10 coarse pixels. In terms of trends, AAD, AE, and RMSE of the four test areas slowly decrease with the increase in the coarse pixel width, and then increase rapidly, while the changes in r and SSIM are opposite. The coarse pixel width may introduce a large error to the prediction result, leading to the decrease in the prediction effect, and the recommended coarse pixel width is 3 or 4.
Gaussian filtering is a kind of linear smoothing filter, and the smoothing degree depends on the standard deviation. The larger the standard deviation, the more dispersed the distribution, and the filter effect will be closer to the mean filter; the smaller the standard deviation, the more concentrated the distribution, and the filter effect will be weaker. Supplemental Table S4 is the quantitative index evaluation table under the change in standard deviation from 0.1 to 1. In terms of trends, AAD, AE, and RMSE of the four test areas first decrease with the standard deviation, then increase, and reach the minimum value at 0.5 or 0.6, while the change rule of r and SSIM is opposite, which means that the standard deviation is too small to eliminate the noise mixed with the prediction results, while the standard deviation is too large, some of the accurate values in the prediction results will also be eliminated as noise. Therefore, the recommended standard deviations are 0.5 and 0.6.

Limitations of the NDVI_FMHC
The NDVI_FMHC shows good robustness and prediction accuracy in predicting phenological changes and shape changes; however, the following aspects still need to be improved. First, in urban expansion, shape changes are generally large in number and small in size, and the low-resolution NDVI of the predicted date and a pair of high and low-resolution NDVI of the basic date did not record these changes, increasing the prediction difficult. Second, because MODIS and Landsat images were collected at an interval of approximately 0.5 h and the flood occurred very quickly, the range of shape changes recorded by two kinds of NDVI is not consistent, causing unsatisfactory fusion results. Third, for the multiple types of land cover converted into a single type due to the fire, although current fusion models can capture the overall shape change, the local prediction effect is not ideal because the prediction results still retain the texture information of the basic Landsat_NDVI. If we can propose a better new framework or add auxiliary data (other high-resolution images), the prediction accuracy of the above complex shape changes may be improved.