A Robust Method for Generating High-Spatiotemporal-Resolution Surface Reflectance by Fusing MODIS and Landsat Data

: The methods for accurately fusing medium- and high-spatial-resolution satellite reflectance are vital for monitoring vegetation biomass, agricultural irrigation, ecological processes and climate change. However, the currently existing fusion methods cannot accurately capture the temporal variation in reflectance for heterogeneous landscapes. In this study, we proposed a new method, the spatial and temporal reflectance fusion method based on the unmixing theory and a fuzzy C-clustering model (FCMSTRFM), to generate Landsat-like time-series surface reflectance. Unlike other data fusion models, the FCMSTRFM improved the similarity of pixels grouped together by combining land cover maps and time-series data cluster algorithms to define endmembers. The proposed method was tested over a 2000 km 2 study area in Heilongjiang Provence, China, in 2017 and 2018 using ten images. The results show that the accuracy of the FCMSTRFM is better than that of the popular enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM) (correlation coefficient ( R ): 0.8413 vs. 0.7589; root mean square error (RMSE): 0.0267 vs. 0.0401) and the spatial-temporal data fusion approach (STDFA) (R: 0.8413 vs. 0.7666; RMSE: 0.0267 vs. 0.0307). Importantly, the FCMSTRFM was able to maintain the details of temporal variations in complicated landscapes. The proposed method provides an alternative method to monitor the dynamics of land surface variables over complicated heterogeneous regions.


Introduction
The surface reflectance, which characterizes the ability of the land surface to reflect solar and sky radiation pertaining to vegetation cover, soil moisture and surface roughness [1][2][3], is a valuable indicator for recognizing ground objects [4,5] and retrieving biophysical variables, e.g., vegetation indices (VIs) [6,7], leaf area index (LAI) [8,9] and biomass [10,11]. The requirement for reflectance data with both high spatial and temporal resolution is increasingly important to simulate the surface energy budget [12][13][14] and monitor ecosystem and hydrologic dynamics [15][16][17][18] at regional and global scales. In satellite design, however, a trade-off must be made between temporal and spatial resolutions [19,20]. As a result, it is difficult to acquire remotely sensed data with both frequent coverage and high spatial resolution [21][22][23].
One technique for increasing the spatial resolution of frequent coverage satellite observations is the blending of images from sensors with complementary temporal and spatial characteristics with the aim of generating synthetic images with both high temporal and spatial resolutions [24][25][26][27]. Recently, several spatial and temporal data fusion methods (STFMs) have been developed to generate images with both high spatial and high temporal resolution. These methods can be classified into two categories: methods based on the spatial and temporal adaptive reflectance fusion model (STARFM) (hereafter referred to as STARFM-based models) and methods based on unmixing theory (hereafter referred to as UMX-based models). STARFM-based models are based on the assumption that the ratio of coarse pixel reflectance to neighboring similar pixels does not change over time [22,25,28]. These methods are particularly useful for preserving spatial detail information [20,28]. Previous scientists have substantially improved the STARFM [29][30][31][32] but have not addressed the following weaknesses: (i) the assumption of STARFM-based models is violated in complex heterogeneous regions due to the temporal variability of surface reflectance [33]; (ii) the models cannot predict short time change events if the changes are not recorded in at least one of the base fine-resolution images [25,28,29]; and (iii) land cover change may lead to low accuracy in STARFM-based methods [25,28,33,34].
For UMX-based methods, the abundances and endmembers are obtained by grouping highresolution pixels, and the average reflectance of endmembers is unmixed from coarse-resolution images based on the assumption that the reflectance of each coarse spatial-resolution pixel is a linear combination of the responses of each endmember contributing to the mixture [19,33,35,36]. Generally, the accurate prediction of surface reflectance is mainly limited by grouping similar pixels and calculating the average reflectance [37,38]. The pixels grouped together should have similar spatial and temporal variability [39,40].
The main advantage of UMX-based methods is that, unlike STARFM-based methods, they define endmembers by clustering similar pixels, and the average reflectance of endmembers is unmixed from the coarse-spatial-resolution images. This allows for three additional possibilities. First, multiphase images can be used to group similar pixels. Second, auxiliary datasets such as land cover maps may supplement the grouping of similar pixels [41]. Third, frequent coverage coarseresolution images can be used to capture phenological variations (i.e., NDVI profiles) [20,34]. However, most UMX-based methods employ only one or two (generally at the beginning and end of the observation period) high-resolution images to cluster similar pixels [33,35,38], which may fail to capture short temporal interval events that are not recorded in base high-resolution images.
To overcome the limitations of UMX-based models and to obtain the accurate prediction of reflectance, even in areas with large temporal and spatial variance, in this study, we have proposed a new data fusion method based on unmixing theory, as well as the spatial and temporal reflectance fusion model based on the fuzzy C-clustering model (FCMSTRFM). The FCMSTRFM improves the similarity of pixels that are grouped together by combining land cover maps and multiphase images to record temporal and spatial variation information from all available high-resolution images, and a new strategy was developed to calculate the average reflectance of each endmember. To evaluate the performances of the FCMSTRFM, we validated the FCMSTRFM with ten Landsat 8 Operational Land Imager (OLI) and MODIS datasets in the HePing Irrigated Area (HPIA), and we also compared our method with two other methods (the spatial-temporal data fusion approach (STDFA) and the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM). The paper is structured as follows. Section 2 describes the FCMSTRFM method. Section 3 provides experimental data and data processing. Section 4 lists the fused reflectance results derived. Section 5 presents a discussion, and Section 6 explains our conclusion. Figure 1 shows the flowchart of the processing steps implemented in the FCMSTRFM. The algorithm requires multiple pairs of images (a pair of images means fine-and coarse-resolution images obtained on the same date), time-series coarse-resolution images for the expected prediction dates and land cover mapping. Images of all input models are required to be radiometrically calibrated, atmospherically corrected and subjected to image registration. There are four main steps of the FCMSTRFM: (1) class and subclass definition; (2) sensor-bias adjustment; (3) class and subclass average reflectance calculation; and (4) pixel reflectance calculation.

Class and Subclass Definition
To reduce the loss of temporal variability information, a method called pixel hierarchical clustering (PHC) is proposed for clustering pixels that have similar characteristics. PHC employs all available high-resolution images and land cover mapping to cluster similar pixels to ensure that the pixels clustered together have similar temporal variability. There are two steps in the PHC method: time-series Landsat images are clustered by land cover mapping to define the land cover class first, then each land cover class is further subclassified by the FCM to define the subclass.
Land cover data are mapped by dividing pixels in scene into several categories based on the pixels' reflectance, vegetation index and spectral characteristic [42][43][44], which is an effective classification of land surface characteristics. The pixels divided into one land cover type are similar in reflectance in some ways. For example, the vegetation spectral curve has specific peaks and valleys, and the soil reflectance increases with increasing wavelength. Therefore, we can use land cover mapping to coarsely cluster similarity pixels. However, it is possible that differences in the composition of land cover types (each land cover type contains multiple ground objects, e.g., the different crops grown on farmland, impermeable layers containing roads and buildings) or environmental factors (e.g., soil type, altitude) can cause inconsistent temporal variability of pixels within the same land cover class. To eliminate the inconsistency, we further subclassify land cover classes.
Time-series data clustering algorithms (TSDCAs) cluster time-series data by minimizing the dissimilarity of time-series samples in the same cluster while maximizing the dissimilarity of different clusters [45][46][47]. Since the FCM, a TSDCA, has been shown to be effective in time-series data clusters [48][49][50], we employ the FCM to subclassify land cover classes. The FCM clustered the timeseries dataset X={Xdi,di=2,…,n} into c cluster centers V={Vi,i=1,…,c}, c∈{2,3,…k,k<n-1} by maximizing the following membership function: where Uij is the membership of Xdi in cluster centers k, c is the number of cluster centers, w is the weighted index, dij is the Euclidean distance between data point i and the cluster center j, and cij is the total membership of each data point to the cluster center j. In this study, the FCM was used to subclassify each land cover type into 10 subclasses (the standard deviation of reflectance of subclass should less than 10% of the whole band), with a weighted index of 2.5 and termination error of 1 × 10 −4 .

Sensor-bias Adjustment
Because of the differences in sensor systems, such as bidirectional reflectance distribution function (BRDF) effects, acquisition date, bandwidth and spectral response functions, there is a systematic reflectance bias between fine-and coarse-resolution images-that is, for a pure, homogeneous pixel covered by only one object, there is a bias in the reflectance among different sensors. To minimize the sensor bias, the MODIS pixel reflectance RM(x,y,ti,B) (for ease of distinction, in this study, the subscripts M and L represent MODIS and Landsat images, respectively) must be adjusted to Landsat pixel reflectance RL(x,y,ti,B) before blending them.
For a pure pixel, the difference in reflectance between different sensors only results from the sensor bias. Therefore, the relationship between MODIS and Landsat reflectance can be reasonably described by a linear model expressed as: ( , , , , ) = × ( , , , , ) + where RL and RM denote the reflectance of Landsat and MODIS images, respectively; (x,y) is the location of both Landsat and MODIS pixels; ti is the image acquired date; C is the land cover type; B is band of images and a and b are constants related to location and environment. To minimize error, in this study, we assume that different land cover types have different values of a and b.
Assuming the Landsat pixel number of land cover type C is n, then the relationship between all Landsat and MODIS pixel reflectance of land cover type C can be described as (3), and the relationship between Landsat and MODIS average reflectance of land cover type C can be described as (4).
where and denote the average reflectance of Landsat and MODIS images, respectively. From (4), we know that there is also a linear relationship between Landsat and MODIS land cover class average reflectance. If there are multitemporal (ti=t1,t2,…,tn) fine-and coarse-resolution images, a linear regression model can be used to acquire a and b in (4). Equation (4) can be written as Equation (5) Equation (6) shows that the change in fine-resolution land cover class average reflectance from ti to tj equals the scaled change in average reflectance ti to tj given by the coarse-resolution land cover class.

Class and Subclass Average Reflectance Calculation
Although it is mathematically feasible to use as many classes as the number of coarse-resolution images, too many endmembers may lead to ill-posed spectral unmixing [51]. Too few endmembers may lead to a decrease in the similarity of pixels within each endmember. To address this irreconcilable contradiction, we propose an efficient strategy for the unmixing problem that produces realistic estimated average reflectance of endmembers. The average reflectance of the land cover class was unmixed from the coarse-resolution images, and the average reflectance of the subclass was solved by building the time-series relationship between the subclass and land cover class.
In the FCMSTRFM, unmixing is performed by solving the linear mixing model. According to the linear mixing model, the reflectance RM(k,ti,B) of coarse-resolution pixel k that consists of m discrete land cover class C weighted by fractional A(k,C) can be described as Equation (7). If the number of MODIS pixels in the scenes is pn and the number of land cover classes is pm, the relationship between the reflectance of the coarse pixel and the average reflectance of the land cover class can be described as Equation (8).
constraint condition: where A(k,C) is the abundance matrix of mixed pixel k; and ε(C,ti) is the residuals. The time-series average reflectance of each land cover class can be calculated by solving Equation (8) using the ordinary least squares technique.
After obtaining the average reflectance of each land cover class, the average reflectance of the subclass is obtained by building the time-series relationship between the land cover class and subclass. If we assume that f(S,ti,B) is the ratio of the average reflectance of subclass S to the average reflectance of land cover class C (Equation (9)), f(S,ti,B) (Equation (9)) describes how the subclass average reflectance changes in terms of land cover class average reflectance variables. The ratio f(S,ti,B) is greater than 1 when the variation rate of the subclass is greater than that of the land cover class; f(S,ti,B) is less than 1 when the variation rate of the subclass is less than that of the land cover class.
( , , ) = ( , , )/ ( , , ) Because of the spectral temporal variability, f(S,ti,B) changes over time. Fortunately, the assumption that f(S,ti,B) changes linearly during short temporal intervals has been shown to be mathematically reasonable [52]. If date tp is between adjacent dates ti and tj, the relationship among f(S,tp,B), f(S,ti,B) and f(S,tj,B) can be described as Equation (10)-that is, Equation (10) can be used to calculate the ratio of average reflectance between subclass and class at any date among ti and tj. If there are multiphase images, Equation (10) can be used to build the time-series relationship between the subclass and class.

Pixel Reflectance Calculation
Because the pixels within each subclass have similar reflectance variability, the reflectance of each fine-resolution pixel at each time can be calculated by the surface reflectance calculation model (SRCM) proposed by Wu et al. [35]. This assumes that the temporal increment of each fine-resolution pixel in the same class (subclass in this paper) is constant and equal to the average reflectance increment of the class, described as follows: where RL(k,t0,B) and RL(k,tp,B) are the reflectance of pixel k of fine resolution at dates t0 and tp, respectively, and (k, t , B) and (k, t , B) are the average reflectance of subclass S of fine resolution at times t0 and tp, respectively. Combining Equations (6) and (11) Combining Equations (9) and (12) yields Equation (13) Since f(S,tp,B) can be calculated by Equation (9) and Equation (10), ( , , ) and ( , , ) can be calculated by Equation (8), and a can be calculated by Equation (6), we can obtain RL(k,tp,B) with Equation (13)-that is, the time-series reflectance of each fine-resolution pixel can be calculated by Equation (13).

STDFA
The STDFA was proposed by Wu et al. [35] to generate images with MODIS coverage frequency and Landsat spatial resolution using four steps: (1) clustering the high-resolution pixels to define endmembers based on two pairs of images; (2) calculating the abundances of each endmember within each coarse-resolution pixel; (3) unmixing the coarse-resolution pixel based on linear unmixing theory; and (4) calculating the reflectance of each high-resolution pixel.
The STDFA clusters the high-resolution pixels by dividing the increment of two pairs of images (generally at the beginning and end of the observation period) into several equal parts. After that, a sliding window the size of MODIS pixels is applied to the clustered result to record the endmember abundance matrix. Then, the unmixing of coarse-resolution pixels is performed by the linear mixing model introduced in Section 2.1.3 (Equation (8)). Then, the reflectance of each high-resolution pixel is calculated by the theory of SRCM (Equation (11)), which assumes that the pixels belonging to the same class have the same temporal changes.

ESTARFM
The STARFM and the ESTARFM are based on the premise that the changes in reflectance between fine-and coarse-resolution images are consistent. After data processing as described in Section 3.2, the ESTARFM applies the following steps to generate high spatial and temporal resolution images. First, a sliding window is applied to the Landsat image to identify similar neighboring pixels. Considering that the reflectance may vary significantly over time, the ESTARFM uses two fine-resolution images to select similar neighbor pixels and then extracts the intersection of the two dates to obtain a more accurate similar-neighbor pixel. Second, the weight W is assigned to each similar neighbor based on (i) the spatial distance between the central pixel and neighboring pixels and (ii) the spectral similarity between a pair base date image. Third, the conversion coefficient Vi is calculated by linear regression analysis. The final step is the calculation of central pixel reflectance, which can be characterized by Equations (14)(15). For a more theoretical detail of the ESTARFM algorithm, we refer to [28].
where Tz (z=m, n) is a proportion that is used to assign the weight that time tm and tn contribute to predict time tp.

Evaluation Metrics
We selected four different statistical criteria to evaluate the performance of the FCMSTRFM, the ESTARFM and the STDFA: correlation coefficient (R, Equation (16)), root mean square error (RMSE, Equation (17)), the Erreur Relative Globale Adimensionalle de Synthèse (ERGAS, Equation (18)), and the mean absolute difference (MAD, Equation (19)). R indicates the linear correlativity between the observed and predicted reflectance. The RMSE is calculated as the square root of the average value of the differences between the predicted and observed reflectance. The lower the RMSE is, the more reliable the performance of the model. ERGAS demonstrates the quality of the predicted reflectance. MAD reflects the average differences between the predicted and observed reflectance. The metrics are calculated as follows: = 100 1 ( / ) where kai and kpi are the observed and predicted reflectance of pixel i, np is the pixel number of the image, and are the average reflectance of the observed and generated images, H and L are the pixel sizes of the fine-and coarse-resolution images, Nb is the band number, and Mj is the average reflectance of band j.

Study Area
The FCMSTRFM was tested and validated in HPIA ( Figure 2); the longitude of this region ranges from 127°18'40.28" to 127°45'12.22", and the latitude ranges from 46°51'7.83" to 47°4'8.33". The northeastern area of HPIA is a high mountain, while the southwestern area of HPIA is a plain. The main land cover type of the plain is farmland, where the crop growing period is short and phenology changes rapidly. The high mountains are generally covered by broadleaved deciduous forests. The farmland and high mountains are covered by green vegetation from April to November and covered by snow at other times. HPIA is a complex heterogeneous region with diverse land use types, including grassland, road and water. The heterogeneous region is suitable for testing the FCMSTRFM.

Data Preprocessing
Ten Landsat 8 Operational Land Imager (OLI) and MODIS datasets were used in this study (Table 1). Table 2 shows the attributes of the Landsat 8 OLI bands and their matched MODIS bands. All images were acquired in clear sky conditions and provided by the United States Geological Survey (USGS) (https://glovis.usgs.gov/). The Landsat images were atmospherically corrected using the FLAASH Atmospheric Correction Model in ENVI 5.3 software (The Environment for Visualizing Images, provided by Harris Geospatial Solutions, Inc in Florida, USA). Landsat 8 OLI has been geometrically corrected, but to obtain higher accuracy, the Landsat images were georeferenced using a topographic map of 1:10,000 scale based on the nearest neighbor resampling method with a position error within 0.5 pixels. MODIS level L2G product MOD09GA and MODIS level L3 product MOD13Q1 were reprojected to the UTM-WGS84 52N projection using the MODIS Reprojection Tool (MRT), clipped to the extent of the Landsat images, and resampled to Landsat resolution using a nearest neighbor approach. In addition to Landsat and MODIS image data, there is also a need for land cover mapping. The accuracy of land cover mapping has an important influence on the accuracy of the generated images. Although there are many Landsat-scale land cover mapping datasets available, we did not use the downloaded dataset of the land cover data and selected the Landsat 8 OLI image data to classify the area into water, grassland, forest, paddy field, upland field and impervious (Figure 2) based on a support vector machine (SVM). Verification by ground samples indicated that the overall accuracy of the land cover data was 87.33%, and the kappa coefficient was 86.07%.

Evaluation of the FCMSTRFM
The accuracy of the FCMSTRFM was evaluated by comparing the generated images against the observed cloudless images obtained from the Landsat 8 OLI acquired on 01 June 2018, and 30 April 2018. Table 3 shows the accuracy of the generated image of three bands (red, green and NIR bands), indicating that the FCMSTRFM can generate images very similar to the observed images (the FCMSTRFM also can be used in cirrus and thermal bands). Most generated images have a high accuracy with R higher than 0.70 and RMSE less than 0.04. To assess whether the FCMSTRFM is suitable for regions where the land cover type changes, the three reflectance bands generated by the FCMSTRFM were compared with those observed on time series based on the availability of clear scene images (Figure 3). Although the land cover type has suffered complex processes (for example, green vegetation-covered land changed to bare land in December, farmland and grassland was covered by snow in February), it is clear that the images generated by the FCMSTRFM are very similar to those observed.   Figure 4 shows the relationship between the observed and predicted reflectance on June 01, 2018, for the green, red and NIR bands. For the green and red bands, most data points in the scatterplots are close to the 1:1 line, indicating that the three models can generate reflectance very similar to the actual reflectance in these two bands, but only the FCMSTRFM can generate reflectance similar to the actual reflectance in the NIR band. Figure 5 shows absolute value of different between observed and predicted reflectance on June 01, 2018, for the three models. It is clear that the FCMSTRFM can generate reflectance closest to the actual reflectance. Almost all images, especially those of the ESTARFM, in Figure 5 are visually similar to land cover mapping, indicating that the predicted reflectance is strongly influenced by the land use type. For three models, there is a wavelength dependency in the improvements (Figures 4 and 5), particularly in the NIR region. This is caused by the different of reflectance variation range, green and red band change from 0 to 0.2, but NIR from 0 to 0.5. Figure 6 shows the density curve of the difference between the generated image and observed image for the three methods. The curve of the FCMSTRFM is much more concentrated, the bias was close to zero and the difference was constrained to ± 0.01, except for a few outliers. Based on the statistical comparison (Table 4) between the three methods, the correlation coefficient increased by 0.02 and RMSE decreased by 0.01 on average. Almost all statistical results of the FCMSTRFM are greater than those of the ESTARFM and the STDFA, except for a few marked in black (Table 4), which means that the reflectance predicted by the FCMSTRFM was more accurate than that from the other two methods. Figure 7 shows the Landsat-like images generated by the FCMSTRFM, the STDFA and the ESTARFM. It is clear that the image generated by the ESTARFM (Figure 7d) is more similar to the observed image (Figure 7(c)), and the image generated by the STDFA (Figure 7e) contains some reflectance errors caused by cluster errors of similarity pixels. Somewhat "blurry" areas can be seen in the images generated by the FCMSTRFM and the STDFA. The image generated by the ESTARFM (Figure 7f) contains more spatial details, but some reflectance change information has been stretched or compressed.

Influence of Image Registration
Due to geometric correction and coordinate conversion errors, there is deviation in the image registration. To evaluate the influence of image registration, we moved the Landsat image in one direction (Figure 8) to obtain the RMSE change with the movement of the Landsat image. Figure 9 shows the relationship between the number of moving pixels and the RMSE of the generated reflectance. We found that when the number of moving Landsat pixels exceeded six, the accuracy of the generated images began to decrease substantially, indicating that the accuracy of registration strongly influences the performance of the FCMSTRFM. Therefore, the error of image registration should be small within six Landsat pixels.

Influence of the Accuracy of the Land Cover Map
In the FCMSTRFM, one notices that the pixels in the same land cover class have a similarity spectrum within a vegetation growth period, and in PHC, the fine-resolution images are categorized first by land cover mapping. The accuracy of land cover mapping directly determines the accuracy of the first clustering. If the accuracy of the land cover mapping is poor, the similarity of the land cover classes will be low. Therefore, it is clear that the accuracy of land cover mapping strongly influences the performance of the FCMSTRFM. We assess the influence of errors of land cover type classification by forcing randomly convert certain proportion pixels (5%, 10%, 15%, 20% and 25%) to be the wrong type 10 times, such as turning farmland into an impermeable layer or forestland, is used to obtain the RMSE of the generated image change with the conversion of the land cover type. Figure  10 shows the relationship between the converted proportion of land cover type and the average value of RMSEs for all the images generated by the FCMSTRFM. It shows that RMSEs had a positive correlation with the converted proportion of land cover type: the higher the converted proportion of land cover type was, the higher the RMSEs. That means that the lower the accuracy of the land cover map is, the lower the accuracy of the generated images. Figure 10. The relationship between the RMSE of the generated image and the converted proportion of land cover type.

Influence of Temporal and Spatial Heterogeneity
To investigate the sensitivity of the three models to temporal changes in reflectance, we chose the variance of difference between two fine-resolution images ( in Equation (20)) as a measure of the reflectance change intensity, and RMSE as a measure of the accuracy of generated images. The three models have a positive correlation with the RMSE of the generated images ( Figure 11). For almost all models, the stronger the temporal changes in reflectance are, the lower the accuracy of the generated images. Moreover, the slope of the FCMSTRFM is less than that of the ESTARFM and the STDFA. When the reflectance drastically changes over time, the accuracy of the FCMSTRFM is minimally reduced. This means that the FCMSTRFM has an advantage in regions with drastic temporal changes in reflectance.
where np is the number of pixels in the whole fine-resolution image; kig and kjg are the reflectance of pixel g at times ti and tj, respectively; and ui and uj are the average reflectance of the whole image at times ti and tj, respectively. To investigate the sensitivity of the three models to the spatial heterogeneity of reflectance, we chose the variance for all pixels in the actual image as the measure of pixel spatial heterogeneity and RMSE as the measure of accuracy of the generated images. Figure 12 shows that for the three models, the variance for all the pixels in the actual image had a positive correlation with RMSE: the variance increased with the RMSE, indicating that the accuracy of the images generated by STFMs decreased with reflectance spatial heterogeneity. Moreover, the slopes of the FCMSTRFM and the ESTARFM were less than that of the STDFA. When reflectance spatial heterogeneity is complex, the accuracies of the FCMSTRFM and the ESTARFM are minimally reduced.

FCMSTRFM Improvements to Existing Models
Compared with other STFMs, the FCMSTRFM shows the following three distinct advantages. First, the FCMSTRFM develops a new method called PHC that combines land cover map and timeseries data cluster methods to group high-resolution pixels. PHC has the capacity, which is not available in other UMX-based methods such as the STDFA, to easily group together pixels that have similar temporal reflectance changes. The land cover map is a simple and effective division of surface information [53][54][55] and is unchanged in a short time period. Therefore, land cover maps can be used to coarsely classify time-series images [56][57][58]. Time-series images contain more information than a single image, and time-series data cluster methods can maximize the similarity of pixels in the same class while minimizing the similarity of different classes [4,59]. Therefore, time-series data cluster methods can be used to subclassify land cover classes.
The similarity of pixels that are grouped together directly determines the accuracy of UMXbased models [39,52]. To evaluate PHC adopted by the FCMSTRFM, we compared three clustering methods: directly using the FCM, the clustering method applied in the STDFA, and PHC applied in the FCMSTRFM. We applied these three methods to cluster high-resolution data into 50 categories (same to the number of subclass), and the similarity of categories was evaluated by the average variance of each category. Table 5 shows the average variance of 50 categories for each method. It is clear that PHC performs better than the other two methods, which simultaneously demonstrates that the use of land cover maps can improve similarity pixel clusters. PHC can provide a reference for obtaining the endmember fraction matrix. Second, the FCMSTRFM is suitable for regions where land cover types change over time. Many STFMs are based on the assumption that land cover types do not change over time. If the land cover types were to change over time, the accuracy of those STFMs would be substantially reduced [22,25,28,33]. The FCMSTRFM uses multiphase images and land cover maps to capture information about changes in land cover type and to group similar pixels, and it is easy to ensure that the pixels grouped together have the same land cover type within the whole period. Influenced by accuracy of land cover mapping, there is a possible that land cover mapping not can ensure the pixels grouped together have the same land cover type within the whole period. However, we subdivided each land cover class and defined the subdivisions as subclasses; compared with land cover class, with subclass, it is easy to ensure that the pixels grouped together have the same land cover type within the whole period. Third, the FCMSTRFM has higher computational efficiency than that of STARFM-based methods. The main steps of the FCMSTRFM are pixel clustering and class-average reflectance calculation, but STARFM-based methods mainly use similar pixel selection and pixel-by-pixel conversion coefficient calculations [20,28,32]. Compared with the FCMSTRFM, STARFM-based methods have higher time and space algorithm complexity. As a result, the FCMSTRFM would be useful for applications in large study areas.

The Application of FCMSTRFM
Although the FCMSTRFM was designed for spatiotemporal fusion of reflectance data, it also has the potential to blend other remote sensing data, such as vegetation index (VI) products. To test the applicability of the FCMSTRFM to other products, we assessed its performance in fusing the MODIS L3 product MOD13Q1 (providing products EVI and NDVI) with Landsat VI in the HePing Irrigation area. Then, the blended time-series VI products were used to map rice by the method of spectral correlation similarity (SCS). The precision of VI (Table 6) suggested that the FCMSTRFM can be used to generate high-accuracy VI products. The map result is evaluated by a sample from Google Earth images and field investigations, and the accuracy suggests that VI blended by the FCMSTRFM can be used to map crops. For more detailed information about the method and results of rice mapping, we refer to [60]. Furthermore, Sentinel 2 has a five-day revisit time and higher spatial resolution than Landsat (10-20 m), and has more cloud-free images than Landsat. The FCMSTRFM increases the similarity of pixels grouped together by input multi-temporal images, and use time-series images to record the pixels temporal change. For the FCMSTRFM, the accuracy of the generated images increases with the number of input images. Thus, we consider the FCMSTRFM more suitable for Sentinel 2 data fusion than other models. We also used the ESTARFM to map rice, but it was difficult to map rice from the VI data generated by the ESTARFM. It can be known from Section 4.2 that to get the benefit from the selection of similar neighboring pixels and the calculation of each similar neighboring pixel weight pixel by pixel, the ESTARFM can keep spatial detail information perfectly ( Figure 12) when pixels have strong spatial heterogeneity [20,28]. However, because the ESTARFM only captures the temporal changes in pixels through the calculation of VI by two-phase images, there is a large possibility that STARFMbased models cannot capture the pixel time change process perfectly ( Figure 11) when the pixel time changes sharply [20,25,52]. The FCMSTRFM captures each subclass time-series change through multiphase images, so it more easily captures the subclass time-series change process and obtains a class (class that is grouped by generated time-series Landsat-like EVI data) series curve that is more similar to reality. We extracted rice area by SCS between the standard series EVI curve and class series curve, so it more accurately mapped rice.

Conclusion
This paper developed a new model, the FCMSTRFM, to generate high-spatial-resolution timeseries images using fine-and coarse-resolution images. Compared with the STDFA and the ESTARFM, the FCMSTRFM can generate images more accurately, especially for regions with substantial temporal changes in pixels. The main contributions of this study are as follows. First, we developed a new method to group high-resolution pixels to define endmember terms as PHCs. To the best of our knowledge, this is the first time that land cover maps and time-series data cluster methods are combined to group high-resolution pixels. Second, a new strategy is used to calculate the average reflectance of the subclass, which aims to break the ill-posed spectral unmixing limit to the precision of generating images. The results show that the FCMSTRFM is capable of increasing the similarity of pixels grouped together and the precision of the average reflectance and improving the computational efficiency without loss of precision.
Similar to many UMX-based methods, the precision of the FCMSTRFM is currently mainly limited by grouping similar pixels, calculating the average reflectance and quality of images. We believe that the only way to capture the features of temporal changes in pixels is to input as many images to STFMs as possible; Sentinel 2 has higher temporal resolution than Landsat and can get more cloud-free images, so we think it more suitable for data fusion. Almost all STFMs based on clear scene images-that is, images partly contaminated by clouds, are unavailable. In fact, cloud-free pixels in images partly contaminated by clouds can provide valuable information for data fusion. If the denoising time-series data clustering algorithm can be used to group pixels, these images can be used to capture the features of temporal and spatial changes in pixels. Aerosols and BRDF effects impact the quality of images, if atmospheric corrections take into account aerosols, and fusing images can be normalized to nadir BRDF adjusted reflectance in STFMs, the fusing result will be improved.
Author Contributions: J.Y. and Y.Y. prepared the manuscript. Y.W., Y.Z., K.J. and X.Z. revised the manuscript. K.S., X.B. and X.G. participated the discussion. All authors have read and agreed to the published version of the manuscript.