A Framework for Generating High Spatiotemporal Resolution Land Surface Temperature in Heterogeneous Areas

: Land surface temperature (LST) is a crucial biophysical parameter related closely to the land–atmosphere interface. Satellite thermal infrared measurement provides an effective method to derive LST on regional and global scales, but it is very hard to acquire simultaneously high spatiotemporal resolution LST due to its limitation in the sensor design. Recently, many LST downscaling and spatiotemporal image fusion methods have been widely proposed to solve this problem. However, most methods ignored the spatial heterogeneity of LST distribution, and there are inconsistent image textures and LST values over heterogeneous regions. Thus, this study aims to propose one framework to derive high spatiotemporal resolution LSTs in heterogeneous areas by considering the optimal selection of LST predictors, the downscaling of MODIS LST, and the spatiotemporal fusion of Landsat 8 LST. A total of eight periods of MODIS and Landsat 8 data were used to predict the 100-m resolution LST at prediction time t p in Zhangye and Beijing of China. Further, the predicted LST at t p was quantitatively contrasted with the LSTs predicted by the regression-then-fusion strategy, STARFM-based fusion, and random forest-based regression, and was validated with the actual Landsat 8 LST product at t p . Results indicated that the proposed framework performed better in characterizing LST texture than the referenced three methods, and the root mean square error (RMSE) varied from 0.85 K to 2.29 K, and relative RMSE varied from 0.18 K to 0.69 K, where the correlation coefﬁcients were all greater than 0.84. Furthermore, the distribution error analysis indicated the proposed new framework generated the most area proportion at 0~1 K in some heterogeneous regions, especially in artiﬁcial impermeable surfaces and bare lands. This means that this framework can provide a set of LST dataset with reasonable accuracy and a high spatiotemporal resolution over heterogeneous areas.


Introduction
Land surface temperature (LST) is a crucial terrestrial geophysical variable that affects the heat transformation process between the land surface and the atmospheric boundary layer [1]. Its spatiotemporal dynamics play an important role in impacting the surface energy balance, soil moisture content, evapotranspiration, and surface thermal environment [2,3]. As such, continuous spatiotemporal estimation of LST is essential for related fields of terrestrial surface process on a regional or global scale, such as soil moisture content monitoring, vegetation evaporation estimation, water and heat flux measurement, and urban heat island (UHI) monitoring [4]. fine/coarse resolution is too large [26,27]. The spatiotemporal data fusion method could produce long time-series LSTs but is limited by the resolution of high-resolution LST [28]. Thus, making full use of the advantages of LST downscaling and spatiotemporal image fusion to generate LST with high spatiotemporal resolutions has attracted much attention in recent years. At present, many scholars have developed some hybrid strategies [23,28]. For instance, Bai et al. [28] used the extreme learning machine algorithm to first downscale the Landsat ETM+ TIR to 30-m and then used the SADFAT to fuse MODIS LST time series and the downscaled LST for generating 30-m resolution LST time series. Xia et al. [23] proposed the regression-then-fusion (RF) strategy, applied the RF algorithm to downscale the Landsat 8 LST data to the 30-m resolution, and then used the downscaled LST as input for the STARFM to sharpen MODIS LSTs. These methods blend the advantages of regression and data fusion and present better performances than the first two kinds of methods. In spite of this, some key issues still need to be fully considered for the hybrid strategy. On the one hand, due to the large difference in resolution between MODIS LST and Landsat LST, if the MODIS LST is directly re-sampled to fine pixels within the 100-m range for subsequent spatiotemporal data fusion, a large amount of LST change information will not be captured. On the other hand, since the previous hybrid strategies usually employed the STARFM and SADFAT algorithms that do not fully consider the LST pattern heterogeneity of complex landscapes to produce the high-resolution LST time-series, the LSTs generated over the heterogeneous surface usually present out poor accuracy. As a result, it is very necessary to develop a more effective method to gain accurate high spatiotemporal resolution LSTs in non-uniform regions.
According to the above statement, the main objective of this study is to develop a new framework for producing an LST dataset with reasonable accuracy and a high spatiotemporal resolution. Similar to the RF strategy, the proposed new framework also will blend the regression-based LST downscaling process and the spatiotemporal image fusion technology. However, different from the previous RF strategy, this framework will take into full account the optimal selection of LST predictors, the downscaling of MODIS LST, and the spatiotemporal fusion of Landsat 8 LST in complex landscapes for maintaining the accuracy and detailed texture of LST image. As a whole, there are the following two advantages: (1) it can greatly improve the LST prediction accuracy using the downscaling of MODIS LST to better capture the change in MODIS LST from basic time t b to target time t p ; and (2) it can acquire fine spatial texture of LST in the heterogeneous landscape using the flexible spatiotemporal data fusion (FSDAF) algorithm that considers the landscape heterogeneity of land surface fully.
The rest of this article is organized as follows. Section 2 introduces our study area and data set. Section 3 describes the establishment process of this framework. Section 4 shows the results, and Sections 5 and 6 give the discussions and conclusions, respectively.

Study Area
A total of two areas with complex landscapes in China were selected as cases in our study in order to make our framework more representative. Figure 1 presents the geolocations of the study areas with two LULC maps generated from the global land cover (GLC) dataset. The topography is high in the north part and low in the south part. Regarding the landcover type, this region is mainly characterized by artificial surfaces, cultivated lands, forest lands, and water bodies. Among them, urban artificial surfaces and forest lands are two kinds of main land-cover types, and they are widely distributed from the north part of the study area and the south part of the study area, respectively. Because the landscape type of this region is complex, this study area is of significant meaning for our experiment.

Data Collection and Image Processing
Since the Earth observation time of Terra and Landsat 8 satellites are similar, this paper used eight periods of MODIS and Landsat 8 remote sensing images collected in the two study areas to realize the construction of this framework. Study area A used four periods of satellite images collected on 5 July 2013, 21 July 2013, 24 July 2014, and 9 August 2014 (Image ID: A1, A2, A3, A4). Study area B used four periods of satellite images acquired on 4 September 2014, 6 October 2014, 12 September 2017, and 28 September 2017 (Image ID: B1, B2, B3, B4). All selected satellite images were collected under clear sky conditions, and Table 1 shows the information of the data set used. Study area A is located in Zhangye of Gansu Province in China, which belongs to the middle reach of the Heihe River Basin. The coordinate range is between latitudes 38 • 42 N and 39 • 08 N and longitudes 100 • 08 E and 100 • 41 E. This region possesses a temperate continental arid climate type, the annual average temperature is approximately 7 • C, and the annual average precipitation is less than 200 mm. The topography is high in the northeast part and southwest part and low in the central region. From the perspective of land-cover type, this region is mainly characterized by cultivated lands, deserts, bare lands, and artificial surfaces. In recent years, with the continuous implementation of the Heihe Watershed Allied Telemetry Experimental Research (Hi-WATER), this region is frequently selected as the case to implement the scientific experiments on satellite and ground observations [22,29].
Study area B is located in Beijing of China, which belongs to the transitional zone between the North China Plain and the Inner Mongolian Plateau. The coordinate range is between latitudes 40 • 03 N and 40 • 30 N and longitudes 116 • 03 E and 116 • 37 E. This region has a warm temperate continental climate type, the annual average temperature is approximately 10~12 • C, and the annual average precipitation is approximately 483.9 mm. The topography is high in the north part and low in the south part. Regarding the landcover type, this region is mainly characterized by artificial surfaces, cultivated lands, forest lands, and water bodies. Among them, urban artificial surfaces and forest lands are two kinds of main land-cover types, and they are widely distributed from the north part of the study area and the south part of the study area, respectively. Because the landscape type of this region is complex, this study area is of significant meaning for our experiment.

Data Collection and Image Processing
Since the Earth observation time of Terra and Landsat 8 satellites are similar, this paper used eight periods of MODIS and Landsat 8 remote sensing images collected in the two study areas to realize the construction of this framework. Study area A used four periods of satellite images collected on 5 July 2013, 21 All selected satellite images were collected under clear sky conditions, and  Table 1 shows the information of the data set used. (1) MODIS product The MODIS sensor provided plentiful satellite products for understanding the surface change at the global scale and has a moderate spatial resolution with daily continuous global coverage. In this study, the MOD11A1 (collection 6) was used to provide the LST data, and the MOD09GA and MCD12Q1 were used to extract the candidate predictors. The MOD09GQ was used to calculate the final predictors for performing the MODIS LST downscaling. All selected MODIS products were downloaded from the Next Generation Earth Science Discovery Tool (https://ladsweb.modaps.eosdis.nasa.gov/search/, accessed on 28 September 2021).
The MOD11A1 includes the pixel-by-pixel LST and LSE with a 1-km resolution in a sequence of swath-based to grid-based global products, whose LST is derived from the channels 31 and 32 of MODIS using the generalized split-window algorithm [30]. Previous studies have indicated that the estimated LST has good accuracy with less than 1.3 K for most homogeneous surfaces, which has been widely used in LST change analysis [31]. The MOD09GA provides seven bands with 500-m resolution in the Sinusoidal projection. The MCD12Q1 provides seventeen kinds of land-cover information with a 500-m resolution each year. The MOD09GQ provides two bands with 250-m resolution in the Sinusoidal projection.
(2) Landsat 8 product Landsat 8 data provides eight 30-m resolution visible and infrared bands and two 100-m resolution thermal infrared bands for the Earth's monitoring, but there is a relatively long revisiting period of 16 days. In this study, the Ready-To-Use (RTU) Landsat 8 LST product provided by the Chinese Academy of Sciences was used for helping the spatiotemporal fusion of LST and was further used as an actual reference for the evaluation of the predicted LSTs. The RTU product can be acquired from the DATABANK Remote Sensing Data Engine (http://databank.casearth.cn, accessed on 28 September 2021).
The RTU LST product is produced with the generalized single-channel (GSW) algorithm and covers China and central Asia. It provides the LST estimate after 2000 [32]. Previous studies have indicated the comparison between the RTU LST product and in-situ LST measurement in three regions (Xuanwu Lake, Zoucheng, Huairou of China) shows good accuracy, with an average RMSE of 0.83 K [33]. Thus, it can be ensured the RTU LST product can reveal the actual distribution condition of LST to some extent.

(3) DEM image
Because the LST distribution also depends on the geographical location and topographic factors [34], the DEM derived from the ASTER GDEM was collected in this paper to yield five LST predictors: elevation, slope, aspect, longitude, and latitude. In this paper, the GDEM data were collected from the Center for Earth Observation at Yale University (https://yceo.yale.edu/aster-gdem-global-elevation-data/, accessed on 28 September 2021).
The GDEM data were mainly generated from the ASTER sensor onboard the Terra and has been applied in topography studies from 83 • S latitude to 83 • N latitude. It has a resolution of 30 m with an absolute vertical error of less than 20 m [35].
(4) Image processing Using the MODIS re-projection tool (MRT), all MODIS products in the HDF-EOS format were re-projected to the Universal Transverse Mercator (UTM) WGS-1984 projection and re-sampled to 1-km, 500-m, 500-m, and 250-m resolutions. In addition, for better matching the geographic location between MODIS product and Landsat 8 product, the collected MODIS products were all geo-referenced to the locations of the RTU LST products by selecting control points such as road and river intersections using the image-to-image module of ENVI 5.3 software.

Overview
The proposed framework mainly consists of three steps, including (1) the optimal selection of LST predictors; (2) the downscaling of MODIS LST product; and (3) the spatiotemporal image fusion of Landsat 8 LST at t p . For simplicity, we refer to the new framework as the three-step method for short. The detailed implementation is illustrated in Figure 2. The GDEM data were mainly generated from the ASTER sensor onboard the Terra and has been applied in topography studies from 83° S latitude to 83° N latitude. It has a resolution of 30 m with an absolute vertical error of less than 20 m [35].
(4) Image processing Using the MODIS re-projection tool (MRT), all MODIS products in the HDF-EOS format were re-projected to the Universal Transverse Mercator (UTM) WGS-1984 projection and re-sampled to 1-km, 500-m, 500-m, and 250-m resolutions. In addition, for better matching the geographic location between MODIS product and Landsat 8 product, the collected MODIS products were all geo-referenced to the locations of the RTU LST products by selecting control points such as road and river intersections using the image-toimage module of ENVI 5.3 software.

Overview
The proposed framework mainly consists of three steps, including (1) the optimal selection of LST predictors;    To be specific, in step 1, LST predictors used for downscaling MODIS LSTs are determined according to the variable importance in the RF algorithm. In step 2, to better capture the change information of MODIS LST from t b to t p , the MODIS LSTs at t b and t p are sharpened to 250-m by using the RF-based LST downscaling. In step 3, after the 100-m resolution RTU Landsat 8 LST product at t b is obtained, the downscaled 250-m MODIS LSTs at t b and t p are together imported into the FSDAF algorithm to generate the 100-m Landsat 8-like LST at t p . Finally, the actual RTU LST at t p is used to evaluate the predicted LST at t p objectively. For each study area, we used the Terra/MODIS LST products and Landsat 8 remotely sensed images from the same year with different basic and prediction times to achieve this framework. One input date pair is regarded as the base data at t b , while the other is used as the prediction data at t p . Using the proposed three-step method, we can obtain the Landsat 8-like LST at t p given one LST date pair at t b and one downscaled MODIS LST at t p . The specific implementation of this framework is expounded in Section 3.2.

Construction of the Framework
Step 1: Selection of LST predictors The choice of LST predictors plays an essential role in implementing the spatial downscaling of MODIS LST. Since the LST in heterogeneous areas is strongly affected by various biophysical factors, a large number of predictors associated with LST have been widely adopted to characterize LST change in previous studies [14]. However, selecting a number of LST predictors to reveal LST patterns can lead to data redundancy and multicollinearity among variables. In addition, this process will waste lots of time to establish the LST downscaling model. Thus, in order to provide a relatively robust LST downscaling model, this step aims to determine the optimal LST predictors from a number of auxiliary parameters by reducing the redundancy of relevant factors.
By referring to a large number of studies [8,18,[36][37][38][39][40][41], our study selected a total of fifteen parameters significantly correlated with LST as candidate factors to determine reliable LST predictors. Eight remotely sensed indexes impacting the LST were extracted from the MOD09GA (i.e., NDVI; percent vegetation, PV; the normalized difference moisture index, NDMI; bare soil index, BSI; the soil adjusted vegetation index, SAVI; NDBI; integrated ecological index, IEI; and MNDWI). Five terrain elements that are highly correlated with LST differentiation were acquired from the GDEM (i.e., elevation, slope, aspect, longitude, and latitude). The land-cover type that effectively reveals the change in LSTs was acquired from the MCD12Q1 product. In view of the importance of LSE in the process of LST retrieval, LSE was also calculated using the NDVI threshold method proposed by Sobrino et al. [42]. Partial factors were calculated as follows: where NIR is the near-infrared band reflectance; RED is the red band reflectance; SWIR 1 is the shortwave infrared band reflectance in 1.560~1.660 µm; BLUE is the blue band reflectance; SWIR 2 is the shortwave infrared band reflectance in 2.100~2.300 µm; and GREEN is the green band reflectance. IEI is the integrated ecological index proposed by Zhu et al. [43], which is calculated using the PCA with SAVI, NDMI, BSI, and NDBI indexes, and has been normalized to the range of 0~1. PC 1 is the first principal component of PCA; a i is the variance contribution weight of the principal component; PC i is the first principal component for each ecological factor; and m is the number of surface ecological factors.
Since the multicollinearity has little effect on the predictive ability of the RF regression model, this article used the variable importance score of each candidate LST predictor to select the optimal LST predictors [44]: (1) the resolutions of fifteen candidate factors were resampled to 1 km using the pixel aggregation tool of ENVI software; (2) corresponding pixels from the 1-km resolution factors and the MODIS LST were selected; (3) after the outlier removal, RF regression was used to implement the non-linear fitting one-by-one by removing or replacing certain factor; and (4) after the variable importance of fifteen factors were all calculated, the factors with a high Gini index were chose to implement the non-linear regression until the residual up to the minimum; (5) via the incessant tests, the optimal LST predictors were determined based on the fitting residual of RF regression model. In this paper, the selected LST predictors were fixed as the PV, elevation, slope, longitude, and latitude in the end.
Step 2: Downscaling of MODIS LST Since previous fusion methods usually re-sample the MODIS LST data from 1-km resolution to the resolution within 100-m for the subsequent spatiotemporal data fusion, this process will lose detailed spatiotemporal change information of LST. Fortunately, the Terra/MODIS sensor can provide two kinds of resolutions PV images (i.e., 250-m and 500-m) every day to assist the downscaling of MODIS LST. Meanwhile, these PV images have the same instantaneous observation time as MODIS LST images. In our study, for better capturing the spatial texture information of LST change and enhancing the performance of LSTs fusion procedures in the following process, we chose the 250-m resolution PV image derived from Terra/MODIS to sharpen MODIS LSTs from 1-km to 250-m resolutions. This is because that the introduction of a 250-m resolution meets the requirement that the resolution difference is less than 3~5 times in the spatially non-uniform surfaces [26] and offers more abundant spatiotemporal change information of LST in the process of LST downscaling.
In previous MODIS LST downscaling studies, GLR, MLR, and GWR regression models have been extensively applied to estimate model coefficients and then to predict the highresolution LSTs with calibrated parameters and LST predictors [14]. However, under the assumption that the local atmospheric condition is homogeneous, the physic mechanism of a linear regression model is very hard to apply in complex regions due to the changes in land-cover types. Since the RF algorithm has remarkable performance in automatically settling the non-linear relationship between the LST and its predictors by constructing and averaging a large number of randomized and de-correlated decision trees [18], in this study, we used the RF-based non-linear regression model to implement the downscaling of Remote Sens. 2021, 13, 3885 9 of 24 MODIS LST. With the selected LST predictors in step 1, the detailed process of the MODIS LST downscaling is displayed in Figure 2 and is summarized as follows: (1) The 250-m resolution MOD09GQ product and 30-m resolution GDEM were used to calculate the selected LST predictors and then were aggregated to 1 km and 250 m, respectively. LST predictors with a resolution of 1 km belong to the MOD11A1 pixel level, and LST predictors with a resolution of 250 m belong to the MOD09GQ pixel level. (2) The RF regression model was used to construct the relationship between MODIS LST and five predictors at the resolution of 1 km, which can be expressed as follows: (10) where LST 1km denotes the MODIS LST and is fitted by the RF regression with f as a non-linear function; f denotes the function between LST and its predictors; PV 1km is the aggregated PV image using the MOD09GQ with the resolution of 1 km; elevation 1km , slope 1km , longitude 1km and latitude 1km are all the aggregated LST predictors derived from the GDEM with the resolution of 1 km; and ε 1km is the residual of RF regression at a spatial resolution of 1 km. (3) By assuming that regression residuals are uniformly distributed in space, the ordinary kriging interpolation were used to interpolate the residual with a 1-km resolution to 250 m. (4) By assuming that the relationship between LST and its predictors within 1-km resolution is scale-invariant for 250-m resolution, the MODIS LST was sharpened at t b and t p to 250 m based on the linking model at 1-km resolution and combined with the residual and predictors at 250-m resolution: LST 250m = f PV 250m , elevation 250m , slope 250m , longitude 250m , latitude 250m) + ε 250m (11) where LST 250m is the downscaled MODIS LST with a resolution of 250 m; PV 250m is the aggregated PV from the MOD09GQ with a resolution of 250 m; elevation 250m , slope 250m , longitude 250m , and latitude 250m are all the aggregated terrain factors derived from the GDEM with a resolution of 250 m; E 250m is the regression residual with a resolution of 250 m.
Step 3: Spatiotemporal image fusion of LST.
After the MODIS LST products at t b and t p were downscaled to a 250-m resolution, and the FSDAF algorithm was used to predict the Landsat 8-like LST product at t p in combination with the RTU LST product at t b . Initially, this algorithm was mainly used to fuse land surface reflectance with the high spatiotemporal resolutions in heterogeneous areas using daily low-resolution MODIS surface reflectance and high spatial resolution reflectance. Later on, many studies also used this algorithm to estimate the other surface physical parameters (e.g., NDVI and LST). This algorithm makes full use of the image texture details from the neighboring pixels and effectively considers the abrupt landcover type changes in the heterogeneous regions. It requires the same input data as two widely used spatiotemporal fusion methods, including STAFRM [24] and STITFM algorithms [21], whereas its prediction performance is more accurate than the STAFRM and STITFM by uniting the advantages of spectral unmixing analysis and a thin plate spline (TPS) interpolator [45].
The LST spatiotemporal fusion based on the FSDAF is mainly implemented by using six steps as follows [45]: (1) classifying the RTU LST product at t b into five main levels based on the SVM classifier; (2) estimating the temporal change information of each LST level in the downscaled MODIS LST image from t b to t p ; (3) predicting the Landsat 8-like LST at t p using the class-level temporal change information, and calculating the residuals at each downscaled MODIS LST pixels; (4) using the TPS interpolation to predict the Landsat 8-like LST at t p from the downscaled MODIS LST at t p ; (5) distributing the residuals at each pixel of the downscaled MODIS LST using the TPS prediction; and finally (6) acquiring the prediction image of Landsat 8-like LST using the information in neighboring pixels.
Specifically, the equation used to fuse the Landsat 8-like LST at t p through the FSDAF algorithm is given as follows: whereF 2 (x ij , y ij , LST) is the final LST prediction value of the target pixel (x ij , y ij ) at t p ; F 1 (x ij , y ij , LST) is the LST value of the j-th fine pixel (i.e., Landsat 8 LST) within the coarse pixel at location (x i , y i ) observed at t b ; n is the number of similar pixels for (x ij , y ij ) in a sliding window; w k is the weight for the k-th similar pixel; and ∆F(x k , y k , LST) is the LST prediction of the total change in a fine pixel between t b and t p . w k is an important parameter for the spatiotemporal data fusion and is determined by the spatial distance between similar pixels and the target pixel with: where w k is the weight; D k is a relative distance ranging from 1 to √ 2; (x k , y k ) and (x ij , y ij ) denote the target pixel and similar pixels in a sliding window, respectively; and w is the size of the neighborhood, which is determined by the homogeneity of the study area and the size of coarse pixels. The schematic diagram of similar pixels in a moving window is displayed in Figure 3.
uals at each pixel of the downscaled MODIS LST using the TPS prediction; and finall acquiring the prediction image of Landsat 8-like LST using the information in neighbo pixels.
Specifically, the equation used to fuse the Landsat 8-like LST at tp through the FS algorithm is given as follows: is the final LST prediction value of the target pixel (xij, yij) F1(xij, yij, LST) is the LST value of the j-th fine pixel (i.e., Landsat 8 LST) within the co pixel at location (xi, yi) observed at tb; n is the number of similar pixels for (xij, yij) sliding window; wk is the weight for the k-th similar pixel; and ΔF(xk, yk, LST) is the prediction of the total change in a fine pixel between tb and tp.
wk is an important parameter for the spatiotemporal data fusion and is determ by the spatial distance between similar pixels and the target pixel with: where wk is the weight; Dk is a relative distance ranging from 1 to 2 ; (xk, yk) and (x denote the target pixel and similar pixels in a sliding window, respectively; and w i size of the neighborhood, which is determined by the homogeneity of the study area the size of coarse pixels. The schematic diagram of similar pixels in a moving windo displayed in Figure 3.  ∆F(x ij , y ij , LST) depends on the changes in MODIS LSTs from t b to t p , which can be estimated using the distributed residual r and temporal change information ∆F: where r(x ij , y ij , LST) is the LST residual distributed to the j-th fine pixel; ∆F(LST) is the changes in LSTs of different LST levels (L) at fine resolution from t b to t p . Among them, r(x ij , y ij , LST) can be further described as follows: where m is the number of fine pixels (also named as subpixels) within one coarse pixel; R(x i , y i , LST) is a residual term between the true values and temporal prediction of fine pixels; W(x ij , y ij , LST) is the weight of residual distribution. For a more detailed description of the FSDAF algorithm, we can refer to Zhu et al. [45].

Comparison with Other Methods
For highlighting the proposed framework, three LST prediction approaches were additionally applied in this study to serve as contrasts: (1) the RF strategy; (2) the STARFMbased fusion; and (3) the RF-based LST downscaling. These three methods, respectively, represent the hybrid strategy, spatiotemporal data fusion method, and LST downscaling method. They all have obvious advantages within their respective fields.
Different from the previous RF strategy, in our study, the RF strategy used the RF regression model to downscale the MODIS LSTs at t b and t p to 250-m resolution based on the NDVI, then adopted the STARFM to fuse the downscaled 250-m resolution MODIS LSTs at t b and t p and the 100-m resolution RTU LST product at t b to generate the 100-m resolution Landsat 8-like LST at t p . In terms of the operating step, this method is similar to the three-step method. However, the three-step method adopted the optimal LST predictors to downscale MODIS LSTs at t b and t p to 250-m resolution, and then used the FSDAF to fuse the 100-m resolution RTU LST product at t b to generate the 100-m resolution Landsat 8-like LST at t p . Concerning the STARFM-based fusion method, it used the STARFM to blend the RTU LST product at t b and the re-sampled 100-m resolution MODIS LSTs at t b and t p for generating the 100-m resolution Landsat 8-like LST at t p . Furthermore, the RF-based LST downscaling used the RF regression model to downscale MODIS LST from 1-km resolution to 100-m resolution using the selected five predictors derived from the Landsat 8 PV image and GDEM data. Before all methods were implemented, the RTU LST product at t b was converted to the corresponding MODIS LST equivalent by establishing a simple linear transformation relationship between the MODIS LST product and RTU LST product at a 1-km resolution. This is because the LST images derived from Landsat 8 TIRS and Terra/MODIS differ obviously as a result of the instantaneous time difference in the local solar time and the sensor configuration of wavelength, signal-to-noise ratio, and viewing angles.

Accuracy Assessment
Ideally, to evaluate the performance of the predicted high-resolution LST, in-situ LSTs or actual LST images with the same resolution should be available as a reference. However, in practical cases, obtaining sufficient measured in-situ LSTs is very limited. In view of the higher accuracy of the RTU LST product, we used the RTU LST product at t p as the actual LST to evaluate the newly proposed framework. The root-mean-square error (RMSE), relative RMSE (RRMSE), and correlation coefficient (CC) were used as three evaluation indicators [46]. The RMSE and RRMSE can be used to evaluate the consistency between the predicted LST (LST pre ) and the true RTU LST product (LST true ), and CC can be used to characterize the spatial similarity degree between the predicted LST (LST pre ) and the actual RTU LST product (LST true ). In ideal circumstances, the closer the RMSE is to 0, the closer the predicted value is to the actual value; RRMSE well below 0.5 denotes that the used method is more accurate and reliable; the closer the CC is to 1, and the predicted image texture details are more similar to the actual image details. These three indices are depicted as follows: where n is the number of pixels in the LST image; LST pre is the predicted LST; LST true is the true RTU LST product; LST pre is the mean of the predicted LST image; and LST true is the mean of the RTU LST product.

Selection Analysis of LST Predictors
The optimal selection of LST predictors plays an essential role in performing the spatial downscaling of MODIS LSTs. Taking Image A1 as a case, Table 2 lists the variate importance of fifteen predictors in the RF regression model. The %IncMSE denotes the percentage increase in the mean squared error (MSE), and the IncNode Purity denotes the increase in the tree node purity [18]. For convenient comparison, we also put forward the integrated contribution index (IC) by calculating the average value between %IncMSE and IncNode Purity to represent the synthetic importance of each parameter. It is very evident from Table 2 that, as two important parameters closely correlated with the LST distribution, PV and NDVI showed high ICs, and their values were 14,997.52 and 11,714.01, respectively. This is because the natural surfaces covered by large areas of vegetation play an important role in the regulation of LST by absorbing latent and sensible heat. The elevation impact on LST was also apparent with an IC of 5412.84 since the elevation presents a negative relationship with the LST in terrain with many mountainous landscapes [47]. Meanwhile, the soil dryness degree and soil moisture content impacts on the LST were also crucial, which can be represented through the bare soil index (BSI) and the soil humidity index (NDMI, SAVI). However, we found that variate importance was stronger in the aspect and LULC in terms of the MSE, with values of 1.58 and 2.04, respectively. It seems clear that the aspect in mountainous areas presented a pronounced impact on the solar illumination, and the land-cover type had a higher weighting of heat distribution in various areas. Table 2. Variable importance of each factor in the RF regression for Image A1 (%IncMSE denotes the percentage increase in the mean squared error (MSE), IncNode Purity represents the increase in the tree node purity, and IC is integrated contribution index by calculating the average value between %IncMSE and IncNode Purity). Although the involved predictors have various importance in Table 2, the final selection of LST predictors mainly depends on the interaction of included factors. In the RF regression, removing or replacing some factors involved may change the importance scores because different inter-correlated variables could act as surrogates. Thus, in addition to referring to the %IncMSE, IncNode Purity, and IC values, the LST predictors need to be determined according to the regression fitting goodness of the RF model. Via incessant experiments, this paper selected the PV, elevation, slope, longitude, and latitude as LST predictors in the end. Similar to previous studies [6,39], the selected predictors had good representativeness in performing the LST downscaling and were shown to be key factors affecting LST. This result means that the vegetation biomass has a particularity during the LST downscaling. Meanwhile, the spatial configuration of the terrain and geographic location (i.e., elevation, slope, longitude, and latitude) also played obvious roles in the LST downscaling since they determined the incident solar radiation that was available to heat the surface and the area's exposition to long-wave surface cooling [34].

Factors
However, for further discussing the reliability of the selection of LST predictors, Figure 4a,b also present the impacts of PV, elevation, slope, longitude, and latitude on the LST downscaling model for Image A1. With the RF algorithm, after the regression relationship between MODIS LST and selected predictors at a 1-km resolution was established, via assuming that the standard deviations (STD) of five predictors vary from 0.01 to 0.05, the white noise error with an average value of 0 and an STD of 0.01 to 0.05 was added to each factor. Then, we input the processed five LST predictors into the established RF regression relationship and estimated the LST in sequence. Later on, the LST estimated by corresponding predictors was evaluated using R 2 and RMSE by using the original MODIS LST as actual LST. To improve the modeling accuracy, after the outliers were removed, a total of 2000 pixel points were used to build the RF regression model.
However, for further discussing the reliability of the selection of LST predictors, Figure 4a,b also present the impacts of PV, elevation, slope, longitude, and latitude on the LST downscaling model for Image A1. With the RF algorithm, after the regression relationship between MODIS LST and selected predictors at a 1-km resolution was established, via assuming that the standard deviations (STD) of five predictors vary from 0.01 to 0.05, the white noise error with an average value of 0 and an STD of 0.01 to 0.05 was added to each factor. Then, we input the processed five LST predictors into the established RF regression relationship and estimated the LST in sequence. Later on, the LST estimated by corresponding predictors was evaluated using R 2 and RMSE by using the original MODIS LST as actual LST. To improve the modeling accuracy, after the outliers were removed, a total of 2000 pixel points were used to build the RF regression model.
As shown in Figure 4, the R 2 of five LST predictors presented an overall descending trend from 0.01 to 0.05 for white noise, and the change values were 0.038, 0.032, 0.019, 0.016, and 0.009 for elevation, PV, longitude, latitude, and slope, respectively. Their RMSEs displayed an ascending trend from 0.01 to 0.05 for white noise, and the highest RMSE difference (0.19 K) was found in the elevation. This finding indicates that the RFbased LST downscaling will generate a pronounced fluctuation if any errors exist in PV and elevation. Then, a poor estimation will be found in the subsequent fusion of the LST image. However, it is encouraging for the established RF model that the fitting goodness was acceptable at different noise situations (R 2 > 0.93); additionally, the final LST prediction accuracy was also within the rational range (RMSE < 1.72 K). As shown in Figure 4, the R 2 of five LST predictors presented an overall descending trend from 0.01 to 0.05 for white noise, and the change values were 0.038, 0.032, 0.019, 0.016, and 0.009 for elevation, PV, longitude, latitude, and slope, respectively. Their RMSEs displayed an ascending trend from 0.01 to 0.05 for white noise, and the highest RMSE difference (0.19 K) was found in the elevation. This finding indicates that the RF-based LST downscaling will generate a pronounced fluctuation if any errors exist in PV and elevation. Then, a poor estimation will be found in the subsequent fusion of the LST image. However, it is encouraging for the established RF model that the fitting goodness was acceptable at different noise situations (R 2 > 0.93); additionally, the final LST prediction accuracy was also within the rational range (RMSE < 1.72 K). Figures 5 and 6 display the visual comparisons of the four 100-m resolution LST images predicted by the three-step method and three referenced methods with the actual RTU LST product for Images A1 and B1. Comparing the LST images, all methods can provide the LST with high spatial and temporal resolutions for the two study areas, but the specific details are different. The LSTs predicted by the three-step method (Figures 5b and 6b) more closely resembled the actual RTU LST product, presenting a relatively clear distribution pattern (e.g., the zoomed-in sub-regions). This indicates that blending the high spatial reconstruction from the regression and the daily temporal reconveyance from the image fusion process to predict high spatial and temporal resolution LSTs has obvious advantages.

Accuracy Evaluation of the Framework
Since the STARFM has some limitations in obtaining pure temporal information from the homogeneous pixels, the RF strategy (Figures 5c and 6c) and the STARFM-based fusion method (Figures 5d and 6d) showed unsatisfactory smoothing effects in the desert of Image A1 and the urban area of Image B1. However, the RF strategy was better than the STARFM fusion in the farmland of Image A1 and the forest of Image B1. In contrast, the RF-based LST downscaling (Figures 5e and 6e) exhibited the worst performance, showing many fragmented patches in the desert and cropland of Image A1. At the same time, when the land-cover type transitioned from one feature to another feature, obvious blocky artifacts were observed in both areas. Two possible reasons can explain this phenomenon. First, the RF-based regression process is based on the minimum mean square error. A common manifestation is that high values tend to be underestimated, and low values tend to be overestimated [6]. In addition, this phenomenon is caused by the LST variability in the coarse spatial resolution images, whereas the LST downscaling process of the RF algorithm does not fully consider this problem. In short, the three-step method showed an excellent visual agreement with the actual Landsat 8 LST, which also can be found in the frequency distribution maps of the four predicted LST images. Figure 7a-c further displays the RMSE, RRMSE, and CC values between the actual RTU LST product and the LSTs predicted by the three-step method, RF strategy, STARFMbased fusion, and RF-based downscaling. Low RMSE and high CC values are indicative of LST prediction of satisfactory quality; the optimal method would result in the RMSE equal to 0 and CC value equal to 1. Taking study area A as a case, the mean RMSE decreased from 4.45 K for the RF-based downscaling to 1.89 K for the three-step method, the mean RRMSE decreased from 0.63 K for the RF-based downscaling to 0.28 K for the three-step method, and the mean CC increased from 0.828 for the RF-based downscaling to 0.976 for the three-step method. It is apparent that the RF-based LST downscaling method had the worst performance in study area A, the STARFM-based fusion method was better than the RF-based downscaling, and the three-step method and the RF strategy were better among the four methods. Especially for the three-step method, in Image A1, it had the best performance for three evaluation indicators: RMSE, RRMSE, and CC, of 1.62 K, 0.18 K, and 0.987, respectively. However, for study area B, the mean RMSE decreased from 3.02 K of the STARFM-based fusion to 1.30 K of the three-step method, the mean RRMSE decreased from 1.00 K of the STARFM-based fusion to 0.45 K of the three-step method, and the mean CC increased from 0.516 for the STARFM-based fusion method to 0.921 for the three-step method. It is worth noting that, in study area B, the RF-based downscaling presented higher accuracies than the RF strategy and STARFM fusion method, whereas its accuracy was still lower than the three-step method. One possible interpretation is that the STARFM performs poorly in keeping the texture of the LST image in heterogeneous regions (e.g., the urban area of Beijing) since the STARFM not fully considered land-cover change information. In contrast, the RF-based LST downscaling better retained the auxiliary information of LST predictors by constructing and averaging a large of randomized and de-correlated decision trees. Thus, the MODIS LST downscaling based on the RF regression has been widely used in previous studies.  Figure 7a-c further displays the RMSE, RRMSE, and CC values between the actual RTU LST product and the LSTs predicted by the three-step method, RF strategy, STARFM-based fusion, and RF-based downscaling. Low RMSE and high CC values are indicative of LST prediction of satisfactory quality; the optimal method would result in the RMSE equal to 0 and CC value equal to 1. Taking study area A as a case, the mean RMSE decreased from 4.45 K for the RF-based downscaling to 1.89 K for the three-step method, the mean RRMSE decreased from 0.63 K for the RF-based downscaling to 0.28 K for the three-step method, and the mean CC increased from 0.828 for the RF-based downscaling to 0.976 for the three-step method. It is apparent that the RF-based LST downscaling method had the worst performance in study area A, the STARFM-based fu- From the above visual comparison and statistical analysis of different methods, an apparent superiority of the three-step method to other methods in some instances was observed. This is because the three-step method blended the information of multiple predicting factors in the downscaling of MODIS LST and also better considered the LST heterogeneity in the spatiotemporal fusion of LST. fusion method, whereas its accuracy was still lower than the three-step method. One possible interpretation is that the STARFM performs poorly in keeping the texture of the LST image in heterogeneous regions (e.g., the urban area of Beijing) since the STARFM not fully considered land-cover change information. In contrast, the RF-based LST downscaling better retained the auxiliary information of LST predictors by constructing and averaging a large of randomized and de-correlated decision trees. Thus, the MODIS LST downscaling based on the RF regression has been widely used in previous studies. From the above visual comparison and statistical analysis of different methods, an apparent superiority of the three-step method to other methods in some instances was observed. This is because the three-step method blended the information of multiple predicting factors in the downscaling of MODIS LST and also better considered the LST heterogeneity in the spatiotemporal fusion of LST.

Distribution Error Analysis of Predicted LSTs
Taking the predicted LST images in Images A1 and B1 as cases, Figures 8 and 9 also present the distribution error maps of four predicted LSTs. The distribution error was defined as the absolute value of the spatial difference between the RTU LST product at tp and the predicted LST at tp. This indicator can not only better reveal the distribution status of LST prediction error in the form of images but also reflect the actual size of the LST prediction error. Prior to the analysis, the distribution error of predicted LST was classified into five levels: 0~1 K, 1~2 K, 2~3 K, 3~5 K, and >5 K.
Concerning Image A1, the distribution error of the three-step method was very similar to that of the RF strategy but observably different in level 4 and level 5. The land-cover types corresponding to these two kinds of levels were widely covered by impervious surfaces and sand lands, and they have peculiar thermal conductivity and heat capacity (see rectangle 1 and 2 of Image A1). The STARFM fusion method showed more errors in level 5, whereas it performed better than the RF-based LST downscaling. Regarding Image B1, the three-step method occupied the widest error area in level 1, followed by the RF strategy, STARFM-based fusion, and RF-based LST downscaling. The regions corresponding to level 1 were mainly distributed in mountainous areas, which were covered by trees (see rectangle 1 of Image B1). In addition, we found that the three-step method occupied the

Distribution Error Analysis of Predicted LSTs
Taking the predicted LST images in Images A1 and B1 as cases, Figures 8 and 9 also present the distribution error maps of four predicted LSTs. The distribution error was defined as the absolute value of the spatial difference between the RTU LST product at t p and the predicted LST at t p . This indicator can not only better reveal the distribution status of LST prediction error in the form of images but also reflect the actual size of the LST prediction error. Prior to the analysis, the distribution error of predicted LST was classified into five levels: 0~1 K, 1~2 K, 2~3 K, 3~5 K, and >5 K.   For the sake of comparison, Table 3 shows in-detail statistics including the area percentages of these error levels for each method in Images A1 and B1. In Image A1, more than 43.96% of the study area fell under 1 K for the three-step method and generating the smallest percentage area in level 5, which was approximately 2.18% of this area. The RF strategy displayed a similar performance to the three-step method in level 1, with an error   For the sake of comparison, Table 3 shows in-detail statistics including the area percentages of these error levels for each method in Images A1 and B1. In Image A1, more than 43.96% of the study area fell under 1 K for the three-step method and generating the smallest percentage area in level 5, which was approximately 2.18% of this area. The RF Concerning Image A1, the distribution error of the three-step method was very similar to that of the RF strategy but observably different in level 4 and level 5. The land-cover types corresponding to these two kinds of levels were widely covered by impervious surfaces and sand lands, and they have peculiar thermal conductivity and heat capacity (see rectangle 1 and 2 of Image A1). The STARFM fusion method showed more errors in level 5, whereas it performed better than the RF-based LST downscaling. Regarding Image B1, the three-step method occupied the widest error area in level 1, followed by the RF strategy, STARFMbased fusion, and RF-based LST downscaling. The regions corresponding to level 1 were mainly distributed in mountainous areas, which were covered by trees (see rectangle 1 of Image B1). In addition, we found that the three-step method occupied the least error area in level 5, which was mainly located in the urban area (see rectangle 2 of Image B1).
For the sake of comparison, Table 3 shows in-detail statistics including the area percentages of these error levels for each method in Images A1 and B1. In Image A1, more than 43.96% of the study area fell under 1 K for the three-step method and generating the smallest percentage area in level 5, which was approximately 2.18% of this area. The RF strategy displayed a similar performance to the three-step method in level 1, with an error percentage of more than 39% of the area, whereas it had a larger error level (>3 K) than the three-step method, approximately 13.43% of study area A. In contrast, the STARFM fusion method performed poorly than the first two methods, with fewer area percentages in levels 1 and 2. The RF-based LST downscaling generated the most unsatisfactory result in level 5, displaying the largest error area percentage, approximately 55% of the study area. Similarly, as for Image B1, the three-step method still performs best, followed by the RF strategy, STARFM-based fusion method, and RF-based LST downscaling. For the three-step method, more than 42.30% of study area B fell below 1 K, and a minimum percentage area of 0.30% is generated outside of 5 K. However, due to the surficial property differences, such as the thermal inertia, pyroconductivity, and vegetation evaporation status, the performance of the proposed three-step method varies over land-cover types [48]. Thus, Tables 4 and 5 also discuss the mean distribution errors of predicted LSTs for Images A1 and B1 in five kinds of land-cover backgrounds. The comparisons in Images A1 and B1 show that all predicted LSTs performed poorly in the shrub and bare lands with many higher mean errors, whereas they performed better in the vegetation-covered regions (i.e., cultivated land, forest, and grassland). This indicates that all methods were suitable for the agricultural lands and forests but were not applicable to bare lands, especially for the STARFM-based fusion and RF-based LST downscaling. However, the three-step method still possessed good accuracy than the other methods in all land-cover types, and the RF-based LST downscaling was unsatisfactory in these regions. For instance, in Image A1, the three-step method generated the lowest mean distribution error value (1.84 K) in five land-cover types, and the RF strategy was second, with a mean distribution error of 2.56 K. The rest two kinds of methods were inapparent, but the STARFM-based fusion presented better (with a mean distribution error of 3.31 K) than the RF-based LST downscaling (with a mean distribution error of 6.27 K).

Impacts of MODIS LST Downscaling
In our study, the proposed three-step method combined the advantages of the regressionbased LST downscaling and the FSDAF-based image fusion to generate LSTs with high spatiotemporal resolutions. Thus, the LST predicted by this new method is largely affected by the downscaling of MODIS LSTs at t b and t p . Via implementing the forward fusion and the backward fusion with the data collected in two days in the same year, Figure 10a-d display the impacts of the re-sampled MODIS LSTs and downscaled MODIS LSTs on the predicted LSTs for all images. One input date pair is rewarded as the base time data at t b , while the other is used as the prediction time data at t p . The impact of the re-sampled MODIS LSTs on the predicted LST at t p can be denoted as the FSDAF-based fusion. The impact of the downscaled MODIS LSTs on the predicted LST at t p can be represented with the three-step method. Specifically, taking Figure 10a

Impacts of MODIS LST Downscaling
In our study, the proposed three-step method combined the advantages of the regression-based LST downscaling and the FSDAF-based image fusion to generate LSTs with high spatiotemporal resolutions. Thus, the LST predicted by this new method is largely affected by the downscaling of MODIS LSTs at tb and tp. Via implementing the forward fusion and the backward fusion with the data collected in two days in the same year, Figure 10a-d display the impacts of the re-sampled MODIS LSTs and downscaled MODIS LSTs on the predicted LSTs for all images. One input date pair is rewarded as the base time data at tb, while the other is used as the prediction time data at tp. The impact of the re-sampled MODIS LSTs on the predicted LST at tp can be denoted as the FSDAFbased fusion. The impact of the downscaled MODIS LSTs on the predicted LST at tp can be represented with the three-step method. Specifically, taking Figure 10a   It is very evident from Figure 10a-d that no matter what fusion processes were used to predict the 100-m resolution LST at tp, the three-step method that combines the downscaling of MODIS LST performed the best all the time. This is since the three-step method used the 250-m resolution auxiliary MODIS image as an intermediate resolution to sharpen the MODIS LST from 1 km to 250 m, and then to a 100-m resolution so that the downscaled MODIS LSTs at tb and tp was more accurate than the result of direct re-sampling, maintaining a wealth of LST change information from tb to tp. Thus, it can be seen that, when using the downscaled MODIS LSTs at tb and tp to further fuse the 100-m resolution LST at tp, the performance of the new framework is more pronounced in contrast with the FSDAF-based fusion. Accordingly, the downscaling accuracy of the MODIS LST It is very evident from Figure 10a-d that no matter what fusion processes were used to predict the 100-m resolution LST at t p , the three-step method that combines the downscaling of MODIS LST performed the best all the time. This is since the three-step method used the 250-m resolution auxiliary MODIS image as an intermediate resolution to sharpen the MODIS LST from 1 km to 250 m, and then to a 100-m resolution so that the downscaled MODIS LSTs at t b and t p was more accurate than the result of direct re-sampling, maintaining a wealth of LST change information from t b to t p . Thus, it can be seen that, when using the downscaled MODIS LSTs at t b and t p to further fuse the 100-m resolution LST at t p , the performance of the new framework is more pronounced in contrast with the FSDAF-based fusion. Accordingly, the downscaling accuracy of the MODIS LST lays a critical foundation for the final prediction of the 100-m resolution LST at t p ; the higher its accuracy is, the better the predicted LST image is. In addition, from Figure 10, a prominent finding indicates that, regardless of which methods were applied in predicting the 100-m resolution LST at t p , the more accurate the predicted LST at t p is, the better the fused LST at t p is, especially for the newly proposed method. We found from Figure 10a that LST RMSEs obviously decreased from 1.62 K of the backward fusion to 1.50 K of the forward fusion by using Image A1 and Image A2. Regarding the FSDAF-based fusion method, LST RMSEs decreased from 1.99 K of the backward fusion to 1.75 K of the forward fusion. This finding suggests that the downscaling accuracy of the MODIS LST at t p is more important and plays a crucial role in predicting the 100-m resolution LST at t p .

Advantages and Disadvantages of the Proposed Framework
Similar to the traditional RF strategy, the proposed method is also an RF strategy, and it blended the spatial downscaling process of MODIS LST and the spatiotemporal data fusion process of Landsat 8 LST. However, the traditional RF strategy produced the 30-m resolution LST at t p by downscaling the Landsat LST (~100 m) into high-resolution (~30 m), and then used the STARFM to fuse the MODIS LST time series and the downscaled LST obtained [13,28]. The newly developed framework produced the 100-m resolution LST at t p by downscaling the MODIS LST (~1000 m) into medium resolution (~250 m) and then using the FSDAF to fuse the low-resolution LST time series and the downscaled LST obtained. The main difference between the two methods is the downscaling of MODIS LST and consideration of LST heterogeneity. In addition, for highlighting the importance of downscaling MODIS LST, the new method does not downscale the Landsat LST from 100-m to 30-m resolutions; thus, the three-step method only derived the 100-m resolution Landsat 8-like LST.
On the whole, compared with the previous RF strategies, the three-step method has more evident advantages for predicting high-resolution LST images in regions with strong spatial heterogeneity. First, the three-step method selected the optimal LST predictors by using the importance ranking to perform the MODIS LST downscaling so that reducing the multicollinearity between one and the other variables and improving the velocity of the model building since the redundant variables will substantially increase the complexity and computational cost of the model. Second, to capture more detailed change information of MODIS LSTs from the basic time t b to the prediction time t p , by introducing the 250-m resolution LST predictors to downscale the MODIS LST, the three-step method effectively maintained the accuracy of MODIS LST and inherited the texture information of predictors. Third, by using the FSDAF algorithm, the three-step method can acquire LSTs with more clear textures in heterogeneous landscapes and predict 100-m resolution LST time series using daily 1-km resolution MODIS LST products and 100-m resolution Landsat 8 LST data.
Despite these advantages, the three-step method has several limitations. First, for better capturing the change information of MODIS LSTs from t b to t p , the spatial downscaling of MODIS LSTs needs to take twice at t b and t p . This process will take more time and cause some regression errors. Second, the selection of regression methods plays an essential role in the MODIS LST downscaling, which determines the texture and prediction accuracy of LST. Although the new method used the RF regression to build the non-linear relationship between LST and its predictors, more regression models still need to be considered in the future because the RF regression is limited by the number of samples to a great extent [47]. Third, the proposed three-step method only allows for the clear-sky condition because the thermal infrared signal cannot penetrate through clouds [49][50][51]. If we want to obtain the all-weather LST, it should be essential to remove the cloud effect. Considering that the combination of regression and data fusion has a number of potential applications in generating fine-resolution LST time series, we will propose some more accurate strategies for predicting the high spatiotemporal resolution LSTs in the future. A variety of spatial and temporal fusion models or algorithms could be adopted to enhance the texture characteristic of LST images, and more effective regression methods or machine learning algorithms could be used to improve the accuracy of LST downscaling.

Conclusions
By considering the spatial downscaling of MODIS LST and spatial heterogeneity of LST, this study developed a new framework (i.e., the three-step method) to predict the 100-m resolution Landsat 8-like LST at t p in two areas. Three key points are involved in this study: (1) the optimal selection of LST predictors; (2) the downscaling of MODIS LST; and (3) the implementation of the FSDAF algorithm. These processes can better solve the problems of inaccurate LSTs and unclear image textures and gain more detailed LST distribution features and more accurate LST values than other methods.
The visual comparison of the predicted LSTs derived from four kinds of methods indicates that the three-step method performed better than the other methods over heterogeneous regions, especially for the regions with relatively high LST variation and spatially fragmented landscapes, which obviously removed the blocky effect and blurring effect. With three evaluative indexes, our results presented similar results to the visual comparison, and the three-step method had the best accuracy: its RMSEs varied from 0.85 K of Image B4 to 2.29 K of Image A4, and RRMSEs varied from 0.18 K of Image A1 to 0.69 K of Image B2. Additionally, the distribution error analysis indicated that the three-step method minimized the predicted LST errors at five levels and five kinds of land-cover types, especially at bare land, with the minimum average distribution error (3.30 K of Image A1 and 1.38 K of Image B1, respectively). However, behind the use of the three-step method, there are still some limitations, such as the uncertainty of the LST downscaling model and the impacts of MODIS LST downscaling on LST prediction. As a result, to develop some more accurate LST prediction methods, more statistical regression models, spatiotemporal data fusion algorithms, and study areas need to be considered in the future.