Spatial Downscaling of Land Surface Temperature over Heterogeneous Regions Using Random Forest Regression Considering Spatial Features

: Land surface temperature (LST) is one of the crucial parameters in the physical processes of the Earth. Acquiring LST images with high spatial and temporal resolutions is currently difﬁcult because of the technical restriction of satellite thermal infrared sensors. Downscaling LST from coarse to ﬁne spatial resolution is an effective means to alleviate this problem. A spatial random forest downscaling LST method (SRFD) was proposed in this study. Abundant predictor variables— including land surface reﬂection data, remote sensing spectral indexes, terrain factors, and land cover type data—were considered and applied for feature selection in SRFD. Moreover, the shortcoming of only focusing on information from point-to-point in previous statistics-based downscaling methods was supplemented by adding the spatial feature of LST. SRFD was applied to three different heterogeneous regions and compared with the results from three classical or excellent methods, including thermal image sharpening algorithm, multifactor geographically weighted regression, and random forest downscaling method. Results show that SRFD outperforms other methods in vision and statistics due to the beneﬁts from the supplement of the LST spatial feature. Speciﬁcally, compared with RFD, the second-best method, the downscaling results of SRFD are 10% to 24% lower in root-mean-square error, 5% to 20% higher in the coefﬁcient of determination, 11% to 25% lower in mean absolute error, and 4% to 17% higher in structural similarity index measure. Hence, we conclude that SRFD will be a promising LST downscaling method.


Introduction
Land surface temperature (LST) is one of the crucial parameters in the physical process of surface energy and water balance from local to global scales, and accurate LST is most important for studies, such as climate change, soil moisture condition, and environmental monitoring [1][2][3][4].
With the development of satellite thermal infrared sensors, obtaining thermal infrared (TIR) images to retrieve LST images in the regional and global scales with various spatial and temporal resolutions has already been achieved [3]. However, the existing satellite thermal infrared sensors could not provide high spatiotemporal TIR images because of the trade-off between scanning swatch and pixel size [5]. For instance, the moderate resolution imaging spectroradiometer (MODIS) of the National Aeronautics and Space Administration (NASA) onboard the Terra/Aqua satellites could provide TIR images with a spatial resolution of 1 km and a high temporal resolution of daily [6,7]. Meanwhile, the advanced spaceborne thermal emission and reflection radiometer onboard the Terra satellite could provide TIR images with a high spatial resolution of 90 m but with a temporal resolution of 16 days [8,9]. In addition, the visible and infrared radiometer and medium resolution spectral imager of the Chinese National Meteorological Satellite Center onboard Fengyun-3 (A/B/C/D) could provide different spatial (1 km, 250 m) and temporal resolutions (daily, 5.5 days) [10][11][12][13][14].
The above trade-offs can be alleviated by LST downscaling [9,15]. The LST downscaling methods mainly fall into two categories, including fusion-based methods and kernel-driven methods [16]. The fusion-based methods usually predict LST images with a high spatiotemporal resolution by integrating the temporal information from coarseresolution images and the spatial information from fine-resolution images [15]. The most widely used fusion-based method is the spatial and temporal adaptive reflectance fusion model (STARFM) [17] and its enhanced version (ESTARFM) [18,19]. The STARFM was initially developed to produce surface reflection at the Landsat spatial resolution of 30 m, daily, by integrating daily information from MODIS. The STARFM was directly applied to generate daily fine-resolution LST images in most previous studies [20,21]. However, they did not perform satisfactorily over heterogeneous areas because of the difference between surface reflection and LST. Subsequently, Weng et al. [22] further improved precedents by considering urban thermal information and landscape heterogeneity to obtain the LST images with a high spatiotemporal resolution. In addition, most fusion-based methods can be divided into four groups in Wu's review [18]: (1) weighted function-based methods [17,19,22]; (2) unmixing-based methods [23][24][25]; (3) hybrid methods [26][27][28]; and (4) learning-based methods [29,30].
In contrast to the fusion-based methods, kernel-driven methods, which are an effective means to enhance the spatial resolution of LST images by associating them with fine-resolution auxiliary data, such as shorter wavebands and terrain data [9,16]. The kernel-driven methods are largely divided into the following two categories: physical and statistical methods [31]. The physics-based methods establish a physically meaningful function by combining thermal radiance (or LST) and ancillary data (land cover type and shorter waveband data) to achieve downscaling, such as pixel block intensity modulation [32] and emissivity modulation method [33]. The statistics-based methods establish a statistical relationship between LST and predictor variable data (shorter wavebands, terrain data, and biophysical parameters), then apply the relationship to relatively fine-resolution predictor variable data to obtain the LST images with a higher spatial resolution. The statistics-based methods have been commonly accepted because of their satisfactory accuracy and simple procedure [34][35][36][37][38][39][40]. In addition, this study will make improvements to the previous statistics-based methods.
The classical statistics-based downscaling LST methods include disaggregation procedure for radiometric surface temperature (DisTrad) [35], which regards normalized difference vegetation index (NDVI) as a scale predictor variable, and thermal image sharpening (TsHARP) [34], which replaces NDVI with fractional vegetation cover. However, the performance of the above downscaling LST methods is only ideal in study areas with adequate vegetation cover, which is unsuitable to urban and arid regions because of the limitation of a single predictor variable [38,41]. Therefore, some other important predictor variables that could represent land surface conditions were gradually considered to establish the statistical relationship to express LST. For example, Dominguez et al. [36] added surface albedo data to TsHARP to develop the high-resolution urban thermal sharpener to downscale LST in urban areas. Wang et al. [39] used normalized difference built-up index, normalized difference water index, and other remote sensing spectral indexes as predictor variables to establish a regression model to downscale LST in urban areas.
The LST is jointly affected by numerous predictor variables, and the numerical relationship between them is complicated because of the coupling effects [42]. Capturing the complicated relationship using the traditional linear and nonlinear models-such as ordinary least squares (OLS) linear regression and logarithm models-is difficult with the increase in the number of predictor variables. However, machine learning algorithmssuch as random forest (RF), artificial neural networks (ANN), and support vector machine (SVM)-could be competent for representing multivariable nonlinearity [40,[43][44][45][46][47]. Hutengs et al. [40] first applied RF to achieve downscaling LST, and their downscaling Remote Sens. 2021, 13, 3645 3 of 25 results are better than those of TsHARP in statistical accuracy and vision enhancement. Li et al. [48] downscaled LST using various machine learning methods, including ANN, SVM, and RF, and then performed estimation between them. Their results indicated that the downscaling of RF is better than others.
To improve the accuracy of downscaled LST, the focus of statistics-based LST downscaling methods has mainly undergone two stages, including adequate predictor variables and excellent regression models [34,38,40,49]. However, most excellent models still have some room for improvement. On the one hand, the importance of feature selection increases with the number of predictor variables because of multicollinearity. Although some machine learning models are less sensitive to multicollinearity than traditional regression models. Nevertheless, the model will be complicated, lack stability, and demonstrate a long training time because of the redundant variables. Ebrahimy et al. [47] focused on the aforementioned problem and used the SVM recursive feature elimination to select the most important predictor variables. Next, the selected predictor variables were inputted to the machine learning model-including RF, SVM, and extreme learning machine-to fit the downscaling LST model. Compared with the model using unselected predictor variables, the model utilizing selected predictor variables has improved stability and high overall effectiveness of the LST downscaling procedure [47]. On the other hand, the most optimal machine learning method had been screened by extensive comparison and evaluation by researchers. Nevertheless, the numerical relationship described by the machine learning model still remains point-to-point [50], which neglects the geographical correlation of LST. Li et al. [50] and Wei et al. [51] estimated the PM 2.5 with high accuracy successfully by incorporating the geospatial information to the deep belief network (DBN) and random forest model, respectively. The LST is similar to the environmental parameters such as PM 2.5 , which vary in space and show the geographical correlation relationship [15]. Specifically, the adjacent pixels of every pixel can provide information related to it in the LST image, and that information is related to the spatial pattern and varies in space. Hence, that information is called the spatial feature of LST, and incorporating the spatial feature into the machine learning model to improve the performance of downscaling is crucial.
Consequently, this study aims to develop an LST downscaling method based on RF considering the spatial feature of LST, which is called the spatial random forest downscaling method (SRFD). Notably, the model features were objectively selected from original abundant predictor variables. The SRFD was applied to three different heterogeneous regions and compared with the results from three classical or excellent methods-including TsHARP [34], multi-factor geographically weighted regression (MFGWR) [38], and random forest downscaling method (RFD) [40]-to evaluate its performance.

Study Area
To objectively evaluate the performance and applicability of the SRFD, three areas were chosen as study areas. The land cover types are multiple in each study area and the climates, land cover types, and terrain among study areas are diverse. As shown in Figure 1, the study area is located in Wuhan, Shanghai, and Chengde, China. Wuhan is located in the south of China. The elevation of Wuhan ranges from 19.2 m to 873.7 m, with most ranges found below 50 m. Wuhan also has a humid subtropical climate and has abundant rainfall and heat all year round. The total annual precipitation is approximately 1320 mm, and the annual mean temperature is 17.13 • C. The dominant land cover types of the study area of Wuhan are impervious surfaces (41%), croplands (31%), and water comprising lakes and rivers (18%). Shanghai is located in the eastern part of China. Shanghai is part of the alluvial plain of the Yangtze River Delta region, with an average elevation of around 2.19 m and a maximum elevation of 103.7 m. Shanghai has a humid subtropical climate and has sufficient sunshine and rainfall all year round. The total annual precipitation is approximately 1178 mm, and the annual mean temperature is 15.8 • C. The dominant land cover types of the study area of Shanghai include impervious surfaces (68%) and Remote Sens. 2021, 13, 3645 4 of 25 croplands (28%). Chengde is located in Northeast China, and its elevation is mostly at 1200 to 2000 m. Chengde has a four-season, monsoon-influenced humid continental climate. The total annual precipitation of Chengde is approximately 504 mm, and the annual mean temperature is 8.93 • C. The dominant land cover types of the study area of Chengde study are forests (54%), grasslands (24%), and croplands (19%).

Data and Data Preprocessing
Landsat 8 of NASA was launched on 11 February 2013, and its satellite payload comprises two science instruments, including the operational land imager and the thermal infrared sensor, which could provide seasonal coverage of the global landmass with a Remote Sens. 2021, 13, 3645 5 of 25 spatial resolution of 30 m (visible, near-infrared, shortwave-infrared), 100 m (TIR), and 15 m (panchromatic) [52]. The L1TP products were used in this study to provide TIR data for the retrieval of the LST, and the L2SP products [53] were used to provide land surface reflectance data to calculate multiple remote sensing spectral indexes. The basic information of the Landsat 8 data used is shown in Table 1. Although Landsat 8 has two TIR channels, including bands 10 (10.60-11.19 µm) and 11 (11.50-12.51 µm), band 11 has the most serious radiometric calibration error due to stray light effect [54,55]. Hence, only band 10 was used in this study to retrieve the original LST with a spatial resolution of 100 m using the radiative transfer equation [56] as follows: where T s is the LST; B −1 represents the inverse plank function; L TOA,10 is the radiance measured at the top of the atmosphere in band 10; L up,10 and L down,10 are the thermal path atmospheric upwelling and downwelling radiance of band 10, respectively; ε 10 is the land surface emissivity of band 10; τ 10 is the total atmospheric transmittance along the target to sensor path in band 10. L up,10 , L down,10 , and τ 10 were obtained from the atmospheric correction parameter calculator of NASA [57,58]. The ε 10 was calculated by the NDVI threshold method in terms of Equations (2) to (4): where ε λ is the land surface emissivity; ε vλ and ε sλ are the emissivity of vegetation and soil, respectively; C λ represents the surface roughness, which can be calculated by Equation (3); P v is the fractional vegetation cover, which can be calculated by Equation (4).
where F represents a geometrical factor; NDV I v and NDV I s represent the NDVI of vegetation and soil, respectively. References [59,60] provide additional information and specific parameters of the NDVI threshold method. Shuttle Radar Topography Mission (SRTM), a kind of digital elevation model (DEM), is the first near-global dataset of land elevations with an accuracy of 16 m [61]. The three arcsecond SRTM data with a spatial resolution of approximately 30 m were used in this study to provide DEM and calculate the terrain factors, including slope and aspect. In addition, original SRTM data were reprojected to the same Universal Transverse Mercator (UTM) projection as Landsat 8 data, and rigorous geographic registration and raster alignment were performed.
Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC), the first global land cover map with a spatial resolution of 30 m, was produced using Landsat Thematic Mapper and Enhanced Thematic Mapper Plus data [62]. The producer of FROM-GLC aimed to develop a multiple-stage method to map global land covers to the results to address the demands of land process modeling effectively and facilitate easy integration with existing land cover classification schemes [62]. Land cover data were used in this study to express a further stratification of the relationships between LST and predictor variables across different land cover types [40]. The FROM-GLC data were also reprojected to the same UTM projection as Landsat 8 data, and rigorous geographic registration and raster alignment were performed.
Using images acquired from different sensors will introduce sensor system errors caused by the difference in acquisition time, orbit gesture, and spectral response functions. Therefore, the aggregated LST images with a spatial resolution of 500 m were used as the coarse LST image to reduce the extra uncertainties for establishing the model and evaluating the proposed method [38,63]. Simultaneously, the land surface reflectance, terrain, remote sensing spectral indexes, and land cover data were aggregated to the spatial resolutions of 100 and 500 m as the predictor variable dataset with coarse and fine spatial resolutions, respectively.

Random Forest Regression
Random forest regression (RFR), an excellent ensemble machine learning model, provides reliable estimation using a substantial number of decorrelated and random decision trees [64,65]. The RFR is generally employed to solve the retrieval problem of land surface and atmosphere parameters and downscaling due to the strong generalization capability and the insensitivity of multicollinearity [51,66,67]. The bootstrap method is used during the training stage of the model to sample the original dataset randomly. For instance, k rounds and m times of sampling, which randomly select n features {X 1 , X 2 . . . X n } in every round, are taken. A k new dataset with m samples and n features is obtained after sampling, and k decision trees {h 1 , h 2 . . . h k } are trained based on every dataset. Finally, the final output is calculated by taking the average from the prediction of all decision trees, which can be represented as where Y is the final output of RFR, k is the number of decision trees, h i represents the ith decision tree, X n represents random features, and h i (X n ) represents the estimation of the ith decision tree. The RFR was used in this study as the basic method for downscaling LST. In addition, the nested five-fold cross-validation procedure [68] during the model training stage was performed to avoid model overfitting and optimize hyperparameters of the model.

Full Name Formula Reference
Bare soil index (BI) Modified soil adjusted vegetation index (MSAVI) Urban index (UI) U I = SW IR2−N IR SW IR2+N IR [79] As mentioned above, the candidate predictor variables are up to 22. However, only some predictor variables are significantly correlated with LST. The Pearson correlation coefficient (hereafter P) between the candidate predictor variables and the LST of every image was calculated to remove unnecessary predictor variables. The two can generally be regarded as weakly correlated when the P is less than 0.2. Hence, the predictor variables, in which P is less than 0.2, were first removed. In addition, RFR slightly suffers from the influence of the multicollinearity among variables compared with the traditional regression methods, such as the OLS and the geographically weighted regression (GWR) [80]. However, the redundant variables will substantially increase the complexity and computational cost of the model [47]. Hence, the variance inflation factor (VIF) shown in Equation (6), which can indicate the multicollinearity between one and the other variables, was used as the indicator to further select predictor variables. The obtained VIFs of the predictor variables after preliminary selection using the P were calculated. The predictor variable with maximal VIF was then removed. This process was repeatedly performed until all the VIFs of the predictor variables are less than 10 [81][82][83].
i is the coefficient of determination of the regression equation, wherein dependent variable is the ith variable and independent variables are other variables.

Spatial Random Forest LST Downscaling Method
SRFD is an LST downscaling method based on RFR considering the spatial feature of LST. The spatial feature of LST is composed of the LST information weighted by the distance of adjacent pixels from a center pixel. That information is related to the spatial pattern and varies in space. For a pixel, the information from nearer areas is more relevant than further ones, and the spatial feature of the LST can be expressed as where S LST,j represents the jth pixel's spatial feature of the LST image, i represents the adjacent pixels around the jth pixel, and ds is the spatial distance among pixels. The square windows are used in the practical calculation to represent the adjacent area of the objective pixel. After supplementing the spatial feature of the LST, the statistical relationship between the LST and predictor variables at a coarse spatial resolution can be expressed as where f represents a nonlinear function, ρ i are the land surface reflectance data, S i are the remote sensing spectral index data, S LST is the spatial feature of LST, and the subscript c represents the images with a coarse spatial resolution.
Owing to the complex origin of LST and the limited fitting capability of the model, a residual LST between the original and estimated LST can be expressed as where ∆LST c is the residual LST with a coarse spatial resolution, LST o is the original LST, and LST e is the LST estimated by the RFR model. Assuming that a sole statistical relationship exists in different sensor scenes, the predictor variables with a fine spatial resolution are applied into the trained RFR model, and the residual correction is performed, the final downscaled LST can be expressed as where the subscript f represents the images with a fine spatial resolution, and the ∆LST f is interpolated from the ∆LST c . The overall workflow of SRF presented in Figure 2 could be divided into five steps as shown below.

1.
Obtaining the trained RFR model (Model RFR ) using the LST and predictor variable dataset at a coarse spatial resolution, excluding the spatial feature of the LST (S LST,c ).

2.
Obtaining the downscaled LST image with a fine spatial resolution (LST RFR, f ) by applying the predictor variable dataset with a fine spatial resolution to the Model RFR and performing the residual correction. 3.
Obtaining the spatial feature of LST with a fine spatial resolution (S LST, f ) by Equation (7) based on LST RFR, f . 4.
Obtaining the trained spatial RFR (Model SRFR ) using the LST and predictor variable dataset at a coarse spatial resolution, including the S LST,c . 5.
Obtaining the final downscaled LST image by applying the predictor variable dataset with a fine spatial resolution to the Model SRFR and performing the residual correction.

Validation Methods
The downscaling results in this study were compared to three statistical downscaling methods, including a classical single factor method (TsHARP) [34], a multi-factor GWR method (MFGWR) [38], and an excellent machine learning method (RFD) [40], to evaluate the LST downscaling performance of the SRFD extensively. Notably, the TsHARP requires a study region without any water area. Hence, the MNDWI was used to build a water mask, in which the threshold was automatically obtained by the OTSU [84]. The MFGWR is a method based on GWR, which builds a local regression equation for every factor. If the factors include the classified data, such as land cover types, then the risk of local multicollinearity will rise due to a local spatial clustering phenomenon. Hence, if the predictor variable dataset after feature selection includes land cover types data, then the land cover type data will be removed when applied to the MFGWR.

Validation Methods
The downscaling results in this study were compared to three statistical downscaling methods, including a classical single factor method (TsHARP) [34], a multi-factor GWR method (MFGWR) [38], and an excellent machine learning method (RFD) [40], to evaluate the LST downscaling performance of the SRFD extensively. Notably, the TsHARP requires a study region without any water area. Hence, the MNDWI was used to build a water mask, in which the threshold was automatically obtained by the OTSU [84]. The MFGWR is a method based on GWR, which builds a local regression equation for every factor. If the factors include the classified data, such as land cover types, then the risk of local multicollinearity will rise due to a local spatial clustering phenomenon. Hence, if the predictor variable dataset after feature selection includes land cover types data, then the land cover type data will be removed when applied to the MFGWR. In addition to qualitative visual evaluation of the downscaled LST images, three common statistical indicators-including the root-mean-square error (RMSE), the coefficient of determination (R 2 ), and the mean absolute error (MAE)-were used to quantitative evaluate the downscaling results in this study. In addition, an image evaluation indicatornamely, structural similarity index measure (SSIM)-was also used to evaluate downscaled LST images in vision quantitatively. SSIM is a perception-based model; compared with RMSE, SSIM can indicate the sensory similarity between images by considering the texture of the images [85]. The SSIM can be calculated by where µ D and µ R are the downscaled and referenced LST images, respectively; σ DR is the covariance between two images; σ 2 D and σ 2 R are the variance of the downscaled and referenced LST images, respectively; c 1 and c 2 are two variables to stabilize the division with weak denominator, which can be respectively calculated by Equations (12) and (13) as where k 1 and k 2 are 0.01 and 0.03, respectively, and L is the dynamic range of the pixel values. The structure of the two images is similar when SSIM is close to 1.

Determination of the Window Size for Calculating Spatial Feature of the LST
In the SRFD, Equation (7) was used to express the spatial feature of a certain pixel in one LST image, which comprises adjacent pixel information. A square window was used in the practical calculation to represent the adjacent area of the target pixel. SRFD includes two main steps: model training using coarse-resolution datasets and downscaling using fineresolution predictor variable datasets. Hence, the calculation of spatial features involves two images of LST at a coarse or fine resolution. Sufficient experiments were performed considering the window size in the spatial feature calculation of the two resolution LST images to select the most reasonable window size.

Window Size of Coarse-Resolution LST Image
The S LST,c was applied in the training Model SRFR stage. Therefore, the external crossvalidation RMSEs of the model were used to evaluate the accuracy of the model with different window sizes. Figure 3 shows that the RMSEs of different images are the smallest when the window size is three, while those of almost all images increase with the window size. The first law of geography [86] can explain this phenomenon: the adjacent area of the target pixel is extended as the window increases, and the spatial distance between margin pixels in the adjacent area and the target pixel also rises. The information provided by adjacent pixels is no longer highly correlated with the target pixel, which is even interferential information. Therefore, the window size for calculating the spatial feature of LST is three during the Model SRFR training phase in this study. . Random forest regression and spatial random forest regression models with different window sizes for spatial feature calculation of the LST. SRFR represents spatial random forest regression, RFR represents random forest regression, and the number in brackets represents the window size.

Window Size of Fine-Resolution LST Image
The , in the SRDF is a predictor variable used to downscale the LST image. Therefore, 15 window sizes from 3 to 31 with an interval of 2 were tested, and the RMSEs and SSIMs of the downscaled LST images were simultaneously calculated. The RMSE and SSIM respectively indicate the deviation and the sensory similarity between the referenced and downscaled LST images. Therefore, a window size with low RMSE and high SSIM is expected. However, Figure 4 shows that the RMSEs and SSIMs demonstrate a Figure 3. Random forest regression and spatial random forest regression models with different window sizes for spatial feature calculation of the LST. SRFR represents spatial random forest regression, RFR represents random forest regression, and the number in brackets represents the window size.
In addition, the Model RFR was compared to Model SRFR in terms of the external crossvalidation RMSEs. Figure 3 shows that all the external RMSEs of Model SRFR are signifi-cantly lower than those of Model RFR . Thus, the spatial feature established by adjacent pixel information is beneficial to the construction of the nonlinear relationship between the LST and predictor variables because of the geographical correlation of the LST.

Window Size of Fine-Resolution LST Image
The S LST, f in the SRDF is a predictor variable used to downscale the LST image. Therefore, 15 window sizes from 3 to 31 with an interval of 2 were tested, and the RMSEs and SSIMs of the downscaled LST images were simultaneously calculated. The RMSE and SSIM respectively indicate the deviation and the sensory similarity between the referenced and downscaled LST images. Therefore, a window size with low RMSE and high SSIM is expected. However, Figure 4 shows that the RMSEs and SSIMs demonstrate a trade-off relationship in most images, wherein SSIM decreases with the RMSE. Particularly, the SSIMs tend to increase first and then decrease as the window size rises in the three images of Wuhan (Figure 4a-c). The coarse-resolution images are aggregated from the fine-resolution images. The same information provided by the adjacent area in the fine-resolution images requires a wider window than the images with a coarse spatial resolution because Wuhan is a highly heterogeneous area with lakes, impervious surface, and cropland. In addition, we obtained the S LST,c images by aggregating S LST, f images calculated by different window sizes. Then the RMSEs between aggregated S LST,c images and the true S LST,c images were calculated. Figure 5 shows that RMSEs decrease as the window sizes increase and level off when the window size is around 15. These phenomena can be explained as follows, the window size of fine-resolution images needs to be greater than that of coarseresolution images to provide consistent information, and the information provided by S LST, f is similar to S LST,c when the calculated window size is 15. Simultaneously, Figure 4 shows that almost all images have a minimum RMSE at a window size of around 15 but the SSIMs are continuously decreasing. The reason for the continuous decrease of SSIMs is that as the window increases, the S LST, f images become smoother, eventually causing downscaled images smoothing. Moreover, the computation memory and time cost in calculating the S LST, f will increase significantly as the window size rises. Hence, the window size for calculating the spatial feature of LST is 15 during downscaling LST phase to consider the statistical accuracy and visual effect of SRFD downscaled images and save the calculation cost.

Visual Evaluation
Images W1, S1, and C1 are selected in this study to compare the visual downscaling performance of different algorithms. Notably, the TsHARP does not apply to the water region. However, the water body region of the TsHARP downscaling results is filled by the water area of the referenced LST image for convenient comparison. Figure 6 shows that the high LST area is the built-up areas, the medium LST area is the croplands or forests, and the low LST area is the rivers or lakes. Compared with the coarse-resolution LST image, all downscaling results demonstrate additional spatial details. Compared with the referenced LST image, the TsHARP and SRFD downscaled LST images are most similar to the referenced image in terms of spatial distribution, whereas the MFGWR downscaled LST image is vague. Meanwhile, the RFD downscaled LST image distributes discontinuously in the built-up areas and croplands and has an overall excessive amount of details in the vision. The subset images reveal that all images have a significant underestimation in the built-up areas at high LST because the masked extremes of the LST during image aggregation led to the smoothed coarse-resolution dataset for modeling [40,87]. The TsHARP downscaled LST image has some boxy artifacts in the built-up areas and the areas around the water; that is, some areas still have grid characteristics of the coarse-resolution image, which could be due to the dependence of the variation of TsHARP downscaled LST image on the introduction of the coarse-scale residuals into fine-scale images [34]. A serious smoothing effect is observed in the MFGWR downscaled LST image because the regression and coefficient interpolation processes are based on the minimum mean square error (MMSE), and the MMSE-based method easily causes underestimation and overestimation of high and low values, respectively; thus, the ultimate predicted values have a smoothing characteristic [37]. Most details are found in the RFD downscaled LST image. Nevertheless, the transition of RFD downscaled LST is also insufficiently natural in the highly heterogeneous areas where built-up areas and croplands are mixed, and many noise-like image elements emerge, which are inconsistent with the natural surface distribution. RFR can effectively capture multivariate nonlinear relationships. However, the limitation of a point-to-point relationship between predictor variables and LST easily causes unnatural spatial distribution. The spatial distribution of the SRFD downscaled LST image is most similar to the referenced image, with a natural transition of LST from built-up areas to croplands due to the supplement by the S LST . Thus, the downscaling results are consistent with the natural ground surface.
As shown in Figures 7 and 8, the downscaling results of the different methods in the Shanghai and Chengde regions demonstrate the same phenomenon as in Wuhan. Notably, Figure 8f shows that some serious boxy artifacts are distributed over the TsHARP downscaled LST image because the imaging time of the C1 is in spring when the vegetation is exiguous. The drawback of the TsHARP lies in its unsuitability for areas with exiguous vegetation. By contrast, other LST downscaling methods can accomplish the downscaling task to varying degrees despite the exiguous vegetation because they consider multiple predictor variables. Figure 8g,k show that the downscaled LST image of SRFD is remarkably similar to that of MFGWR considering the overall trends in spatial distribution but has rich spatial details. The MFGWR considers the spatial heterogeneous relationships between predictor variables and LST and can produce locally optimal results. However, the spatial distribution of the downscaled LST image demonstrates an excessive smoothing effect. SRFD ignores the spatial heterogeneity of the relationship between predictor variables and LST. However, the complement of spatial features of LST can also reflect the spatial heterogeneity of LST, resulting in smooth LST transitions on different feature boundaries on the downscaled image. Figure 8j shows that spatial patterns, such as salt-and-pepper noise, are observed in the RFD downscaled LST image because the RFD ignores the S LST comprising the adjacent LST information, as previously mentioned.

Quantitative Evaluation
The quantitative indicators for all downscaled LST images with different methods are shown in Table 3. The overall RMSEs range from 1.43 K to 2.76 K for TsHARP, from 1.4 K to 2.49 K for MFGWR, from 1.23 K to 2.07 K for RFD, and from 0.94 K to 1.61 K for SRFD. By contrast, SRFD shows the best performance on all images, that is, the smallest RMSE and MAE and the largest R 2 and SSIM. RFD also demonstrates satisfactory performance compared with TsHARP and MFGWR. Notably, the RMSE, R 2 , and MAE of MFGWR are similar to RFD for some images, such as W1, S1, and S2. However, the SSIM of MFGWR is the smallest for all images due to the excessive smoothing effect of the downscaled LST images. The downscaling results of SRFD are 10% to 24% lower in RMSE, 5% to 20% higher in R 2 , 11% to 25% lower in MAE, and 4% to 17% higher in SSIM compared with those of RFD. This finding indicates that the downscaling results of SRFD, which consider the S LST , are enhanced considering statistical accuracy and visual information compared with RFD.
In addition, the R 2 of the downscaling results of all methods for the three images in the Shanghai study area, such as S1, is unsatisfactory. R 2 is only 0.65 despite the best accuracy of SRFD. Two reasons could explain this phenomenon. First is the lack of prediction capability of the model for extreme values because the training samples of the model are smoothed. Furthermore, the geometry and adjacency effects in the thermal radiative transfer process result in the generally poor accuracy of the retrieved LST because Shanghai is a megacity with many high-rise buildings [88,89].

Quantitative Evaluation
The quantitative indicators for all downscaled LST images with different methods are shown in Table 3. The overall RMSEs range from 1.43 K to 2.76 K for TsHARP, from 1.4 K to 2.49 K for MFGWR, from 1.23 K to 2.07 K for RFD, and from 0.94 K to 1.61 K for SRFD. By contrast, SRFD shows the best performance on all images, that is, the smallest RMSE and MAE and the largest R 2 and SSIM. RFD also demonstrates satisfactory performance compared with TsHARP and MFGWR. Notably, the RMSE, R 2 , and MAE of MFGWR are similar to RFD for some images, such as W1, S1, and S2. However, the SSIM of MFGWR is the smallest for all images due to the excessive smoothing effect of the downscaled LST images. The downscaling results of SRFD are 10% to 24% lower in RMSE, 5% to 20% higher in R 2 , 11% to 25% lower in MAE, and 4% to 17% higher in SSIM compared with those of RFD. This finding indicates that the downscaling results of SRFD, which consider the , are enhanced considering statistical accuracy and visual information compared with RFD.

Error Characteristics of the SRFD
The error characteristics and main error sources of SRFD are further analyzed in this section. Figure 9 intuitively shows that the errors of SRFD have the largest probability density around zero on all images. In addition, the error probability density curve of SRFD is slightly biased to the left of the error zero reference line, that is, an overall underestimation of SRFD is observed. Figure 10 shows that the range of the quartile errors of SRFD is centrally distributed around zero. Simultaneously, the median line of the error box plot and the labeled upper and lower quartile values also indicate the underestimation of SRFD. The TsHARP and RFD also have significant underestimation, while MFGWR has no significant systematic bias. These findings illustrate that the smoothing characteristics of the modeled dataset cause substantially larger underestimation than overestimation effects on the downscaling results of the TsHARP, RFD, and SRFD methods.

Error Sources of the SRFD
The errors in the downscaling process mainly come from the assumption of constant statistical relationship scales and errors in the predictor variables. No temporal and sensor spectral differences are found between the native predictor images and the LST due to the aggregation-disaggregation strategy used in this study. As the most important predictor in SRFD, the S LST, f is calculated from the fine-resolution LST downscaling by an RFD that does not consider S LST . The RFD errors in this process are introduced into the calculation of S LST, f . The S LST, f is calculated based on referenced fine-resolution LST images and used for downscaling and accuracy evaluation to examine the influence of this process on the final accuracy of SRFD, and the downscaling results are abbreviated as SRFD-R. Table 4 shows that the indicators of SRFD-R downscaling results are significantly improved over that of SRFD, in which RMSEs are reduced by 9% to 18%, R 2 s are increased by 2% to 18%, MAEs are reduced by 10% to 20%, and SSIMs are increased by 4% to 20%. This finding illustrates that S LST, f is an important source of error in SRFD. Therefore, a relatively accurate preliminary downscaling result is crucial for SRFD. tion of SRFD is observed. Figure 10 shows that the range of the quartile errors of SRFD is centrally distributed around zero. Simultaneously, the median line of the error box plot and the labeled upper and lower quartile values also indicate the underestimation of SRFD. The TsHARP and RFD also have significant underestimation, while MFGWR has no significant systematic bias. These findings illustrate that the smoothing characteristics of the modeled dataset cause substantially larger underestimation than overestimation effects on the downscaling results of the TsHARP, RFD, and SRFD methods. Figure 9. Error probability density plots of downscaling results for different methods. The labels (a-i) represent the image IDs of W1, W2, W3, S1, S2, S3, C1, C2, and C3. Figure 9. Error probability density plots of downscaling results for different methods. The labels (a-i) represent the image IDs of W1, W2, W3, S1, S2, S3, C1, C2, and C3.

Error Sources of the SRFD
The errors in the downscaling process mainly come from the assumption of constant statistical relationship scales and errors in the predictor variables. No temporal and sensor spectral differences are found between the native predictor images and the LST due to the aggregation-disaggregation strategy used in this study. As the most important predictor in SRFD, the , is calculated from the fine-resolution LST downscaling by an RFD that does not consider . The RFD errors in this process are introduced into the calculation of , . The , is calculated based on referenced fine-resolution LST images and used for downscaling and accuracy evaluation to examine the influence of this process on the final accuracy of SRFD, and the downscaling results are abbreviated as SRFD-R. Table 4 shows that the indicators of SRFD-R downscaling results are significantly improved over that of SRFD, in which RMSEs are reduced by 9% to 18%, R 2 s are increased by 2% to 18%, MAEs are reduced by 10% to 20%, and SSIMs are increased by 4% to 20%. This finding illustrates that , is an important source of error in SRFD. Therefore, a relatively accurate preliminary downscaling result is crucial for SRFD. In addition, the method of obtaining fine-resolution residuals is also a key error source for downscaled LST images. Only three common image interpolation methodsincluding nearest neighbor, bilinear, cubic spline interpolation-were attempted in this study to evaluate the general performance of the proposed method and save the calculation cost. All the downscaling results by different methods have satisfactory accuracy when compensating residuals to fine-resolution images using bilinear interpolation. Actually, the residuals also appear to be spatial non-stationary and autocorrelative [90,91]. Hence, the geostatistical interpolation methods usually outperform common interpolation methods. Duan et al. [37] compared simple spline tension interpolation with ordinary Kriging interpolation and results show that the downscaling results using the ordinary Kriging interpolation are slightly better than that of simple spline tension interpolation. Recently, some studies about downscaling made improvements on the residual correction procedure and got satisfying results, such as area-to-point Kriging [92,93]. The downscaling results were compared to further analyze the performance of the SRFD, in which the ways for residual correction were bilinear and area-to-point Kriging, respectively. Notably, the S LST, f was also calculated based on referenced fine-resolution LST images and the downscaling results are abbreviated as SRFD-RK. Table 5 shows that the statistical indicators of SRFD-RK downscaling results are slightly better than those of SRFD-R in most images. Nevertheless, the overall accuracies of the two are similar. Moreover, the SSIMs of all SRFD-RK images are lower than those of SRFD-R, maybe because of the smoothing effect from the Kriging interpolation method. In other studies, such as estimating the PM 1 and LST by using statistical models and physical algorithms, attention was also paid to the residual correction. To further improve the accuaracy of results, they built a separate model for residuals by the geographically and temporally weighted regression model (GTWR) or machine learning model [94,95]. Therefore, the approach of residual correction is still a promising direction for improving downscaling accuracy in future studies.

Difference of the Way to Obtain Fine-Resolution Spatial Feature Image
As mentioned, the S LST, f is calculated from the fine-resolution LST downscaling by an RFD without considering the S LST . The common solution to obtain an unknown fineresolution parameter image is to interpolate in a manner similar to performing a residual correction. Moreover, interpolation is far easier and faster than training an RFR model. However, the interpolation will cause extreme smoothing of the S LST, f images, eventually losing feature details to the downscaled images. (as shown in Figure 11, taking image W1 as an example). The downscaled image using interpolated S LST, f image is abbreviated as SRFD-I.

Difference of the Way to Obtain Fine-Resolution Spatial Feature Image
As mentioned, the , is calculated from the fine-resolution LST downscaling by an RFD without considering the . The common solution to obtain an unknown fineresolution parameter image is to interpolate in a manner similar to performing a residual correction. Moreover, interpolation is far easier and faster than training an RFR model. However, the interpolation will cause extreme smoothing of the , images, eventually losing feature details to the downscaled images. (as shown in Figure 11, taking image W1 as an example). The downscaled image using interpolated , image is abbreviated as SRFD-I.

Conclusions
A spatial random forest LST downscaling method which considers numerous predictor variables and performs feature selection-and more importantly, complements the spatial feature of LST-is proposed in this study. The proposed method was applied to

Conclusions
A spatial random forest LST downscaling method which considers numerous predictor variables and performs feature selection-and more importantly, complements the spatial feature of LST-is proposed in this study. The proposed method was applied to three study areas with heterogeneous land covers based on the Landsat 8 images. Finally, the downscaling results were compared to three classical or excellent methods, namely TsHARP, MFGWR, and RFD.
The comparative analysis shows that the downscaling results of SRFD in three study areas with heterogeneous land covers have the best performance in terms of statistical accuracy and visual effects. Benefiting from the complement of the spatial feature of LST, the downscaling results of SRFD have a natural spatial transition and eliminate the noise phenomenon caused by RFD, which ignores the spatial information of LST. Compared with the downscaling results of the second-best RFD, SRFD has reduced RMSEs by 10% to 24%, improved R 2 s by 5% to 20%, decreased MAEs by 11% to 25%, and enhanced SSIMs by 4% to 17%. In addition, SRFD naturally responds to the spatial heterogeneity of LST and far outperforms MFGWR in terms of spatial detail compared with MFGWR methods that consider heterogeneous relationships between LST and predictor variables.
The main error source in the SRFD, which is the computation of the fine-resolution spatial feature of LST, is also quantified. Compared with the downscaling results of SRFD, the quantitative indicators of the downscaling results of SRFD-R decreased by 9% to 18% for RMSEs, increased by 2% to 18% for R 2 s, decreased by 10% to 20% for MAEs, and increased by 4% to 20% for SSIMs.
SRFD also provides a framework for enhancing the performance of downscaling methods through the supplement of the spatial feature of LST. SFRD can theoretically replace the random forest model with any basic machine learning methods (such as ANN and SVM), machine learning frameworks (such as stacking), or even deep learning methods (such as DBN). Moreover, the idea of SFRD applies to downscaling other surfaces or atmospheric parameters with spatially varying and geographical correlation at high resolution, such as precipitation and soil moisture. In addition, we demonstrate the advantages of our approach for acquiring fine-resolution spatial feature images. Complete and referenceable experimental and analytical methods are also provided for the window size of spatial feature calculation of parameters for images of different spatial resolutions.