Spatial Downscaling of Land Surface Temperature Based on a Multi-Factor Geographically Weighted Machine Learning Model

Abstract: Land surface temperature (LST) is a critical parameter of surface energy fluxes and has become the focus of numerous studies. LST downscaling is an effective technique for overcoming the limitations of coarse-resolution LST data. However, the relationship between LST and other land surface parameters tends to be nonlinear and spatially nonstationary due to spatial heterogeneity. Nonlinearity and spatial nonstationarity have not been considered simultaneously in previous studies. To address this issue, we propose a multi-factor geographically weighted machine learning (MFGWML) algorithm. MFGWML utilizes three excellent machine learning (ML) algorithms, namely extreme gradient boosting (XGBoost), multivariate adaptive regression splines (MARS), and Bayesian ridge regression (BRR), as base learners to capture the nonlinear relationships. MFGWML then uses geographically weighted regression (GWR), which allows for spatial nonstationarity, to fuse the three base learners' predictions. This paper downscales 30 m LST data retrieved from Landsat 8 images to 10 m, based mainly on Sentinel-2A images. The results show that MFGWML outperforms two classic algorithms, namely thermal image sharpening (TsHARP) and the high-resolution urban thermal sharpener (HUTS). We conclude that MFGWML combines the advantages of multiple regression, ML, and GWR to capture local heterogeneity and obtain reliable and robust downscaled LST data.


Introduction
Land surface temperature (LST) refers to the radiative temperature of the Earth's surface. LST is highly responsive to the interactions between the land surface and the atmosphere, water circulation, and energy exchange from the local scale to the global scale [1]. Therefore, LST is an essential parameter in various environmental research fields, including climate change and urban heat island effect monitoring [2,3]; land-surface carbon, water, energy, and evapotranspiration mapping [4,5]; soil moisture condition and drought assessment [6,7]; and forest fire detection [8]. Accurate LST measurements at different scales can facilitate these environmental monitoring studies.
Although researchers have made great efforts to improve LST downscaling accuracy, most previous studies still have limitations. First, multiple regression models are prone to overfitting, which is a common issue; feature selection should therefore be the first and most crucial step in the model design stage. However, the features selected in most multi-factor studies are not optimal for LST downscaling models, as these studies disregard the correlations among features and the effects of feature combinations. In practice, highly correlated features are redundant and reduce the prediction performance and stability of models. More importantly, features act on a model in combination rather than individually [37]. Accordingly, researchers should pay more attention to obtaining the optimal feature combination for each LST downscaling model. Second, most previous LST downscaling studies involved MODIS, Landsat, and ASTER LST data but rarely employed high-resolution images without TIR bands, such as Sentinel-2A data. Sentinel-2A data have the advantages of relatively high resolution (10 m) and open access. Therefore, LST downscaling based on Sentinel-2A data needs further investigation to obtain LST data with relatively high resolution. Third, most multi-factor LST downscaling studies utilized classic single-factor algorithms, such as DisTrad and TsHARP, as benchmarks. However, as mentioned above, LST heterogeneity is related to multiple types of factors, so a single factor (such as the NDVI) is inadequate for representing LST; that multi-factor models outperform single-factor models in LST downscaling is to be expected. Therefore, comparing multi-factor models only against single-factor models is not sufficient, and model comparisons that use two-factor algorithms or other multi-factor approaches as benchmarks need further investigation.
In this context, the objectives of this paper involve three aspects: (1) providing an objective feature selection scheme to select the optimal feature combination for each model; (2) developing a multi-factor geographically weighted machine learning (MFGWML) downscaling method to produce reliable and robust 10 m LST data based on Sentinel-2A images; and (3) assessing the MFGWML model by comparing it with two classic downscaling algorithms, namely the single-factor TsHARP model and the two-factor HUTS model. The remainder of this paper is arranged as follows: Section 2 introduces the study area, data, and methods; Section 3 evaluates the LST downscaling results; Section 4 describes the limitations of this paper and future research; and Section 5 summarizes the conclusions.

Study Area
Beijing (39°28′N-41°05′N, 115°25′E-117°30′E), the capital of China, is located in northern China (Figure 1) and covers an area of approximately 16,410.54 km². The general topography of Beijing is high in the northwest and low in the southeast, and the average altitude is 43.5 m. Beijing lies in a warm temperate zone, and its climate is a typical semi-humid continental monsoon climate: hot and rainy in summer, dry and cold in winter, with short spring and autumn seasons. The monthly average temperatures in January and July are −3.7 °C and 26.2 °C, respectively, and the annual average air temperature is approximately 12.3 °C. The annual precipitation in Beijing is 572 mm, of which summer precipitation accounts for approximately three-quarters. With the acceleration of urbanization, the population of Beijing has increased rapidly. By the end of 2018, Beijing had a population of 21.542 million, and its urbanization rate was 86.5%.

Data Description and Image Preprocessing
The Landsat 8 L1TP products, Sentinel-2A Level-1C products, and Shuttle Radar Topography Mission (SRTM) data were downloaded from the United States Geological Survey (USGS) website (https://earthexplorer.usgs.gov/ (accessed on 17 March 2021)). The Landsat 8 and Sentinel-2A images are both defined in the Universal Transverse Mercator (UTM) map projection with the World Geodetic System 84 (WGS84) datum. As shown in Table 1, their corresponding bands have almost consistent spectral ranges. Three Landsat 8 images and seven Sentinel-2A images are needed to cover the entire study area (Table 2). The SRTM data are digital elevation model (DEM) data defined in the geographic projection with the WGS84 horizontal datum and the Earth Gravitational Model 1996 (EGM96) vertical datum; their spatial resolution is one arc-second (approximately 30 m). We also obtained meteorological data from the Resource and Environment Science and Data Center (http://www.resdc.cn/ (accessed on 17 March 2021)). The daily minimum and maximum air temperatures observed at the 20 meteorological stations in Beijing on July 10, 2017, were averaged for the LST retrieval.

The preprocessing of the Landsat 8 and Sentinel-2A images involved radiometric correction, atmospheric correction, geometric registration, seamless mosaicking, clipping, and resampling. The atmospheric corrections of the Landsat 8 multispectral bands and the Sentinel-2A images were performed with the fast line-of-sight atmospheric analysis of spectral hypercubes (FLAASH) model embedded in ENVI software (version 5.3) and the Sen2Cor tool (version 2.8), respectively. The preprocessing of the SRTM data consisted of re-projection, seamless mosaicking, clipping, and resampling.

Determination of the Optimal Feature Combination
Feature selection is the first task for a multivariate regression model, aiming to find the most significant indicators for predicting the response variable. Feature selection is a hot topic, and numerous techniques have been proposed. One of the most classic and popular methods is the least absolute shrinkage and selection operator (LASSO) [54], which imposes an L1 penalty on the regression coefficients to obtain the optimal subset of covariates automatically. However, the LASSO does not perform well in the case of highly correlated predictors. Some extensions of the LASSO have been developed, such as least angle regression (LARS) [55] and the component selection and smoothing operator (COSSO) [56]. However, all covariates must be included in these models at the same time, leading to high model complexity and computational cost. Most feature selection studies focus on selecting variables for linear models while neglecting variables of multiple types. To address feature selection for nonlinear models, Yenigün and Rizzo (2015) [57] proposed two feature selection criteria based on partial maximal correlation and partial distance correlation, respectively. For variables of different natures (such as scalar, circular, and directional), Febrero-Bande et al. (2019) [58] introduced distance correlation to perform feature selection. However, the selected variables are not optimal for a specific model because they are chosen without considering the model.
The above approaches do not consider the effects of feature combination. As mentioned in the introduction, it is crucial to identify the optimal variable combination before establishing models. This paper objectively determines the optimal feature combination through three steps. First, we eliminate the irrelevant features based on the correlations between LST and features. Second, redundant features that are highly correlated are removed. The first two steps are conducive to reducing the number of variables so that the computational cost of the models is significantly decreased. Third, the optimal feature combination is determined according to the variable importance of each model. The extreme gradient boosting (XGBoost) and multivariate adaptive regression splines (MARS) models can provide variable importance, while the Bayesian ridge regression (BRR) model does not have a built-in feature selection function.
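As an illustration, the first two filtering steps might look as follows in R (a minimal sketch, not the authors' code; `features` is a hypothetical data frame of candidate predictors, `lst` the matching LST vector, and the 0.1 irrelevance cutoff is an assumed value):

```r
# Step 1: drop features that are almost irrelevant to LST (|P| ~ 0).
cors_with_lst <- sapply(features, function(f) cor(f, lst, method = "pearson"))
features <- features[, abs(cors_with_lst) > 0.1]

# Step 2: among highly correlated feature pairs, keep only the feature
# most relevant to LST and drop the redundant ones.
cor_mat <- abs(cor(features))
keep <- rep(TRUE, ncol(features))
ord  <- order(-abs(cors_with_lst[colnames(features)]))  # most LST-relevant first
for (i in ord) {
  if (!keep[i]) next
  redundant <- which(cor_mat[i, ] >= 0.85)
  keep[setdiff(redundant, i)] <- FALSE                  # drop weaker duplicates
}
features <- features[, keep]
```

The third step is model-specific and is illustrated in Section "Variable Importance Assessment" below.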

Principles of Base Learners
According to ensemble learning theory, base learners with either no correlation or a weak correlation can be integrated to generate better predictions [59]. Therefore, three excellent models with different structures and optimization approaches, namely XGBoost, MARS, and BRR, were selected as base learners. XGBoost is a tree-structured model that implements gradient boosting for optimization. MARS is a spline-based regression model that utilizes the generalized cross-validation (GCV) method to determine the optimal basis functions. BRR is a hybrid model that integrates ridge regression with Bayesian estimators; its regularization parameters are optimized based on their distributions. These optimization methods help prevent the three models from overfitting, making them excellent in practical applications [29,60,61]. The XGBoost, MARS, and BRR algorithms were implemented with the xgboost, earth, and monomvn packages, respectively, in the R language environment (version 3.6.3).
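A minimal R sketch of fitting the three base learners with the packages named above (hyperparameter values are placeholders, `X`/`y` are a hypothetical predictor matrix and LST sample vector, and the monomvn::bridge usage reflects our reading of that package rather than the authors' exact calls):

```r
library(xgboost)  # XGBoost
library(earth)    # MARS
library(monomvn)  # Bayesian ridge regression

# XGBoost: gradient-boosted regression trees with squared-error loss.
dtrain  <- xgb.DMatrix(data = as.matrix(X), label = y)
xgb_fit <- xgb.train(params = list(objective = "reg:squarederror",
                                   eta = 0.1, max_depth = 6),
                     data = dtrain, nrounds = 200)

# MARS: forward selection plus GCV-based backward pruning of hinge functions.
mars_fit <- earth(x = X, y = y, degree = 2)

# BRR: Bayesian ridge regression sampled via MCMC (monomvn's bridge()).
brr_fit <- bridge(X = as.matrix(X), y = y, T = 1000)

# Predictions from each base learner (posterior means for BRR).
p_xgb  <- predict(xgb_fit, as.matrix(X))
p_mars <- as.numeric(predict(mars_fit, newdata = X))
p_brr  <- as.numeric(as.matrix(X) %*% colMeans(brr_fit$beta) + mean(brr_fit$mu))
```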

XGBoost
XGBoost is a scalable end-to-end tree boosting system [62]. For a given dataset $D = \{(x_i, y_i)\}$ with $n$ examples and $m$ features ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), a tree ensemble model uses $K$ additive functions to make predictions (Equation (9)). Let $\hat{y}_i^{(t)}$ be the prediction for the $i$th instance at the $t$th iteration; $f_t$ is added greedily to minimize the objective function (Equation (10)). XGBoost employs the second-order gradient derivatives of the loss function to quickly optimize the objective function (Equation (11)), and it uses an exact greedy algorithm or an approximate algorithm to propose candidate split points for constructing optimal trees.

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (9)$$

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \qquad (10)$$

$$L^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (11)$$

where $K$ is the number of additive functions that correspond to the trees ($K$ trees in total); $\mathcal{F}$ is the space of regression trees; $l$ denotes a differentiable convex loss function; $\Omega$ indicates the regularizer; and $g_i$ and $h_i$ denote the first-order and second-order gradients of the loss function obtained from its Taylor expansion, respectively.

MARS
MARS is a nonlinear and nonparametric regression model [63]. The MARS model (Equation (12)) is established in two steps. First, all possible basis functions are introduced into the model in a forward stepwise selection procedure. Second, the basis functions that contribute least to the model accuracy are eliminated in a backward stepwise pruning procedure.

$$\hat{y} = c_0 + \sum_{m=1}^{M} c_m B_m(x) \qquad (12)$$

where $\hat{y}$ is the output prediction; $x$ is the input variable vector; $c_0$ is the intercept; $B_m$ is the $m$th basis function, typically a hinge function of the form $\max(0, x - t)$ or $\max(0, t - x)$ with knot $t$, or a product of such functions; $c_m$ is the coefficient of $B_m$; and $M$ is the number of basis functions retained in the model.

BRR
BRR refers to ridge regression implemented based on Bayesian statistical inference [61]. Ridge regression imposes L2 regularization on ordinary least squares regression (Equation (14)) to mitigate the problem of multicollinearity. The regularization parameters in ridge regression (Equation (15)) are specific fixed values, whereas Bayesian regression tunes them to the available data. Bayesian regression assumes the output (y) to be Gaussian distributed around Xw and establishes a full probability model (Equation (16)). BRR uses a spherical Gaussian prior for the coefficients of the probability model (Equation (17)). The relative variable importance for the BRR model was generated with the recursive feature elimination (RFE) algorithm [64].

$$\hat{y}(w, x) = w_0 + w_1 x_1 + \cdots + w_p x_p \qquad (14)$$

$$\min_{w} \left\| Xw - y \right\|_2^2 + \alpha \left\| w \right\|_2^2 \qquad (15)$$

$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha) \qquad (16)$$

$$p(w \mid \lambda) = \mathcal{N}\left(w \mid 0, \lambda^{-1} I_p\right) \qquad (17)$$

where $\hat{y}(w, x)$ represents the predicted value; $X = (x_1, x_2, \ldots, x_p)$ denotes the predictors; $w_0$ refers to the intercept; $w = (w_1, w_2, \ldots, w_p)$ denotes the coefficients; and $\alpha$ ($\alpha \geq 0$) denotes the complexity parameter. The priors of $\alpha$ and $\lambda$ are gamma distributions, and the values of the regularization parameters $\alpha$ and $\lambda$ are estimated by maximizing the log marginal likelihood.

Principle of GWR
GWR is a local regression method that is conducive to investigating spatial nonstationarity; it was proposed by introducing geographically varying coefficients into the ordinary linear regression framework (Equation (18)) [30]. GWR estimates its coefficients with the weighted least squares method (Equation (19)) [30] and calculates the weight matrix with a spatial kernel function, such as the Gaussian kernel (Equation (20)).

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{m} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \qquad (18)$$

$$\hat{\beta}(u_i, v_i) = \left( X^{\mathrm{T}}\, W(u_i, v_i)\, X \right)^{-1} X^{\mathrm{T}}\, W(u_i, v_i)\, y \qquad (19)$$

$$w_{ij} = \exp\left[ -\frac{1}{2} \left( \frac{d_{ij}}{b} \right)^2 \right] \qquad (20)$$

where $\beta_0$ is a constant; $\beta_k$ ($k = 1, \ldots, m$) denotes the coefficients; $\varepsilon_i$ is the residual at point $i$; $m$ refers to the total number of predictors; $(u_i, v_i)$ denotes the geographical coordinates of the $i$th observation; $X$ and $y$ are the matrices for the independent variables and dependent variables, respectively; $W(u_i, v_i)$ is a spatial weighting square matrix whose elements $w_{ij}$ are given by Equation (20); $d_{ij}$ denotes the spatial distance between regression point $i$ and the neighbouring observation point $j$; and $b$ indicates the kernel bandwidth.
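To make Equations (19) and (20) concrete, a single local fit can be written directly in R (an illustrative sketch; `X` is the design matrix with a leading column of ones, `coords` the (u, v) coordinate matrix, and `b` an assumed bandwidth):

```r
# Gaussian kernel weights for regression point i (Equation (20)).
gwr_weights <- function(coords, i, b) {
  d <- sqrt((coords[, 1] - coords[i, 1])^2 + (coords[, 2] - coords[i, 2])^2)
  exp(-0.5 * (d / b)^2)
}

# Local weighted least squares estimate at point i (Equation (19)).
gwr_local_fit <- function(X, y, coords, i, b) {
  W <- diag(gwr_weights(coords, i, b))
  solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # beta_hat(u_i, v_i)
}
```

In practice, packages such as spgwr or GWmodel repeat this fit at every regression point and select the bandwidth b by cross-validation.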

Geographically Weighted Ensemble Learning
MFGWML uses GWR to perform geographically weighted ensemble learning; the procedure involves three steps. First, we randomly sampled 15,000 points from the study area and partitioned all the samples into training and testing subsets in a 7:3 ratio [65]. Second, we employed the same training subset to train the three base models; their hyperparameters were optimized automatically with a 10-fold cross-validation approach. Third, principal component analysis (PCA) was applied to the three LST predictions to avoid local multicollinearity effects, and the PCA components correlated with LST were integrated into the GWR model to generate the ensemble predictions. For downscaling from 30 to 10 m in this paper, PC1, PC2, and PC3 explained 97.472%, 1.857%, and 0.671% of the variance, respectively, in the 10 m LST predictions produced by the three base models; they are highly correlated (|P| = 0.904), weakly correlated (|P| = 0.122), and uncorrelated (|P| = 0.042) with LST, respectively. The PCA components for the other six downscaling schemes exhibit similar behaviour. Accordingly, we excluded PC3 and integrated PC1 and PC2 into the GWR model for the different downscaling schemes.
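The fusion step can be sketched in R as follows (illustrative only; `p_xgb`, `p_mars`, and `p_brr` are the base learners' LST predictions at the sample points, `y` the corresponding LST sample values, `coords` their coordinate matrix, and the spgwr package is one possible GWR implementation — the paper does not state which GWR software was used):

```r
library(spgwr)

# PCA on the three base predictions to remove local multicollinearity.
preds  <- cbind(p_xgb, p_mars, p_brr)
pca    <- prcomp(preds, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])   # keep PC1 and PC2, exclude PC3

# GWR fusion of the retained components.
dat <- data.frame(lst = y, PC1 = scores$PC1, PC2 = scores$PC2)
bw  <- gwr.sel(lst ~ PC1 + PC2, data = dat, coords = coords)
fit <- gwr(lst ~ PC1 + PC2, data = dat, coords = coords, bandwidth = bw)

# Local intercepts, slopes, and residuals (cf. Figure 5).
head(fit$SDF@data)
```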
Classic LST Downscaling Algorithms

TsHARP Algorithm
TsHARP uses the FVC ($f_C$) instead of the NDVI as the scale factor to simulate $T_R$ (Equation (21)) [16], and the form of $f_C$ is simplified to $f_{Cs}$ (Equation (22)) to yield an easy-to-operate function of the NDVI (Equation (23)). The procedure of TsHARP is as follows. First, a simple linear regression model is established to fit the relationship between $T_R$ and $f_{Cs}$ at the coarse thermal resolution ($NDVI_{low}$) (Equation (24)). Second, this regression relationship is applied to the NDVI data with a relatively high resolution ($NDVI_{high}$) (Equation (25)). Third, the coarse-resolution residual $\Delta T_{R\,low}$ is calculated (Equation (26)) and resampled to $\Delta T_{R\,low(r)}$, which has the same spatial resolution as $NDVI_{high}$, with the nearest neighbour resampling method. The sharpened LST ($\hat{T}_{R\,high}$) is derived by adding the resampled residual $\Delta T_{R\,low(r)}$ to the predicted LST ($\hat{T}_R(NDVI_{high})$) data (Equation (27)). It should be noted that water bodies must be excluded before establishing the TsHARP model [16].

$$\hat{T}_R = a_0 + a_1 f_C \qquad (21)$$

$$f_{Cs} = 1 - (1 - NDVI)^{0.625} \qquad (22)$$

$$\hat{T}_R(NDVI) = a_0 + a_1 (1 - NDVI)^{0.625} \qquad (23)$$

$$\hat{T}_R(NDVI_{low}) = a_0 + a_1 (1 - NDVI_{low})^{0.625} \qquad (24)$$

$$\hat{T}_R(NDVI_{high}) = a_0 + a_1 (1 - NDVI_{high})^{0.625} \qquad (25)$$

$$\Delta T_{R\,low} = T_{R\,low} - \hat{T}_R(NDVI_{low}) \qquad (26)$$

$$\hat{T}_{R\,high} = \hat{T}_R(NDVI_{high}) + \Delta T_{R\,low(r)} \qquad (27)$$

where $\hat{T}_R(NDVI)$ denotes the predicted LST (K); $a_0$ and $a_1$ denote the regression coefficients; $f_{Cs}$ is the simplified $f_C$; $NDVI_{low}$ and $NDVI_{high}$ denote the NDVI images with coarse and relatively high spatial resolutions, respectively; $T_{R\,low}$ refers to the coarse-resolution LST (K) data; $\Delta T_{R\,low}$ denotes the coarse-resolution residual (K); $\Delta T_{R\,low(r)}$ represents the resampled $\Delta T_{R\,low}$; and $\hat{T}_{R\,high}$ indicates the downscaled LST (K) data.
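Under the equations above, the TsHARP procedure reduces to a few lines of R (a minimal sketch; `lst_low`, `ndvi_low`, and `ndvi_high` are hypothetical coarse- and fine-resolution grids with water pixels already masked out, and `resample_nn()` stands in for a nearest-neighbour resampling routine):

```r
# Fit the coarse-scale linear model T_R ~ a0 + a1 * (1 - NDVI)^0.625 (Eqs. (23)-(24)).
fcs_low <- (1 - ndvi_low)^0.625
fit     <- lm(as.vector(lst_low) ~ as.vector(fcs_low))
a0 <- coef(fit)[1]; a1 <- coef(fit)[2]

# Apply the relationship at the fine scale (Equation (25)).
lst_pred_high <- a0 + a1 * (1 - ndvi_high)^0.625

# Residual correction (Equations (26)-(27)).
lst_pred_low <- a0 + a1 * fcs_low
residual_low <- lst_low - lst_pred_low
lst_high     <- lst_pred_high + resample_nn(residual_low, dim(ndvi_high))
```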

HUTS Algorithm
The HUTS algorithm was specifically proposed for LST downscaling in urban areas [17]; it fits a fourth-order bivariate polynomial relationship between LST and two factors, namely the NDVI and albedo ($\alpha$) (Equation (28)). The sharpening procedure of HUTS is similar to that of TsHARP.

$$\hat{T}_R = \sum_{i=0}^{4} \sum_{j=0}^{4-i} a_{ij}\, NDVI^{\,i}\, \alpha^{\,j} \qquad (28)$$

where $a_{ij}$ denotes the polynomial regression coefficients and $\alpha$ denotes the albedo.
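A sketch of the HUTS surface fit in R (illustrative; `lst_low`, `ndvi_low`, `albedo_low`, `ndvi_high`, and `albedo_high` are hypothetical coarse- and fine-resolution vectors, and stats::poly builds the fourth-order bivariate polynomial of Equation (28)):

```r
# Fourth-order bivariate polynomial of NDVI and albedo (Equation (28)).
huts_fit <- lm(lst ~ poly(ndvi, albedo, degree = 4, raw = TRUE),
               data = data.frame(lst = lst_low, ndvi = ndvi_low,
                                 albedo = albedo_low))

# Predict at the fine resolution; residual correction then follows as in TsHARP.
lst_pred_high <- predict(huts_fit,
                         newdata = data.frame(ndvi = ndvi_high,
                                              albedo = albedo_high))
```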
The downscaling accuracy was evaluated with the root mean square error (RMSE) and the mean absolute error (MAE):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( LST_{S,i} - LST_{R,i} \right)^2}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| LST_{S,i} - LST_{R,i} \right|$$

where $LST_S$ denotes the downscaled LST (K) data, $LST_R$ indicates the reference LST (K) data, and $n$ represents the number of pixels in the entire image.

Flowchart of the LST downscaling procedure. IMW, improved mono-window; "(low)" and "(high)" represent coarse spatial resolution and relatively high spatial resolution, respectively.

Correlations between LST and Features
Table 5 summarizes the Pearson correlation coefficient values (abbreviated as P) between the features and LST. Most features are moderately correlated (0.4 < |P| < 0.7) or highly correlated (0.7 < |P| < 0.9) with LST. Albedo, NIR, and SWIR1 are weakly correlated with LST (0.2 < |P| < 0.4). Although these features might be useless individually, they can significantly improve the model performance when employed with other features [37]. However, aspect and hillshade were removed, as they are almost irrelevant to LST (|P| ≈ 0).

Correlations among Features
The Pearson correlation coefficients (P) among the features are listed in Table A1 of Appendix A. Among the vegetation indices, the FVC, index-based vegetation index (IVI), modified soil-adjusted vegetation index (MSAVI), NDVI, optimal soil-adjusted vegetation index (OSAVI), SAVI, and TC2 have extremely high correlations (0.9 ≤ |P| ≤ 1). The FVC was retained, and the other vegetation indices were eliminated, given that the FVC is the most relevant to LST among these features. Subsequently, the remaining features were screened based on the coefficient matrix so that all pairwise Pearson correlation coefficients were less than 0.85. Seven features were selected at this stage: DEM, FVC, MNDWI, NIR, slope, SWIR2, and TC1.

Variable Importance Assessment
To identify the optimal feature combination for each model, we separately added the previously selected features to the three base models in descending order of the variable importance rankings provided by each model (Figure 3a-c). The RMSE values of the models with different numbers of variables were then calculated (Figure 3c-e). The lowest RMSE value of each model indicates the optimal number of variables. Accordingly, the optimal feature combination for the XGBoost model comprises the top three variables, namely DEM, SWIR2, and FVC; that for the MARS model comprises the top four variables, namely DEM, MNDWI, SWIR2, and TC1; and that for the BRR model comprises the top five variables, namely SWIR2, FVC, MNDWI, DEM, and TC1. Figure 4 displays the correlation coefficients among the XGBoost, MARS, and BRR predictions. All the correlation coefficient values are less than 0.75, which indicates that the three base models are weakly related; thus, they can be fused to improve the predictions.
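The forward-addition procedure can be expressed schematically in R (here `imp_order` is the importance-ranked feature list from a fitted model, e.g. via xgboost::xgb.importance or earth::evimp, and `fit_and_rmse()` is a hypothetical helper that trains the model on a feature subset and returns its cross-validated RMSE):

```r
# RMSE when using the top-k most important features, for k = 1, 2, ...
rmse_by_k <- sapply(seq_along(imp_order), function(k) {
  fit_and_rmse(features[, imp_order[1:k], drop = FALSE], lst)
})
best_k   <- which.min(rmse_by_k)   # lowest RMSE marks the optimal size
best_set <- imp_order[1:best_k]    # e.g. DEM, SWIR2, FVC for XGBoost
```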

MFGWML Model Parameters
As shown in Figure 5, the regression coefficients (namely the slopes of PC1 and PC2), the intercept, and the residual of the MFGWML model at a spatial resolution of 30 m exhibit evident spatial heterogeneity. Therefore, it is essential to utilize a local model, namely GWR, to characterize the spatially nonstationary relationships between LST and the explanatory variables (PC1 and PC2). These parameters were then resampled from 30 to 10 m with the nearest neighbour resampling method for predicting the 10 m LST.

Comparing MFGWML with TsHARP
The downscaling results of MFGWML and TsHARP are compared after excluding water regions, which were identified with a water mask built by applying the threshold MNDWI > 0.2 to the MNDWI image. As shown in Figure 6, the spatial pattern of the MFGWML downscaled result is more similar to that of the original 30 m LST data than the TsHARP downscaled result. There are many overestimations in the TsHARP downscaled result. These results illustrate that MFGWML recognizes and retains LST heterogeneity much better than TsHARP.

A total of 50,000 points were randomly selected from the entire city of Beijing (excluding water regions) to assess the model performances of MFGWML and TsHARP. The pixel-based scatterplots of the downscaled LST data produced by MFGWML and TsHARP against the corresponding reference LST data for the six downscaling schemes are presented in Figure 7. The fitting line of the scatters in the MFGWML model is closer to the reference line than that in the TsHARP model for each downscaling scheme. Table 6 lists the statistical parameters of the fitting lines. The slope values of the fitting lines of MFGWML and TsHARP are very close. However, the RMSE and MAE values of the fitting equation of MFGWML are lower than those of TsHARP for each downscaling scheme, and the R² value of MFGWML is higher than that of TsHARP. Most scatter points in the MFGWML model are generally distributed on the reference line, which represents a 1:1 relationship, whereas most scatter points in the TsHARP model tend to be located above the reference line, indicating that TsHARP overestimates the downscaled LST data for the six downscaling schemes.
The results confirm that the single explanatory variable, namely the NDVI, in the TsHARP model is incapable of capturing the complicated spatial heterogeneity characteristics of LST data. MFGWML outperforms TsHARP and is robust at different scales.

Figure 7. Pixel-based density scatterplots of the downscaled LST (K) data (Y-axis) and the reference LST (K) data (X-axis) for the six downscaling schemes: (a1,a2,b1,b2,c1,c2,d1,d2,e1,e2,f1,f2). Plots (a1,a2) are the density scatterplots for Scheme 1 and correspond to the MFGWML and TsHARP models, respectively. The remaining plots are named in the same manner. The dotted black line is the 1:1 reference line, and the solid red line is the fitting line.

Figure 8 shows the LST error distribution histograms of the MFGWML and TsHARP models for the six downscaling schemes. Intuitively, the LST errors of the MFGWML model for the six downscaling schemes are concentrated near 0 K.
However, the LST errors of the TsHARP model for the six downscaling schemes tend to be less than 0 K, which indicates that TsHARP performs worse than MFGWML at different scales.

Figure 8. LST error distribution histograms of the MFGWML and TsHARP models for the six downscaling schemes: (a1,a2,b1,b2,c1,c2,d1,d2,e1,e2,f1,f2). Plots (a1,a2) represent the LST error distribution histograms for Scheme 1 and correspond to the MFGWML and TsHARP models, respectively. The remaining plots are named in the same manner.

Figure 9 shows the LST error distribution boxplots of the MFGWML and TsHARP models for the six downscaling schemes. The median of the LST errors in MFGWML for each scheme is almost 0 K, indicating that MFGWML is unbiased, whereas the median of the LST errors in TsHARP for each scheme lies between −5 K and 0 K, indicating that TsHARP is biased. We also compared the interquartile range (IQR), namely the difference between the upper and lower quartiles of the boxplots. The IQR of the MFGWML LST errors for each scheme is clearly narrower than that of TsHARP, indicating that the LST error distribution of MFGWML is more concentrated. Table 7 lists the exact values of the statistical indicators for evaluating the LST error distributions. The means, medians, 25th percentiles (marked as Q1), and 75th percentiles (marked as Q3) of the MFGWML LST errors are closer to 0 K than those of TsHARP for the six downscaling schemes. In addition, the IQR values of MFGWML are less than those of TsHARP for the six downscaling schemes. These statistical results indicate that the LST errors of MFGWML are more concentrated around 0 K than those of TsHARP; in other words, MFGWML performs better than TsHARP at different scales. According to the above boxplots, there are outliers of the LST errors in the MFGWML and TsHARP models for the six downscaling schemes.
Pixels with LST errors less than (Q1 − 1.5 × IQR) or greater than (Q3 + 1.5 × IQR) are identified as outliers. Figure 10 displays the spatial distribution of the LST errors and their outliers in the MFGWML and TsHARP models for Scheme 1, namely downscaling from 90 to 30 m. Most LST errors in the MFGWML model range from −1.5 to 1.5 K (Figure 10a1), while most LST errors in the TsHARP model are less than −3 K (Figure 10a2). Table 8 shows that pixels with LST errors of <−3 K, −3 to −1.5 K, −1.5 to 1.5 K, 1.5 to 3 K, and >3 K account for 2.929%, 9.344%, 80.735%, 6.118%, and 0.874% of all the samples in the MFGWML model for Scheme 1, respectively; the corresponding proportions in the TsHARP model are 37.389%, 34.695%, 26.939%, 0.703%, and 0.274%. Most outliers in the MFGWML model for Scheme 1 occurred in the northern and eastern parts of the urban area, ranging from −5 to −2.616 K. Most outliers in the TsHARP model for Scheme 1 also occurred in the northern and eastern parts of the urban area, but they range from −10 to −7.152 K. The outliers in the MFGWML and TsHARP models for Scheme 1 account for 5.433% and 1.991% of the samples, respectively. The other five downscaling schemes have similar statistical results. These results indicate that the downscaled LSTs have few outliers and that the MFGWML downscaled LST is more reliable than the TsHARP downscaled LST.

Figure 10. Spatial distribution of LST errors (K) in the MFGWML and TsHARP models for Scheme 1. Plots (a1,b1) represent the spatial distribution of LST errors for MFGWML and TsHARP, respectively. Plots (a2,b2) indicate the LST error outliers for MFGWML and TsHARP, respectively. In plots (a2,b2), all colors except light yellow indicate outliers with different error ranges.
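For reference, the boxplot outlier rule used above corresponds to a few lines of R (`errors` is a hypothetical vector of per-pixel LST errors):

```r
q1  <- quantile(errors, 0.25)
q3  <- quantile(errors, 0.75)
iqr <- q3 - q1                      # interquartile range (IQR)
outlier <- errors < (q1 - 1.5 * iqr) | errors > (q3 + 1.5 * iqr)
mean(outlier)                       # fraction of pixels flagged as outliers
```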

Comparing MFGWML with HUTS
As mentioned in Section 2.6.2, the HUTS algorithm was specifically proposed for LST downscaling in urban areas, where the spatial heterogeneity of land covers and LST is relatively considerable. Therefore, we performed HUTS LST downscaling around the central urban area of Beijing. As shown in Figure 11, the spatial pattern of the MFGWML downscaled result is more similar to that of the original 30 m LST data than the HUTS downscaled result. There are many overestimations in the HUTS downscaled result. These results illustrate that MFGWML recognizes and retains LST heterogeneity much better than HUTS.
A total of 50,000 points were randomly selected around the central urban area to assess the model performances of MFGWML and HUTS. The pixel-based scatterplots of the downscaled LST data produced by MFGWML and HUTS against the corresponding reference LST data for the six downscaling schemes are presented in Figure 12. The fitting line of the scatters in the MFGWML model is slightly closer to the reference line than that in the HUTS model for each downscaling scheme. Table 10 lists the statistical parameters of the fitting lines. The slope, RMSE, MAE, and R² values for the fitting lines of MFGWML and HUTS are very close for each downscaling scheme. Most scatter points in the MFGWML model are generally distributed on the reference line, which represents a 1:1 relationship, whereas most scatter points in the HUTS model tend to be located above the reference line, indicating that the HUTS model overestimates the downscaled LST data for the six downscaling schemes. The results demonstrate that the two factors (namely the NDVI and albedo) in the HUTS model are incapable of capturing the LST heterogeneity. MFGWML outperforms HUTS and is robust at different scales.
Figure 12. Density scatterplots of the per-pixel comparison between the downscaled LST (K) data (Y-axis) and the reference LST (K) data (X-axis) at different scales: (a1,a2,b1,b2,c1,c2,d1,d2,e1,e2,f1,f2). Plots (a1,a2) are the density scatterplots for Scheme 1 and correspond to the MFGWML and HUTS models, respectively. The remaining plots are named in the same manner. The dotted black line is the 1:1 reference line, and the solid red line is the fitting line.

Figure 13 displays the LST error distribution histograms of the MFGWML and HUTS models for the six downscaling schemes. Intuitively, the LST errors of the MFGWML model for the six downscaling schemes are concentrated near 0 K, whereas those of the HUTS model tend to be less than 0 K, which indicates that HUTS performs worse than MFGWML at different scales.

Figure 13. LST error distribution histograms of the MFGWML and HUTS models for the six downscaling schemes: (a1,a2,b1,b2,c1,c2,d1,d2,e1,e2,f1,f2). Plots (a1,a2) represent the LST error distribution histograms for Scheme 1 and correspond to the MFGWML and HUTS models, respectively. The remaining plots are named in the same manner.

Figure 14 shows the LST error distribution boxplots of the MFGWML and HUTS models for the six downscaling schemes. The median of the LST errors in MFGWML for each scheme is almost 0 K, indicating that MFGWML is unbiased, whereas the median of the LST errors in HUTS for each scheme lies between −5 K and 0 K, indicating that HUTS is biased.
The IQRs of the LST errors in MFGWML and HUTS are almost the same, indicating that the two models have similar error dispersion. Table 11 lists the exact values of the statistical indicators for evaluating the LST errors. The means, medians, 25th percentiles, and 75th percentiles of the MFGWML LST errors are closer to 0 K than those of HUTS for the six downscaling schemes. The IQR values of MFGWML are slightly less than those of HUTS for Schemes 1 and 2 and slightly greater for Schemes 3-6; overall, the IQR values of the two models are very close for the six downscaling schemes, indicating that their degrees of LST error dispersion are almost the same. These statistical results indicate that the LST errors of MFGWML are concentrated around 0 K, whereas those of HUTS are concentrated around a value less than 0 K. In other words, MFGWML performs better than HUTS at different scales.
Figure 15 displays the spatial distribution of the LST errors and their outliers in the MFGWML and HUTS models for Scheme 1, namely downscaling from 90 to 30 m. Most LST errors in the MFGWML model range from −1.5 to 1.5 K (Figure 15a1), while most LST errors in the HUTS model are less than −3 K (Figure 15a2). Table 12 shows that pixels with LST errors of <−3 K, −3 to −1.5 K, −1.5 to 1.5 K, 1.5 to 3 K, and >3 K account for 8.790%, 17.512%, 62.838%, 9.259%, and 1.601% of all the samples in the MFGWML model for Scheme 1, respectively; the corresponding proportions in the HUTS model are 39.714%, 30.830%, 28.643%, 0.673%, and 0.140%. Most outliers in the MFGWML model for Scheme 1 occurred in the northern part of the urban area, ranging from −10 to −5.018 K. Most outliers in the HUTS model for Scheme 1 occurred in the northern and southwestern parts of the urban area, ranging from −15 to −10 K. The outliers in the MFGWML and HUTS models for Scheme 1 are few, accounting for 1.359% and 1.332% of the samples, respectively.
The other five downscaling schemes have similar statistical results. These results indicate that the MFGWML downscaled LST is more reliable than the HUTS downscaled LST.

Figure 15. Spatial distribution of LST errors (K) in the MFGWML and HUTS models for Scheme 1. Plots (a1,b1) represent the spatial distribution of LST errors in the MFGWML and HUTS models, respectively. Plots (a2,b2) indicate the LST error outliers for MFGWML and HUTS, respectively. In plots (a2,b2), all colors except light yellow indicate outliers with different ranges of LST errors.

Sources of LST Errors
The sources of the LST errors in this paper mainly include three aspects. First, LST errors may arise from spectral differences due to the different acquisition dates of the Landsat 8 and Sentinel-2A images. Second, errors are inevitably generated in the LST retrieval procedures and in the aggregation of the original retrieved LST data and the explanatory variables. Third, residual calibration is an important means of reducing the uncertainty of model outputs; however, the nearest neighbour resampling applied to the coarse-resolution residuals cannot eliminate this uncertainty completely, so some LST errors are generated in the residual calibration procedure.

Limitations
This study has two shortcomings. First, the selected Landsat 8 and Sentinel-2A images have different acquisition dates, owing to the lack of appropriate images. Nevertheless, given that the acquisition dates are close together, the spectral differences over several days in summer are expected to be very small. Second, the input PCA components (PC1 and PC2) weaken the interpretability of the MFGWML model. However, the MFGWML model requires PCA because of the effects of local multicollinearity.

Future Research
Further studies involve the following three aspects. First, the multicollinearity issue in GWR needs further investigation to improve the interpretability of the MFGWML model. Recently, Fotheringham, one of the developers of the GWR algorithm, demonstrated that GWR is very robust to multicollinearity and called for reconsidering the previous contention [69]. Second, additional validation strategies for downscaled LST data need to be explored, given the lack of corresponding ground measurements at the satellite overpass time. Third, the applicability, sensitivity, and uncertainty of the MFGWML model should be investigated for LST downscaling in areas with different climatic, environmental, and economic characteristics.

Conclusions
This study proposed the MFGWML downscaling method to generate reliable and robust high-resolution LST data based on Sentinel-2A images. The MFGWML model introduces GWR to obtain geographically weighted ensemble predictions from three different types of excellent machine learners (namely XGBoost, MARS, and BRR). By comparing the performances of different LST downscaling models, the main conclusions are summarized as follows: (1) The multi-factor downscaling model performs better than the classic single-factor algorithm, namely TsHARP; this conclusion is consistent with those of many previous studies. The multi-factor downscaling model also outperforms the classic two-factor method, namely HUTS. This paper demonstrates that both the single indicator (namely the NDVI) in TsHARP and the two factors (namely the NDVI and albedo) in HUTS are insufficient for capturing the LST heterogeneity.
(2) The experimental results for the six LST downscaling schemes also indicate that MFGWML is robust at different scales.
(3) The MFGWML model integrates the advantages of multi-factor regressions, nonparametric ML algorithms, and the GWR method, to recognize the local heterogeneity and generate reliable and robust LST data.
MFGWML is a practical downscaling approach for obtaining LST data with relatively high spatial resolution, reliability, and robustness. The detailed spatial heterogeneity information in the high-resolution LST data produced by MFGWML can promote LST applications in urban climate and other environmental research. The MFGWML method can also be applied to high-spatial-resolution predictions of other land surface variables and meteorological parameters (such as wind speed) that involve multiple influential factors.