A Data Fusion Modeling Framework for Retrieval of Land Surface Temperature from Landsat-8 and MODIS Data

Land surface temperature (LST) is a critical state variable of land surface energy equilibrium and a key indicator of environmental change such as climate change, urban heat island, and freezing-thawing hazard. The high spatial and temporal resolution datasets are urgently needed for a variety of environmental change studies, especially in remote areas with few LST observation stations. MODIS and Landsat satellites have complementary characteristics in terms of spatial and temporal resolution for LST retrieval. To make full use of their respective advantages, this paper developed a pixel-based multi-spatial resolution adaptive fusion modeling framework (called pMSRAFM). As an instance of this framework, the data fusion model for joint retrieval of LST from Landsat-8 and MODIS data was implemented to generate the synthetic LST with Landsat-like spatial resolution and MODIS temporal information. The performance of pMSRAFM was tested and validated in the Heihe River Basin located in China. The results of six experiments showed that the fused LST was high similarity to the direct Landsat-derived LST with structural similarity index (SSIM) of 0.83 and the index of agreement (d) of 0.84. The range of SSIM was 0.65–0.88, the root mean square error (RMSE) yielded a range of 1.6–3.4 °C, and the averaged bias was 0.6 °C. Furthermore, the temporal information of MODIS LST was retained and optimized in the synthetic LST. The RMSE ranged from 0.7 °C to 1.5 °C with an average value of 1.1 °C. When compared with in situ LST observations, the mean absolute error and bias were reduced after fusion with the mean absolute bias of 1.3 °C. The validation results that fused LST possesses the spatial pattern of Landsat-derived LSTs and inherits most of the temporal properties of MODIS LSTs at the same time, so it can provide more accurate and credible information. Consequently, pMSRAFM can be served as a promising and practical fusion framework to prepare a high-quality LST spatiotemporal dataset for various applications in environment studies.


Introduction
Land surface temperature (LST) is the measurement of the radiative skin temperature over the earth's surface [1]. As a crucial variable associated with the energy balance, LST is highly responsive to land surface energy equilibrium [2,3], which may contribute to explore a variety of phenomena taking place at the surface-atmosphere interface [4]. Therefore, it becomes valuable for various scientific studies [1], such as climatology [5], drought monitoring [6], urban heat island [7][8][9], hydrology [10,11], infrastructure [12], agriculture [13,14], public health [15], and permafrost mapping [16]. However, calculation, and the dynamics of the land surface was treated statically or linearly [25,28,35]. As for LST fusion, they are mainly from the perspective of surface reflectance [27,29,45,48,49], and few studies formulate a common framework for generating synthetic LST products.
To date, a variety of fusion methods have been proposed to achieve both high spatial and temporal resolutions based on the combination of Landsat and MODIS [50,51]. MODIS and Landsat-8 have complementary characteristics in terms of spatial and temporal resolution, especially the two TIR bands. The two TIR channels of Landsat-8 are band 10 and band 11, and the central wavelengths are 10.9 µm and 12 µm, respectively, which are very similar to Band 31 and Band 32 of MODIS. It lay the foundation for the LST fusion of these two satellites at the product level. MODIS LST products are excellent in temporal resolution, and the spatial resolution is adequate to study the global LST variations, but it may not be appropriate for regional research [52]. On the contrary, the 30 m spatial resolution, multispectral bands, and freely available data constitute an excellent utility for a wide range of applications based on Landsat satellites, which provide sufficient spatial details for monitoring land surface changes at the regional scale [53]. However, the 16-day revisit-cycle, even longer because of inferior atmospheric conditions, has limited further applications [51,54]. Beyond that, it is relatively difficult to automatically retrieve accurate and valid LST from Landsat TIR bands, although there are several algorithms available [21,55]. When using these direct retrieval algorithms, the preparation of the corresponding parameters and auxiliary data (such as emissivity, atmospheric parameters, etc.) is rather complicated and tedious [50]. Meanwhile, there is not yet a universal LST product derived from Landsat-8. It is a practical solution to combine MODIS and Landsat to learn more about the temporal and spatial variations of LST [9,30]. The common idea of these methods is to disaggregate LSR data into HSR data based on a pair of co-registered imageries to produce high spatiotemporal products. Each method has its own strengths and limitations, and some reviews on best practices of these methods were briefly described by Hazaymeh and Hassan et al. [47], Zhu et al. [2], and Zhao et al. [28].
The continuous monitoring of the LST dynamics is helpful in assessing the impact of climate change and human activities on environmental change. Thus, it is necessary to develop a practical method to generate the high-quality LST spatiotemporal dataset for exploring the spatial pattern and temporal variations by aggregating multi-source satellite data. Despite the fusion method progresses, the development of new techniques in the LST fusion should be further enhanced. In this paper, we mainly focus on how to generate a synthetic LST by blending the Landsat-8 TIR data and MODIS LST product. Specifically, a pixel-based multi-spatial resolution adaptive fusion modeling framework (named pMSRAFM) is proposed for generating the high-quality synthetic LST in an automated manner. The pMSRAFM uses MODIS LST as prior knowledge to solve the parameters of the Landsat-8 retrieval algorithm for generating LST at Landsat-like resolution on the MODIS observation time. In a sense, it is a disaggregation approach of MODIS LST products with the main difference in the decomposition of the mixed pixels. The remainder of this paper is organized as follows. Section 2 first describes the test region and the dataset used for experiments. Then, it presents a brief introduction to the fundamental theory and the details of the proposed framework. The results of comparison and validation are presented in Section 3. Finally, conclusions and suggestions for future work are provided in Section 4.

Study Area
The southeastern area (40 • 10 -42 • 00 N and 100 • 10 -112 • 10 E) of Heihe River Basin (HRB) was selected as the study area to test the performance of pMSRAFM ( Figure 1). HRB is the second-largest endorheic basin with unique landscapes and coexisting cold and arid regions in northwestern China [56]. It features glaciers, frozen soil, alpine meadows, forests, irrigated crops, riparian ecosystems, and deserts from upstream to downstream [57]. The regional environment variation and monitoring investigation require the support of high spatial and temporal resolution LST datasets. The study area covers 3285 km 2 and ranges in elevation from 2538 to 4615 m. As a climatic and geographical transition zone, Sensors 2020, 20, 4337 4 of 23 the study area provides an ideal testing ground with complex land surface characteristics for verifying the rationality of the proposed LST fusion framework. Many scholars have verified the applicability of satellite-derived LSTs in this region, and the results showed that root mean square error (RMSE) was generally within 5K [42,[58][59][60]. Besides, a large number of aeronautical experiments have been carried out in the HRB [61], and some ground observations can be provided to validate the fusion result and analyze the uncertainty. In situ observations were measured by the automatic TIR stations with an SI-111 infrared radiometer (8-14 µm). The information of seven LST stations operated by HiWATER [61] is described in Table 1 and displayed in Figure 1.
Sensors 2020, 20, x FOR PEER REVIEW 4 of 23 area covers 3285 km 2 and ranges in elevation from 2538 to 4615 m. As a climatic and geographical transition zone, the study area provides an ideal testing ground with complex land surface characteristics for verifying the rationality of the proposed LST fusion framework. Many scholars have verified the applicability of satellite-derived LSTs in this region, and the results showed that root mean square error (RMSE) was generally within 5K [42,[58][59][60]. Besides, a large number of aeronautical experiments have been carried out in the HRB [61], and some ground observations can be provided to validate the fusion result and analyze the uncertainty. In situ observations were measured by the automatic TIR stations with an SI-111 infrared radiometer (8-14μm). The information of seven LST stations operated by HiWATER [61] is described in Table 1 and displayed in Figure 1.    The auxiliary data can provide key information about spatial details to optimize the disaggregation and distribution of LSR pixels. The proper classification of HSR pixels is the most critical for the disaggregation of LSR pixels. For efficiency and convenience, pMSRAFM in this study assigns a unique type code to each HSR pixel based on the land cover category. Therefore, the accuracy of land cover data is the primary consideration. The FROM-GLC (Finer Resolution Observation and Monitoring-Global Land Cover, http://data.ess.tsinghua.edu.cn/) data was selected because of its excellent performance. FROM-GLC is the first 30 m resolution global land cover maps produced using Landsat Thematic Mapper (TM) and Enhanced Thematic Mapper Plus (ETM+) data [62]. The FROM-GLC data contains ten base classes (1-Cropland, 2-Forest, 3-Grassland, 4-Shrubland, 5-Wetland, 6-Water, 8-Impervious, 9-Bareland, 10-Snow and ice), and the overall accuracy of global land cover was above 65% [62]. Image refinement was executed to manually correct certain confusing pixels to further improve the classification accuracy. Since the latest FROM-GLC data only includes products for 2015 and 2017, the selection of satellite data was limited to these two years. The Shuttle Radar Topography Mission (SRTM) 1 Arc-Second Global elevation data was downloaded from the USGS EarthExplorer website (https://earthexplorer.usgs.gov/). These data were reprojected and resampled to match Landsat data.

Satellite Data
The satellite data with minimal cloud cover and clear-sky conditions will be used to evaluate the proposed framework. Finally, Landsat-8 OLI/TIRS C1 Level-1 (L1TP) products of Path 133, Row 34, acquired on 13 September, 2015 (acquisition time 10:45 a.m.) and 16 July, 2017 (acquisition time 10:28 a.m.) were selected as the HSR data. As the highest quality Landsat data, the L1TP products are suitable for pixel-level time series analysis, which have been radiometrically calibrated and orthorectified using ground control points and DEM data to correct for relief displacement [63]. Besides, NDVI can be calculated based on the red and near-infrared bands of Landsat-8. Two types of MODIS daily LST products, MOD11_L2 version 6 swath product and the MOD11A1 version 6 product, were selected and used as the LSR LST data. These MODIS LST products provide daily per-pixel LST and emissivity with a 1 km spatial resolution. The MOD11_L2 is produced daily in 5-min temporal increments of satellite acquisition using the generalized split-window algorithm, and the MOD11A1 is derived from the MOD11_L2 swath product by projecting MOD11_L2 pixels to Earth locations on a sinusoidal mapping grid. These MODIS LST products and Landsat-8 L1TP data were collected from the USGS EarthExplorer. Note that the Landsat-8 TIRS bands provided by the USGS were resampled to 30 m to match multispectral bands. The MODIS products were converted and projected by the HDF-EOS to GeoTIFF Conversion Tool (HEG) [64] to the UTM projection the same as Landsat data.
To ensure the validity of the pixel values, quality control was performed on each input data. A shared quality assurance mask was created for both Landsat and MODIS data to select the valid pixels for the computation. First, a filter screened out the pixels that had the LST values beyond reasonable expectations according to the LST changes during that month. Second, quality flags from both MODIS and Landsat data were used to ensure that only high-quality pixels were selected and involved in fusion. The data quality of MODIS was controlled by built-in quality control flags. In practice, the selection criteria of MOD11A1 pixels was that the daytime LST quality indicators equal to 0. As for MOD11_L2, the LST error was constrained to less than 0.5°C. The data quality of Landsat-8 was controlled based on the Landsat Collection 1 Level-1 Quality Assessment 16-bit Band to select the clear pixels.

Reference LST
To prepare HSR reference LST, we directly retrieved LST from Landsat-8 using the split-window algorithm (SWA) described by Du et al. [65] to obtain comparable high-quality reference LST (LST SWA HSR ). The SWA could obtain LST with an accuracy of better than 1.0 k [65]. Furthermore, we prepared the second HSR reference LST (LST SC HSR ) using the single-channel (SC) method based on Landsat-8 band 10 and FROM-GLC data. The Landsat-derived LSTs in the HRB have been verified with the in situ LSTs estimated from the flux measurements at HiWATER stations and the RMSE ranges from 1.12 • C to 3.67 • C [42,60]. These two HSR reference LSTs are complementary to each other, and both are useful to quantify and characterize the accuracy of the HSR fused LST. In addition, the corresponding observational records were collected from the LST stations located in the study area ( Figure 1 and Table 1). These observational records were calibrated using the Stefan Boltzmann law as described in Duan et al. [22] to assess the accuracy of satellite-derived LSTs.

Methodology
In this study, two aspects are considered in advance when constructing the fusion framework. First, the fusion framework is expected to be mathematically simple, robust, and computationally efficient. Due to the fusion process requires ancillary data and is implemented for each pixel in sequential mode, it should be simple and efficient. The development of temporal-spatial fusion techniques has provided a foundation for establishing the basic framework for multi-source data fusion. The overview of the pMSRAFM framework for multi-source data fusion is described in Figure 2. Our proposed framework relies on the assumption that the LST derived from the different satellites with similar spectral bandwidth and wavelength in the same region at the same time should be highly consistent or linearly correlated. As a form of physical downscaling, the LSR pixel can be considered as an aggregated block of HSR pixels. That is, the LSR pixel value can be approximately expressed by a transfer function of a block of HSR pixel values. The pMSRAFM aims to minimize the global error and estimate an optimal transfer function by joint utilizing the spatiotemporal information derived from multi-source satellites. The solving process consists of three key steps. First, calculate the same physical variables from the multi-resolution data according to their respective retrieval algorithms and extract the multi-dimensional attributes from HSR input data as pixel features. Second, establish the corresponding relationship at the pixel-level between two kinds of data based on the characteristic of the land surface and automatically solve the equation that describes the corresponding relationship. Finally, apply the solved parameters to the retrieval algorithm of the HSR data to generate a new synthetic data product. The following uses LST fusion as a practical case to introduce the application of the pMSRAFM.
Landsat-8 band 10 and FROM-GLC data. The Landsat-derived LSTs in the HRB have been verified with the in situ LSTs estimated from the flux measurements at HiWATER stations and the RMSE ranges from 1.12 °C to 3.67 °C [42,60]. These two HSR reference LSTs are complementary to each other, and both are useful to quantify and characterize the accuracy of the HSR fused LST. In addition, the corresponding observational records were collected from the LST stations located in the study area ( Figure 1 and Table 1). These observational records were calibrated using the Stefan Boltzmann law as described in Duan et al. [22] to assess the accuracy of satellite-derived LSTs.

Methodology
In this study, two aspects are considered in advance when constructing the fusion framework. First, the fusion framework is expected to be mathematically simple, robust, and computationally efficient. Due to the fusion process requires ancillary data and is implemented for each pixel in sequential mode, it should be simple and efficient. The development of temporal-spatial fusion techniques has provided a foundation for establishing the basic framework for multi-source data fusion. The overview of the pMSRAFM framework for multi-source data fusion is described in Figure  2. Our proposed framework relies on the assumption that the LST derived from the different satellites with similar spectral bandwidth and wavelength in the same region at the same time should be highly consistent or linearly correlated. As a form of physical downscaling, the LSR pixel can be considered as an aggregated block of HSR pixels. That is, the LSR pixel value can be approximately expressed by a transfer function of a block of HSR pixel values. The pMSRAFM aims to minimize the global error and estimate an optimal transfer function by joint utilizing the spatiotemporal information derived from multi-source satellites. The solving process consists of three key steps. First, calculate the same physical variables from the multi-resolution data according to their respective retrieval algorithms and extract the multi-dimensional attributes from HSR input data as pixel features. Second, establish the corresponding relationship at the pixel-level between two kinds of data based on the characteristic of the land surface and automatically solve the equation that describes the corresponding relationship. Finally, apply the solved parameters to the retrieval algorithm of the HSR data to generate a new synthetic data product. The following uses LST fusion as a practical case to introduce the application of the pMSRAFM.

LST Fusion with pMSRAFM
The basic idea of LST fusion based on the pMSRAFM is that LSR LST can be estimated by a weighted sum of the HSR LSTs. Essentially, the proposed LST fusion method is designed to integrate the temporal information from the LSR sensor with the spatial pattern of the HSR sensor to generate the high spatiotemporal LST data. It has been proved that different satellite-derived LSTs with different spatial resolutions exist a correlation at the low spatial resolution. In general, the LSR pixel can be seen as an aggregated block of the corresponding HSR pixels at the same time at the same location [35,66,67], as shown in Figure 3. So, the relationship between LSTs from two satellites over the homogeneous land surface can be expressed by the linear transfer function: Sensors 2020, 20, 4337 7 of 23 different spatial resolutions exist a correlation at the low spatial resolution. In general, the LSR pixel can be seen as an aggregated block of the corresponding HSR pixels at the same time at the same location [35,66,67], as shown in Figure 3. So, the relationship between LSTs from two satellites over the homogeneous land surface can be expressed by the linear transfer function: where LSTLSR and LSTHSR define low-resolution LST and high-resolution LST, respectively. (x, y) represents a given location, and t is the acquisition time. a and b are the coefficients used to adjust the observational differences, ideally set to 1 and 0, respectively [44,51]. ( ) is the residual representing the observational difference between these two satellite platforms. It is mainly due to the discrepancy in acquisition time, bandwidth, orbit parameters, geolocation errors, effective pixel coverage, and spectral response function. To handle the heterogeneous characteristics, we employ a similar method to the linear spectral mixture analysis [66][67][68] to distribute the temporal information of LSTLSR according to the spatial details of LSTHSR. Thus, each HSR pixel may have different parameters of a and b. Then the Equation (1) is reformed as: where ( ) is LSR pixel value at time t. , and , are regression coefficients between two satellites of the first layer at coarse resolution. N is the total number of HSR pixels corresponding to an LSR pixel. , ( ) is the i-th HSR pixel value at time t. , and , are parameters of the second layer for each HSR pixel at fine resolution. The pixel value of LSR LST can be considered as a two-layer linear combination of HSR LSTs [34]. In the practical calculation, Equation (2) can be simplified by merging the two-layer parameters as: where LST LSR and LST HSR define low-resolution LST and high-resolution LST, respectively. (x, y) represents a given location, and t is the acquisition time. a and b are the coefficients used to adjust the observational differences, ideally set to 1 and 0, respectively [44,51]. ε(t) is the residual representing the observational difference between these two satellite platforms. It is mainly due to the discrepancy in acquisition time, bandwidth, orbit parameters, geolocation errors, effective pixel coverage, and spectral response function. To handle the heterogeneous characteristics, we employ a similar method to the linear spectral mixture analysis [66][67][68] to distribute the temporal information of LST LSR according to the spatial details of LST HSR . Thus, each HSR pixel may have different parameters of a and b. Then the Equation (1) is reformed as: where LST LSR (t) is LSR pixel value at time t. a 1,t and b 1,t are regression coefficients between two satellites of the first layer at coarse resolution. N is the total number of HSR pixels corresponding to an LSR pixel. LST HSR,i (t) is the i-th HSR pixel value at time t. a 2,i and b 2,i are parameters of the second layer for each HSR pixel at fine resolution. The pixel value of LSR LST can be considered as a two-layer linear combination of HSR LSTs [34]. In the practical calculation, Equation (2) can be simplified by merging the two-layer parameters as: Then, pMSRAFM requires at least two pairs of LSR and HSR imageries to solve Equation (3) pixel by pixel within the LSR imagery. That is, the relationship between MODIS and Landsat-8 can be conveniently expressed by Equation (3) in the following matrix form: a n,1 a n,j a n,n where n is the spatial resolution scaling factor between MODIS and Landsat-8, LST Lat i,j is the pixel value at location [i,j] in the Landsat-8 scene, and LST MOD LSR is the MODIS LST pixel value. Equations (3) and (4) show the correspondence between LSTs at different spatial resolutions under ideal conditions. In order Sensors 2020, 20, 4337 8 of 23 to solve Equation (4), the HSR pixels must be classified according to their response characteristics to LST. Furthermore, given the complexity of LST retrieval from Landsat-8 data, we replace the real HSR LST with the physical retrieval algorithm. To keep MODIS and Landsat-8 LST retrieval algorithms consistent, the practical split-window algorithm (SWA) [69,70], as described by Du et al. [65], was selected. The split-window algorithm is well-proven and widely used in LST retrieval from the TIR sensor [71]. Then, the LST value of a Landsat-8 pixel with pixel type k can be approximated by SWA as a multivariate quadratic equation: where a k,0 , a k,1 , a k,2 , and a k,3 are equation parameters related to the pixel type k. BT i and BT j are brightness temperatures converted from the spectral radiance of Landsat-8 band 10 and band 11, respectively. ε SWA is the adjustment term due to the effects of spatial heterogeneity. For more information about this SWA algorithm, please refer to Du et al. [65]. The LST retrieval algorithm generally relies on ground surface information to characterize the spatial pattern [4]. As the most common environmental factors, NDVI and DEM data can provide complementary information about the spatial variability of LST.
They are not only easy to obtain but also directly affect spatial distribution [72]. These two factors are incorporated into the transfer function as covariables to minimize fusion error. Finally, the relationship between MODIS LST and Landsat-8 data can be expressed mathematically by the equation as follows: where a k,1 , a k,2 , a k,3 , a k,4 , a k,5 , and b k are the parameters of the transfer function. M is the total number of HSR classifications. The parameters are determined by the least square method. Once the optimal solution is determined, the HSR fused LST for the whole study area can be calculated by the following equation: where a k,1 , a k,2 , a k,3 , a k,4 , a k,5 , and b k denote the optimal solution of Equation (6).

Adjustment for HSR Comparison
In order to analyze the relationship between the HSR fused LST and the Landsat-derived LST under the same space-time basis, the value range of the HSR fused LST needs to be adjusted. Since the HSR fused LST aims to preserve the temporal information of MODIS LST and capture the spatial details of the Landsat-8, there is an observational difference between the HSR fused LST and reference LST derived from Landsat. The difference mainly includes two sources: the first is satellite differences in bandwidth, acquisition time, spectral response functions, geolocation errors, and atmospheric correction [73]. The second is the accumulation of simulation errors, which arises from data quality and the rationality of the transfer function. So, it is necessary to adjust the value range of the HSR fused LST for a quantitative analysis of the relationship between the fused LST and Landsat-derived LST. However, it is difficult to accurately estimate the distribution of differences without the corresponding high-resolution reference LST. In this study, we use an indirect method to adjust the discrepancy to fully verify the accuracy of the fusion result. We first calculate the LSR fused LST (LST DF LSR ) of Landsat and MODIS at low spatial resolution and use it as LSR reference LST. The LST DF LSR is generated based on MODIS emissivity and Landsat-8 brightness temperature according to the single-channel (SC) method [50,74] as follow: where ε LSR is the MODIS Band 31 emissivity. BT LSR is the average brightness temperature of N pixels of Landsat-8 band 10. λ is the center wavelength. σ is Boltzmann's constant, 1.38 × 10 −23 J/K, h is Planck's constant, 6.626 × 10 −34 J, c is the velocity of light. The difference between the HSR fused LST and Landsat-derived LST is estimated by analyzing the relationship between LST DF LSR and MODIS LST LST MODIS LSR as follow: To summarize, pMSRAFM is applied to LST fusion based on Landsat-8 and MODIS data. It involves three major steps to solve the transfer function. The first is to calculate brightness temperatures and NDVI from Landsat-8 data. The second step is to extract MODIS LST pixels according to the Landsat-8 scene acquisition date and geolocation. The final step is to determine the parameters of the transfer functions and downscale MODIS LST into Landsat-8 spatial resolution while preserving the original temporal properties. The detailed steps of pMSRAFM implemented for the LST fusion of Landsat-8 and MODIS data are described in Figure 4.
correction [73]. The second is the accumulation of simulation errors, which arises from data quality and the rationality of the transfer function. So, it is necessary to adjust the value range of the HSR fused LST for a quantitative analysis of the relationship between the fused LST and Landsat-derived LST. However, it is difficult to accurately estimate the distribution of differences without the corresponding high-resolution reference LST. In this study, we use an indirect method to adjust the discrepancy to fully verify the accuracy of the fusion result. We first calculate the LSR fused LST ( ) of Landsat and MODIS at low spatial resolution and use it as LSR reference LST. The is generated based on MODIS emissivity and Landsat-8 brightness temperature according to the single-channel (SC) method [50,74] as follow: where is the MODIS Band 31 emissivity. BTLSR is the average brightness temperature of N pixels of Landsat-8 band 10. λ is the center wavelength. σ is Boltzmann's constant, 1.38 × 10 −23 J/K, h is Planck's constant, 6.626 × 10 −34 J, c is the velocity of light. The difference between the HSR fused LST and Landsat-derived LST is estimated by analyzing the relationship between and MODIS LST as follow: To summarize, pMSRAFM is applied to LST fusion based on Landsat-8 and MODIS data. It involves three major steps to solve the transfer function. The first is to calculate brightness temperatures and NDVI from Landsat-8 data. The second step is to extract MODIS LST pixels according to the Landsat-8 scene acquisition date and geolocation. The final step is to determine the parameters of the transfer functions and downscale MODIS LST into Landsat-8 spatial resolution while preserving the original temporal properties. The detailed steps of pMSRAFM implemented for the LST fusion of Landsat-8 and MODIS data are described in Figure 4.

Accuracy Assessment
There are many challenges in the evaluation of the satellite-derived LSTs. Compared to the relatively large pixel observed by the satellite, most of the in situ LST observations are measured within a very small area and may introduce significant differences to the satellite measurements [4]. Besides, it is not sufficient to simply compare the fused LST with the ground measurements due to the scarcity and uneven distribution of in situ stations with the lack of observations at satellite overpass time. To fully evaluate the performance of pMSRAFM for generating Landsat-like daily LST, a comprehensive evaluation has been made with three types of reference data. First, LST DF HSR is resampled and compared with MODIS LST product. Second, LST DF HSR is direct pixel-by-pixel comparison with the reference LST derived from Landsat. Finally, the satellite-derived LSTs are compared with the in situ observations. Evaluation metrics included the correlation coefficient (R), bias, standard deviation (STD), root mean square error (RMSE), structural similarity index (SSIM) [75], and the index of agreement (d) [76]. Bias, RMSE, STD, and d are used to quantitatively validate the difference between the selected LST product and reference LST. SSIM is used to describe the closeness of spatial similarity. They are calculated using the following equations: where LST  [77]. The SSIM is a measure of the structural similarity of two images [78][79][80]. It considers similarity degrees of luminance, contrast, and structure of two images [75]. The larger the SSIM value, the better, and the maximum is 1. To standardize the LST range, x and y are divided by their maximum value, respectively. Mathematically, the modified SSIM is defined as: SSIM(x, y) = 2µ x µ y + C 1 2σ xy + C 2 where σ and σ xy denote the standard deviation and cross-correlation between x and y, respectively. C 1 and C 2 are two positive stabilizing constants [80].

Results
The performance of pMSRAFM was evaluated by comparing the HSR fused LST with MODIS LST products, and the LSTs directly derived from Landsat-8 and in situ observations. The evaluation methods involved visual comparison and statistical indicators. Two pairs of MODIS and Landsat images acquired on 13 September 2015 and 16 July 2017 were used as the primary inputs, named Case 1 and Case 2, respectively. In each case, the MOD11A1 and MOD11_L2 were used to be fused with Landsat-8, respectively. To analyze different types of MODIS data, we conducted a comparative experiment based on MOD11_L2 data, which was divided into two scenarios and labeled "MOD11L2_a" and "MOD11L2_b". A total of six experiments were conducted to evaluate the performance of pMSRAFM.
For visual comparison, all spatial maps were rotated clockwise by 18 • . In each case, pMSRAFM shared the same HSR data, but the MODIS data was different. Therefore, we named each experiment with the MODIS LST product name to distinguish each fusion result below.

Fusion Results of Case 1
The LST fusion of MOD11A1 and Landsat-8 was constrained to pixels that: the local time of MOD11A1 pixel observation was limited to 11:42~11:48 a.m., the zenith angle was within the range from 60 • to 63 • , and there was no invalid value in the corresponding Landsat pixel block. The final selected pixels accounted for 80.15% of the total number of MODIS pixels. The number of land cover classes was set to 7 according to the FROM-GLC data. The observation time of MOD11L2_a was limited to 10:06~10:12 and the difference with LST DF LSR was less than 0. The final selected pixels accounted for 44.68% of the total. As for MOD11L2_b, the observation time was limited to 11:42~11:48 and the difference with LST DF LSR was greater than 0. The final selected pixels accounted for 36.04% of the total. Because all the fused LSTs shared the same model and auxiliary data, their comparisons with reference LSTs were feasible. The temporal and spatial accuracy of the fusion results in Case 1 were evaluated from the following three aspects:

Comparison with MODIS LST
A pixel-by-pixel comparison between the HSR fused LSTs and the MODIS LSTs was conducted to quantify the fusion accuracy at the low spatial resolution. To ensure the consistency of spatial resolution, the HSR fused LSTs were aggregated to MODIS-like resolution. Figure 5 illustrates the comparison results between the HSR fused LSTs(y-axis) and their corresponding MODIS LST products (x-axis): Figure 5a   with Landsat-8, respectively. To analyze different types of MODIS data, we conducted a comparative experiment based on MOD11_L2 data, which was divided into two scenarios and labeled "MOD11L2_a" and "MOD11L2_b". A total of six experiments were conducted to evaluate the performance of pMSRAFM. For visual comparison, all spatial maps were rotated clockwise by 18°.
In each case, pMSRAFM shared the same HSR data, but the MODIS data was different. Therefore, we named each experiment with the MODIS LST product name to distinguish each fusion result below.

Fusion Results of Case 1
The LST fusion of MOD11A1 and Landsat-8 was constrained to pixels that: the local time of MOD11A1 pixel observation was limited to 11:42 ~ 11: 48 a.m, the zenith angle was within the range from 60° to 63°, and there was no invalid value in the corresponding Landsat pixel block. The final selected pixels accounted for 80.15% of the total number of MODIS pixels. The number of land cover classes was set to 7 according to the FROM-GLC data. The observation time of MOD11L2_a was limited to 10:06~10:12 and the difference with was less than 0. The final selected pixels accounted for 44.68% of the total. As for MOD11L2_b, the observation time was limited to 11:42~11:48 and the difference with was greater than 0. The final selected pixels accounted for 36.04% of the total. Because all the fused LSTs shared the same model and auxiliary data, their comparisons with reference LSTs were feasible. The temporal and spatial accuracy of the fusion results in Case 1 were evaluated from the following three aspects:

Comparison with MODIS LST
A pixel-by-pixel comparison between the HSR fused LSTs and the MODIS LSTs was conducted to quantify the fusion accuracy at the low spatial resolution. To ensure the consistency of spatial resolution, the HSR fused LSTs were aggregated to MODIS-like resolution. Figure 5 illustrates the comparison results between the HSR fused LSTs(y-axis) and their corresponding MODIS LST products (x-axis): Figure 5a vs. MOD11A1 LST, Figure 5b vs. MOD11L2_a LST, and Figure 5c vs. MOD11L2_b LST on 13 September, 2015, respectively. The results show an overall good agreement between the HSR fused LSTs and MODIS LSTs with R 2 > 0.7 and d > 0.9. MOD11A1 has the most samples but the lowest accuracy with R = 0.85, STD = 1.31, bias = 0.04, and RMSE =1.31. Figure 5b shows that the difference between the and MOD11L2_b is minimal with RMSE = 0.76, bias = 0, and d = 0.91. Most of the evaluation indicators in Figure 5c are the best with R = 0.93, d = 0.96, and RMSE = 0.92, which indicates that the retains most of MODIS LST information. MOD11L2 LST has a better precision compared to MOD11A1. Although a few pixels show significant differences, the HSR fused LSTs and MODIS LSTs maintain a high consistency at low spatial resolution.  The spatial distributions of the HSR fused LSTs and reference LSTs of Case 1 are mapped and shown in Figure 6. Figure 6a,c,e and Figure 6b,d,f are MODIS LSTs and their corresponding HSR fused LSTs, respectively. The SSIM ranges from 0.66 to 0.86, indicating a significant spatial similarity. Figure 6g,h are the resampled reference LSTs. Visually, it can be clearly seen that the spatial pattern of the HSR fused LST is closer to that of reference LST. The HSR LSTs present more delicate spatial details and show a high correlation between them. Figures 5 and 6 indicate that the spatiotemporal information of the MODIS LST is well reflected in the HSR fused LST.
Sensors 2020, 20, x FOR PEER REVIEW 12 of 23 The spatial distributions of the HSR fused LSTs and reference LSTs of Case 1 are mapped and shown in Figure 6. Figure 6a,c,e and Figure 6b,d,f are MODIS LSTs and their corresponding HSR fused LSTs, respectively. The SSIM ranges from 0.66 to 0.86, indicating a significant spatial similarity. Figure 6g,h are the resampled reference LSTs. Visually, it can be clearly seen that the spatial pattern of the HSR fused LST is closer to that of reference LST. The HSR LSTs present more delicate spatial details and show a high correlation between them. Figures 5 and 6 indicate that the spatiotemporal information of the MODIS LST is well reflected in the HSR fused LST.

Comparison with Landsat-Derived LST
Since the primary goal of pMSRAFM is to improve the spatial resolution of the LSR data meanwhile retaining its temporal information, we pay more attention to the spatial pattern of the HSR fused LSTs at high spatial resolution. The spatial distribution and frequency histogram of the HSR fused LSTs and reference LSTs are shown in Figure 7, and the comparison results are summarized in Table 2. It can be seen from Figure 7 that the LST spatial variations of the five maps are highly similar. Compared with Figure 6, the spatial pattern of the fused LST, especially the fusion result of MOD11L2, is visually improved and more similar to that of the reference LST. The fused LST and reference LST are approximately the same in spatial variation with the terrain. Although the LST ranges (maximum and minimum) are different, the reference LST and fused LST match well in terms of the spatial details. It is observed from the frequency histogram that the reference LST is significantly higher than the fusion LST. Figure 6 and Figure 7 illustrate the fused LST captures the spatial pattern of Landsat LST and retains the original value range of MODIS LST.

Comparison with Landsat-Derived LST
Since the primary goal of pMSRAFM is to improve the spatial resolution of the LSR data meanwhile retaining its temporal information, we pay more attention to the spatial pattern of the HSR fused LSTs at high spatial resolution. The spatial distribution and frequency histogram of the HSR fused LSTs and reference LSTs are shown in Figure 7, and the comparison results are summarized in Table 2. It can be seen from Figure 7 that the LST spatial variations of the five maps are highly similar. Compared with Figure 6, the spatial pattern of the fused LST, especially the fusion result of MOD11L2, is visually improved and more similar to that of the reference LST. The fused LST and reference LST are approximately the same in spatial variation with the terrain. Although the LST ranges (maximum and minimum) are different, the reference LST and fused LST match well in terms of the spatial details. It is observed from the frequency histogram that the reference LST is significantly higher than the fusion LST. Figures 6 and 7  As shown in Table 2, SSIM, d, bias, and R are improved after fusion, and the best agreement occurs in MOD11L2_b with SSIM ≥ 0.8 and R ≥ 0.9. Figure 7 and Table 2 indicate that the fused LST is substantially the same as the reference LST in the spatial pattern, despite the observational differences. From the perspective of SSIM and d, the fused LSTs are closer to . The SSIM ranges from 0.77 to 0.88, and the d ranges from 0.79 to 0.94. The is the best fused LST in terms of SSIM and d. It is suggested that pMSRAFM will be practically useful for the LST fusion when the relationship between the paired satellite data can be linearly modeled. Overall, the fusion results of MOD11_L2 are better than that of MOD11A1. Other indicators are also calculated as measures of accuracy. It is found that the bias is less than 2 °C, and RMSE ranges from 1.7 to 3.5 °C. Further contrastive analysis is discussed in Section 3.3.   As shown in Table 2, SSIM, d, bias, and R are improved after fusion, and the best agreement occurs in MOD11L2_b with SSIM ≥ 0.8 and R ≥ 0.9. Figure 7 and Table 2 indicate that the fused LST is substantially the same as the reference LST in the spatial pattern, despite the observational differences. From the perspective of SSIM and d, the fused LSTs are closer to LST SC HSR . The SSIM ranges from 0.77 to 0.88, and the d ranges from 0.79 to 0.94. The LST DF3 HSR is the best fused LST in terms of SSIM and d. It is suggested that pMSRAFM will be practically useful for the LST fusion when the relationship between the paired satellite data can be linearly modeled. Overall, the fusion results of MOD11_L2 are better than that of MOD11A1. Other indicators are also calculated as measures of accuracy. It is found that the bias is less than 2 • C, and RMSE ranges from 1.7 to 3.5 • C. Further contrastive analysis is discussed in Section 3.3.

Validation with In Situ Observations
The HSR fused LST was also compared with in situ observations at the same time. Four stations in the study area were available to evaluate the accuracy of fusion results in Case 1. The error probability was estimated through the Gaussian distribution of the difference between the LSR fused LST and MODIS LST. In addition, the MODIS LSTs and in situ observations were compared to analyze the accuracy under different spatial resolutions. The results are shown in Figure 8. From Figure 8a, the bias of the fused LSTs and MODIS LSTs is in the acceptable range (−2.1~0.6 • C). The mean absolute errors between the fused LSTs and in situ observations are 2.1 • C, 0.6 • C, and 0.6 • C, respectively. After fusion, the absolute value of bias of MOD11A1 becomes larger, while that of MOD11L2 becomes smaller. Figure 8b shows that the fusion error of MOD11L2 is relatively stable and smaller than that of MOD11A1 LST. The fusion result of MOD11L2_b is the best. By contrast, there is some uncertainty in the fusion result of MOD11A1, which underestimates the real LST by 2 • C. Overall, most of the errors are concentrated in the range −22 • C, so the fusion results are satisfactory. From the probability of error occurrence, the smaller the absolute value of the error, the higher the probability. The maximum probability is 0.38, and the minimum probability is 0.09. The comparisons with in situ observations in Case 1 indicate that the LST generated by pMSRAFM could depict the spatial distribution of the real LST within a reasonable error range.

. Validation with in Situ Observations
The HSR fused LST was also compared with in situ observations at the same time. Four stations in the study area were available to evaluate the accuracy of fusion results in Case 1. The error probability was estimated through the Gaussian distribution of the difference between the LSR fused LST and MODIS LST. In addition, the MODIS LSTs and in situ observations were compared to analyze the accuracy under different spatial resolutions. The results are shown in Figure 8. From Figure 8a, the bias of the fused LSTs and MODIS LSTs is in the acceptable range (−2.1~0.6°C). The mean absolute errors between the fused LSTs and in situ observations are 2.1°C, 0.6°C, and 0.6°C, respectively. After fusion, the absolute value of bias of MOD11A1 becomes larger, while that of MOD11L2 becomes smaller. Figure 8b shows that the fusion error of MOD11L2 is relatively stable and smaller than that of MOD11A1 LST. The fusion result of MOD11L2_b is the best. By contrast, there is some uncertainty in the fusion result of MOD11A1, which underestimates the real LST by 2 °C. Overall, most of the errors are concentrated in the range −22 °C, so the fusion results are satisfactory. From the probability of error occurrence, the smaller the absolute value of the error, the higher the probability. The maximum probability is 0.38, and the minimum probability is 0.09. The comparisons with in situ observations in Case 1 indicate that the LST generated by pMSRAFM could depict the spatial distribution of the real LST within a reasonable error range.

Fused Results of Case 2
Similar to Case 1, the pixels of each MODIS LST product of Case 2 were also filtered under different constraints. The pixel of MOD11A1 was constrained to that: the local time of MOD11A1 pixel observation was limited to 11:42 ~ 11: 48, the zenith angles were within the range from 60° to 62°, and there was no invalid value in the corresponding Landsat pixel block. The final selected pixels accounted for 85.48% of the total. The number of land cover classes was set to 9 according to the FROM-GLC data. The observation time of MOD11L2_a pixel was limited to 10:06 and the difference with was less than 0. The final selected pixels accounted for 62.34% of the total. The observation time of MOD11L2_b pixel was limited to 11:42. The final selected pixels accounted for 38.99% of the total. Based on the above dataset, we used pMSRAFM to generate three kinds of daily LST with 30 m resolution to further verify its universality.

Fused Results of Case 2
Similar to Case 1, the pixels of each MODIS LST product of Case 2 were also filtered under different constraints. The pixel of MOD11A1 was constrained to that: the local time of MOD11A1 pixel observation was limited to 11:42~11: 48, the zenith angles were within the range from 60 • to 62 • , and there was no invalid value in the corresponding Landsat pixel block. The final selected pixels accounted for 85.48% of the total. The number of land cover classes was set to 9 according to the FROM-GLC data. The observation time of MOD11L2_a pixel was limited to 10:06 and the difference with LST DF LSR was less than 0. The final selected pixels accounted for 62.34% of the total. The observation time of MOD11L2_b pixel was limited to 11:42. The final selected pixels accounted for 38.99% of the total. Based on the above dataset, we used pMSRAFM to generate three kinds of daily LST with 30 m resolution to further verify its universality. Figure 9 shows the pixel-by-pixel comparisons between HSR fused LSTs and MODIS LSTs. As expected, there is a significant correlation between them. The d ranges from 0.75 to 0.89, and the range of R 2 is 0.65~0.77. Overall, the R is greater than 0.8, bias is less than 0.05 • C, and RMSE yields a range of 0.93~1.53 • C. The spatial distributions of the fused LSTs and their corresponding MODIS LST products are shown in Figure 10. Figure 10g,h is the resampled reference LSTs. As can be seen from Figure 10, the spatial patterns of the fused LSTs and MODIS LST products are similar, especially in 'hot' areas, and the range of SSIM is from 0.66 to 0.83. Another proof that the fused LST retains most of the spatial and temporal information of MODIS LST and is closer to the spatial pattern of reference LST at the low spatial resolution.

Comparison with MODIS LST
Sensors 2020, 20, x FOR PEER REVIEW 15 of 23 Figure 9 shows the pixel-by-pixel comparisons between HSR fused LSTs and MODIS LSTs. As expected, there is a significant correlation between them. The d ranges from 0.75 to 0.89, and the range of R 2 is 0.65~0.77. Overall, the R is greater than 0.8, bias is less than 0.05°C, and RMSE yields a range of 0.93~1.53°C. The spatial distributions of the fused LSTs and their corresponding MODIS LST products are shown in Figure 10. Figure 10g and 10h are the resampled reference LSTs. As can be seen from Figure 10, the spatial patterns of the fused LSTs and MODIS LST products are similar, especially in 'hot' areas, and the range of SSIM is from 0.66 to 0.83. Another proof that the fused LST retains most of the spatial and temporal information of MODIS LST and is closer to the spatial pattern of reference LST at the low spatial resolution.    Figure 9 shows the pixel-by-pixel comparisons between HSR fused LSTs and MODIS LSTs. As expected, there is a significant correlation between them. The d ranges from 0.75 to 0.89, and the range of R 2 is 0.65~0.77. Overall, the R is greater than 0.8, bias is less than 0.05°C, and RMSE yields a range of 0.93~1.53°C. The spatial distributions of the fused LSTs and their corresponding MODIS LST products are shown in Figure 10. Figure 10g and 10h are the resampled reference LSTs. As can be seen from Figure 10, the spatial patterns of the fused LSTs and MODIS LST products are similar, especially in 'hot' areas, and the range of SSIM is from 0.66 to 0.83. Another proof that the fused LST retains most of the spatial and temporal information of MODIS LST and is closer to the spatial pattern of reference LST at the low spatial resolution.

Comparison with Landsat-Derived LST
The spatial distributions with frequency histogram of reference LST and the fused LST in Case 2 are shown in Figure 11. As can be seen from the five maps, their spatial patterns are highly coherent and consistent. Intuitively, the spatial pattern of the fused LST is substantially similar to that of the reference LST. Although each LST has a unique value range, most of its data is concentrated in the range of 20 • C~40 • C. Table 3 summarizes the statistical results of the fused LSTs and MODIS LSTs compared with reference LSTs at different spatial resolutions. As expected, pMSRAFM performs well in the spatial pattern simulation of Case 2, in which the SSIM ranges from 0.80 to 0.88. Table 3 and Figure 11 show the fused LST tends to share a similar spatial pattern of reference LST derived by the SC algorithm. Again, the fusion result of MOD11L2_b has the best evaluation indicators, and MOD11L2 is better than MOD11A1.

Comparison with Landsat-Derived LST
The spatial distributions with frequency histogram of reference LST and the fused LST in Case 2 are shown in Figure 11. As can be seen from the five maps, their spatial patterns are highly coherent and consistent. Intuitively, the spatial pattern of the fused LST is substantially similar to that of the reference LST. Although each LST has a unique value range, most of its data is concentrated in the range of 20 °C ~ 40 °C. Table 3 summarizes the statistical results of the fused LSTs and MODIS LSTs compared with reference LSTs at different spatial resolutions. As expected, pMSRAFM performs well in the spatial pattern simulation of Case 2, in which the SSIM ranges from 0.80 to 0.88. Table 3 and Figure 11 show the fused LST tends to share a similar spatial pattern of reference LST derived by the SC algorithm. Again, the fusion result of MOD11L2_b has the best evaluation indicators, and MOD11L2 is better than MOD11A1. The fused LSTs and MODIS LSTs were compared pixel-by-pixel with the reference LSTs to explore changes in evaluation indicators before and after fusion. As shown in Table 3, MODIS LSTs are also relatively consistent with the reference LSTs generated by the SC algorithm. Indicators for evaluating the spatial pattern are improved by the fusion process. SSIM, d, and R increase by an average of 30%, 19%, and 9%, respectively. STD, bias, and RMSE are partially improved, but not significantly. MODIS LST and fused LST are roughly similar in these indicators. It is found both the average values of bias and RMSE are above 4 °C at the low spatial resolution before fusion, which indicates Landsat-derived LSTs are 4 °C higher than MODIS LSTs. The average value of bias after fusion is 3.7 °C, which is slightly smaller than that of the original MODIS LSTs, suggesting a minor improvement. Furthermore, it indicates that the fused LST not only possess the spatial pattern of Landsat-derived LST but also inherits most of the temporal properties of MODIS LST at the same time. From the above visual comparison and statistical results, pMSRAFM substantially improved the spatial pattern of MODIS LST products with higher accuracy. The fused LSTs and MODIS LSTs were compared pixel-by-pixel with the reference LSTs to explore changes in evaluation indicators before and after fusion. As shown in Table 3, MODIS LSTs are also relatively consistent with the reference LSTs generated by the SC algorithm. Indicators for evaluating the spatial pattern are improved by the fusion process. SSIM, d, and R increase by an average of 30%, 19%, and 9%, respectively. STD, bias, and RMSE are partially improved, but not significantly. MODIS LST and fused LST are roughly similar in these indicators. It is found both the average values of bias and RMSE are above 4 • C at the low spatial resolution before fusion, which indicates Landsat-derived LSTs are 4 • C higher than MODIS LSTs. The average value of bias after fusion is 3.7 • C, which is slightly smaller than that of the original MODIS LSTs, suggesting a minor improvement. Furthermore, it indicates that the fused LST not only possess the spatial pattern of Landsat-derived LST but also inherits most of the temporal properties of MODIS LST at the same time. From the above visual comparison and statistical results, pMSRAFM substantially improved the spatial pattern of MODIS LST products with higher accuracy.  Figure 12. Figure 12a shows the bias of the HSR fused LSTs and MODIS LSTs against in situ observations. Obviously, the bias is distributed between −3 and 3 • C. After fusion, the absolute value of bias is reduced by 58%, 31%, and 34%, respectively. The fusion process can reduce bias significantly. The relative reduction in bias among three experiments indicates that pMSRAFM performs well in the disaggregation of MODIS LSTs. Figure 12b displays the distribution of errors and their corresponding probabilities. It can be seen 78% of fusion errors in Case 2 are within 2 • C, and the mean absolute error is 1.7 • C. The error distribution shows that the larger the absolute value of the error, the lower the probability of occurrence. The results show that compared with the original MODIS LST product, the fused LST can provide more accurate and reliable information.   Figure 12. Figure 12a shows the bias of the HSR fused LSTs and MODIS LSTs against in situ observations. Obviously, the bias is distributed between −3 and 3°C. After fusion, the absolute value of bias is reduced by 58%, 31%, and 34%, respectively. The fusion process can reduce bias significantly. The relative reduction in bias among three experiments indicates that pMSRAFM performs well in the disaggregation of MODIS LSTs. Figure 12b displays the distribution of errors and their corresponding probabilities. It can be seen 78% of fusion errors in Case 2 are within 2 °C, and the mean absolute error is 1.7 °C. The error distribution shows that the larger the absolute value of the error, the lower the probability of occurrence. The results show that compared with the original MODIS LST product, the fused LST can provide more accurate and reliable information.

Comparison with Reference LSTs with Adjustment
The comparison and analysis of fused LST and reference LST derived from Landsat-8 provide a reasonable assessment of the spatial pattern generated by pMSRAFM. Overall, the above comparisons of the two cases show a high degree of consistency. However, uncertainty remains

Comparison with Reference LSTs with Adjustment
The comparison and analysis of fused LST and reference LST derived from Landsat-8 provide a reasonable assessment of the spatial pattern generated by pMSRAFM. Overall, the above comparisons of the two cases show a high degree of consistency. However, uncertainty remains mainly due to differences in LST observations between the two satellites [45]. The orbit parameters of MODIS and Landsat are approximately equal [43], but the data quality, inversion methods, scale effects, and so on will lead to error. Although the spatial similarity can be evaluated by SSIM and R, the direct comparison pixel values of Landsat-derived LST and fused LST need to be adjusted to reduce the range difference. Theoretically, under clear-sky conditions, the change of LST from 10:00 a.m. to 12:00 a.m. should be approximately a linear upward process. However, these two satellite-derived LSTs did not completely follow this rule in the above two cases (Tables 2 and 3), in which Landsat-derived LSTs are higher than MODIS LSTs. It is worth mentioning that the bias between reference LSTs and fused LSTs is roughly equal to the bias between reference LSTs and MODIS LSTs. That means these differences were transferred from LSR LST to HSR LST. It provides an operational way for adjusting the HSR fused LST, making it closer to the HSR reference LST. So, a cross-scale adjustment was performed by using the bias of the MODIS LST and LST DF LSR as the observation difference (∆LST) of reference LST and fused LST. Because ∆LST only changes the range of the HSR fused LST, without affecting the spatial pattern, d, bias, and RMSE were used to re-evaluate the accuracy of fusion results.
The comparisons between reference LSTs and the fused LSTs before and after adjustments are shown in Table 4. It can be found that the reference LSTs are higher than the fused LSTs with the averaged d of 0.76, bias ranging from −0.03 to 5.42 • C and RMSE ranging from 1.69 • C to 6.04 • C. After the adjustment, d varies from 0.84 to 0.93, bias ranges from −2.0 • C to 2.2 • C, and RMSE ranges from 1.6 • C to 3.4 • C. Except for MOD11L2_b in Case 1, the differences between reference LSTs and fused LSTs are significantly reduced after the adjustment. In Case 1, the adjustment to MOD11L2_b does not reduce errors as much as that to MOD11A1 and MOD11L2_a, which may be related to the data quality of the LSR pixels involved in the fusion. Even so, its evaluation indicator is slightly better than that of MOD11A1. By contrast, the effect of the adjustment is more pronounced in Case 2, with d increased by 15%, bias decreased by 84%, and RMSE decreased by 30%. The improvement of these indicators confirms the rationality of cross-scale adjustment and the spatiotemporal reliability of the fused LSTs. It further indicates that pMSRAFM used for LST fusion can achieve similar results as the Landsat-8 LST retrieval algorithm with a relatively high agreement (the average d of 0.84).

Conclusions
This study proposed an operational data fusion framework, named pMSRAFM, for generating the synthetic LST based on the assumption that the correspondence between LSTs at different resolutions or from different sources can be linearly modeled. The synthetic LST was designed to not only preserve the LSR LST temporal information but also has the spatial pattern of HRS LST. The pMSRAFM was implemented and applied to automatically retrieve LST based on Landsat-8 and MODIS data, which presented several improvements over the previous fusion techniques. The most significant improvement is to incorporate an SWA algorithm to establish the linkage between LSR LST and HSR TIR data. Therefore, it only needs a pair of LSR and HSR data, and the HSR data can be reused. This advantage will be significant when pMSRAFM is applied to reconstruct a long-term LST dataset with high spatial resolution. Another merit of pMSRAFM is to address the spatial heterogeneity using the linear spectral mixture analysis. It disaggregates LSR LST by using land cover data as the main distribution factor and BT, NDVI, DEM as covariates. These auxiliary datasets contribute to pMSRAFM obtain more accurate spatial details to deal with surface heterogeneity and reduce fusion errors. Especially the inclusion of HSR BT data is more effective to account for LST variations. Since the fusion processes are executed automatically to fully explore and make the best use of abundant information derived from the existing satellite data, pMSRAFM can be served as an automated data fusion system for regional environmental change research that expects high-quality LST dataset, but lack of ground observations.
In order to evaluate the performance of pMSRAFM, two cases with six experiments were carried out in the Heihe River basin, China. The fusion accuracy was separately analyzed by comparing the fused LSTs with different reference LSTs. Compared with MODS LSTs, the bias was approximately equal to 0, and RMSE was less than 1.6 • C. When compared with Landsat-derived LSTs, the SSIM was greater than 0.8, d was higher than 0.74, and RMSE was less than 3.4 • C. The errors between the fused LSTs and in situ observations of LST stations within the study area mainly ranged from −2 • C to 2 • C with the average absolute error of 1.3 • C. The comprehensive comparison with MODIS LSTs, Landsat-derived LSTs, and in situ observations proved the rationality and feasibility of pMSRAFM, which could closely capture the spatial pattern of Landsat-derived LST and retain the temporal information of MODIS LST. Besides, the observational differences between these two satellites were fully considered, and the fusion error was estimated by Gaussian distribution. The results revealed that the greater the discrepancy, the lower the probability, and vice versa. Moreover, it also indicated that pMSRAFM could achieve similar results as the Landsat-8 LST retrieval algorithm and the average d was 0.84. Hence, the proposed framework provides a practical method for joint retrieval of LST from MODIS and Landsat data, which would be useful for areas lacking in situ observations. pMSRAFM is not only applicable to the LST fusion of MODIS and Landsat but is also suitable for the data fusion of other LST-related variables to enhance their temporal frequency and spatial resolution. One potential application is to characterize and quantify the spatiotemporal dynamics of regional environment, as well as change detection, target recognition. However, it should be noted that further improvements are needed in three aspects. First, a modular will be introduced to deal with the spatial allocation of fusion error in the follow-up development. Second, subsequent updates will focus more on surface dynamics to improve the reusability of HSR data and predictive ability. Finally, it is essential to incorporate in situ observations into the fusion process for constructing a continuous data series. This study made an initial attempt to develop a multi-source data fusion framework, aiming to generate a more continuous LST series with high spatial resolution. The synthetic products will alleviate the urgent need for timely and accurate monitoring of surface dynamics. In the future, more extensive verification tests will be carried out, with further optimization and improvement of pMSRAFM.
Author Contributions: G.Z. conceived and designed the study. J.T. prepared the observed data. Y.R. and C.L. downloaded and processed the remote sensing imageries. G.Z. developed the framework and wrote the initial manuscript with help from Y.Z. All authors have read and agreed to the published version of the manuscript.