A Scale-Separating Framework for Fusing Satellite Land Surface Temperature Products

: The trade-off between spatial and temporal resolutions of satellite imagery is a long-standing problem in satellite remote sensing applications. The lack of daily land surface temperature (LST) data with ﬁne spatial resolution has hampered the understanding of surface climatic phenom-ena, such as the urban heat island (UHI). Here, we developed a fusion framework, characterized by a scale-separating process, to generate LST data with high spatiotemporal resolution. The scale-separating framework breaks the fusion task into three steps to address errors at multiple spatial scales, with a speciﬁc focus on intra-scene variations of LST. The framework was experimented with MODIS and Landsat LST data. It ﬁrst removed inter-sensor biases, which depend on season and on land use type (urban versus rural), and then produced a Landsat-like sharpened LST map for days when MOIDS observations are available. The sharpened images achieved a high accuracy, with a RMSE of 0.91 K for a challenging heterogeneous landscape (urban area). A comparison between the sharpened LST and the air temperature measured with bicycle-mounted mobile sensors revealed the roles of impervious surface fraction and wind speed in controlling the surface-to-air temperature gradient in an urban landscape.


Introduction
Land surface temperature (LST) is a key driver of surface-air energy exchanges [1]. Satellite LST data are used in a large array of studies, ranging from climate change [2][3][4][5], to agriculture [6,7], forestry [8,9], hydrology [10,11], and ecology [12,13]. Being continuous in space and repetitive in time, satellite-based LST data are especially useful for studies of the urban heat island (UHI) [14][15][16].However, satellite thermal imagery always involves a tradeoff between spatial resolution and temporal coverage [17]. For example, Landsat satellites acquire thermal imageries in fine spatial resolution (better than 120 m) every 16 days. On the other hand, MODIS LST data are provided daily but the spatial resolution is coarser (1 km). This trade-off is a technological bottleneck [18] that limits the utility of satellite LST data in characterizing intra-city variations of the UHI intensity. One solution to overcome the bottleneck is to develop scaling methods and produce LST data at a finer spatial resolution (e.g., the resolution of Landsat) from daily images acquired at a coarser spatial resolution (e.g., MODIS LST data products).
A large body of literature has been published in the past decades on methods to downscale satellite thermal imagery. These methods can be broadly characterized as Disaggregation of Remotely Sensed Land Surface Temperature (DLST) [19]. A common DLST method, termed Thermal Sharpening (TSP), incorporates kernel-based algorithms for downscaling [19,20]. Loosely speaking, the kernel relates the LST derived from the thermal band with the physical properties derived from other bands of the same coarse resolution image. For example, DisTrad [21], or disaggregation procedure for radiometric LST fusion via CNN models is more challenging than surface reflectance fusion for two reasons. First, the surface reflectance is derived from multiple co-registered bands, but the LST data is obtained from a single band. Second, LST images have lower spatial resolutions than surface reflectance images. These characteristics limit the number of learnable features that can be extracted by CNN models for LST fusion.
A third challenge is that CNN models can only be fed with images without missing values. If a portion of the image is contaminated by cloud, the entire image must be discarded from the training set. A possible solution to cloud contamination is training the CNN models using image patches, which are small subsets sampled from the entire scene [17]. The patches with cloud are excluded from training so that the training still makes full use of the clear-sky portion of the scene. The size of patches is a critical parameter that needs careful determination. A large patch size leads to too much data loss when cloudy pixels occur within the patches, while a small patch size limits the within-patch variations of LST that can be learned by the CNN. For this reason, the CNN models in previous studies are based on clear-sky images.
In comparison to a CNN, a regular neural network is trained using single pixels of the LST data. The pixel-level training minimizes the data loss when discarding the cloudy pixels. Although the spatial variation of LST is not learnable for the regular neural network, its non-linearity is still a useful feature to reduce the inter-sensor biases.
In this study, we propose a novel framework for spatiotemporal fusion of LST to achieve the following goals: 1.
To address the non-linearity of inter-sensor LST relationships with incorporation of a neural network, 2.
To capture temporal change in LST in multiple scales, and 3.
To generate high-quality fine resolution LST images in urban areas to support studies of intra-city temperature variations.

Study Area
Our study area is New Haven County, Connecticut, United States ( Figure 1). It encompasses the City of New Haven, West Haven, and their outskirts. New Haven County is located in the temperate climate zone. The mean diurnal change of temperature is about 9.2 • C. The monthly mean air temperature ranges from 2.8 • C to 25.7 • C, with an annual mean of about 12.3 • C. Snowfall occurs from November to April. The total population was 847,000 in 2020 census. New Haven is mostly an industrial city supported by a dense road network (black lines in Figure 1). The compact man-made structures in urban land are surrounded by densely vegetated lands. This highly heterogeneous landscape provides a rigorous test of our model performance.

LST Data
The coarse-resolution LST data is derived from the MODIS satellite. The MOD11A1 Version 6 product provides daily per-pixel LST at 0.928-km spatial resolution. It is available on NASA's Land Processes Distributed Active Archive Center (LP DAAC; https: //lpdaac.usgs.gov/products/mod11a1v006/; accessed date: 20 August 2021). The original sinusoidal pixels were resampled to 100 m then aggregated to 1 km. The local time of observation is 11:48 AM Eastern standard Time.
For fine-resolution LST data, we used Landsat 8 Provisional Surface Temperature (Collection 1), hosted by United States Geological Survey (USGS). The data originated from the Landsat Collection 1 Level-1 thermal infrared (TIR) bands. The TIR band 10 has been converted to surface temperature using an operational calibration methodology [32,33]. In this collection, the atmospheric effects have been corrected using atmospheric profiles from North American Regional Reanalysis (NARR) data. The surface emissivity is obtained from ASTER Global Emissivity Database (GED). Specifically, the ASTER GED emissivity is first corrected to Landsat-8 band 10 by a spectral adjustment. An NDVI-based vegetation

LST Data
The coarse-resolution LST data is derived from the MODIS satellite. The MOD11A1 Version 6 product provides daily per-pixel LST at 0.928-km spatial resolution. It is available on NASA's Land Processes Distributed Active Archive Center (LP DAAC; https://lpdaac.usgs.gov/products/mod11a1v006/; accessed date: 20 August 2021). The original sinusoidal pixels were resampled to 100 m then aggregated to 1 km. The local time of observation is 11:48 AM Eastern standard Time.
For fine-resolution LST data, we used Landsat 8 Provisional Surface Temperature (Collection 1), hosted by United States Geological Survey (USGS). The data originated from the Landsat Collection 1 Level-1 thermal infrared (TIR) bands. The TIR band 10 has been converted to surface temperature using an operational calibration methodology [32,33]. In this collection, the atmospheric effects have been corrected using atmospheric profiles from North American Regional Reanalysis (NARR) data. The surface emissivity is obtained from ASTER Global Emissivity Database (GED). Specifically, the ASTER GED emissivity is first corrected to Landsat-8 band 10 by a spectral adjustment. An NDVI-based vegetation adjustment is then performed to account for inconsistent phenology between ASTER and Landsat scenes [34]. The time of observation is approximately 10:30 AM Eastern Standard Time. The Landsat LST data are archived at 30-m spatial resolution and 16day temporal resolution. In this study, we processed the data to 100-m spatial resolution identical to the original resolving quality of the TIR thermal bands. Cloud, snow and ocean pixels were excluded from model training, but inland water pixels were retained.
In this study, we selected a total of 26 image pairs from 1 January 2013 to 1 May 2020 ( Figure 2). Each pair consists of one Terra MODIS and one Landsat LST image acquired on the same day, the latter of which is referred to as the target Landsat LST. The target Landsat image in each pair has less than 1% missing pixels (due to cloud) in the study domain. All the images were projected to the World Geodetic System 1984 (WGS84) for co-registration. In this study, we selected a total of 26 image pairs from 1 January 2013 to 1 May 2020 ( Figure 2). Each pair consists of one Terra MODIS and one Landsat LST image acquired on the same day, the latter of which is referred to as the target Landsat LST. The target Landsat image in each pair has less than 1% missing pixels (due to cloud) in the study domain. All the images were projected to the World Geodetic System 1984 (WGS84) for co-registration.

Sensor-to-Sensor Biases
We found inter-sensor biases at the scene and the patch scale. For the 26 pairs of MODIS and target Landsat LST images, the scene-average LST shows significant correlation (Figure 3a

Sensor-to-Sensor Biases
We found inter-sensor biases at the scene and the patch scale. For the 26 pairs of MODIS and target Landsat LST images, the scene-average LST shows significant correlation (Figure 3a; R 2 = 0.99). The mean bias (MODIS minus Landsat) is −2.39 K, and the RMSE is 2.8 K. The slope of the regression of MODIS LST versus Landsat LST is less than 1, indicating that MODIS LST is more negatively biased in warmer conditions. The most negative bias (−5.21 K) occurred on 29 July 2019.

Sensor-to-Sensor Biases
We found inter-sensor biases at the scene and the patch scale. For the 26 pairs o MODIS and target Landsat LST images, the scene-average LST shows significant correla tion (Figure 3a; = 0.99). The mean bias (MODIS minus Landsat) is −2.39 K, and th RMSE is 2.8 K. The slope of the regression of MODIS LST versus Landsat LST is less than 1, indicating that MODIS LST is more negatively biased in warmer conditions. The mos negative bias (−5.21 K) occurred on 29 July 2019.
(a) (b) Biases also occurred at the patch-scale, or scale of the MODIS pixels. For patch-scal comparison, we averaged the 10 by 10 Landsat pixies in each MODIS pixel grid to produc the patch-scale, or resampled Landsat LST. We then compared the deviation of the patch  Biases also occurred at the patch-scale, or scale of the MODIS pixels. For patch-scale comparison, we averaged the 10 by 10 Landsat pixies in each MODIS pixel grid to produce the patch-scale, or resampled Landsat LST. We then compared the deviation of the patchscale LST from the scene mean between the resampled Landsat and the MODIS LST in the same image pair. The patch-scale LST is highly correlated (Figure 3b; R 2 = 0.44). However, the slope of the regression (MODIS versus Landsat) is only 0.36. Because the scene-level mean LST values were removed, the scatter seen in Figure 3b was caused by spatial variations across the scene. In other words, spatial variations in the MODIS LST were only 36% of those in the resampled Landsat LST. Furthermore, the linear fit can only explain 43% of the variability, suggesting that a model with higher complexity than linear regression is needed to reduce the inter-sensor error.

Workflow
Our scale-separating framework produces Landsat-like fine-resolution LST with a 3-step workflow. The three steps are: Linear Stretching across Time (LSAT), Neural Network (NN) processing, and Enrichment of Fine-resolution Variation (EOFRV). Figure 4 illustrates the whole workflow. In this example, the goal is to sharpen the MODIS LST observed on 15   In this example, the results can be evaluated (black dashed arrows in Figure 4) at t patch-level (image g versus image f) and at the Landsat pixel resolution (image j vers image k, the target Landsat without scene-scale mean). To sharpen an arbitrary MOD image not in our archive of image pairs, the workflow ends with image j. In this example, the results can be evaluated (black dashed arrows in Figure 4) at the patch-level (image g versus image f) and at the Landsat pixel resolution (image j versus image k, the target Landsat without scene-scale mean). To sharpen an arbitrary MODIS image not in our archive of image pairs, the workflow ends with image j.

Linear Stretching across Time (LSAT)
The seasonality of LST of urban land is generally larger than that of rural land. This contrast is illustrated by the data shown in Figure 5b for an urban (red square, Figure 5a) and a rural patch (green square, Figure 5a). According to the resampled Landsat data, the urban patch experiences a larger seasonal variation of temperature (from 277.32 K on DOY 23, 2015 to 318.73 K on DOY 191, 2018, a range of 41.41 K) than the rural pixel (from 274.54 K to 303.10 K, a range of 28.55 K). Composed of man-made materials, urban patches absorb a large amount of solar radiation, leading to high LST in the summer. For rural patches, evaporative cooling provided by vegetation prevents LST from rising to high values. These urban-rural contrasts in seasonality are much reduced in the MODIS data (range of variation 29.58 K for the urban patch and 29.22 K for the rural patch). The MODIS and Landsat observations of the urban patch produce a linear fit with the slope greater than 1 (slope = 1.32), indicating that the Landsat LST has a larger seasonality than the MODIS LST. In comparison, for the rural patch, the regression slope is slightly less than 1, indicating a relatively slower response of the Landsat LST to seasonal warming than the MODIS LST at the patch scale. The minimum temperature in both MODIS and Landsat LST is slightly higher for the urban patch than for the rural patch, possibly indicating additional heat source due to anthropogenic activities in urban lands. The seasonality of LST of urban land is generally larger than that of rural land. This contrast is illustrated by the data shown in Figure 5b for an urban (red square, Figure 5a) and a rural patch (green square, Figure 5a). According to the resampled Landsat data, the urban patch experiences a larger seasonal variation of temperature (from 277.32 K on DOY 23, 2015 to 318.73 K on DOY 191, 2018, a range of 41.41 K) than the rural pixel (from 274.54 K to 303.10 K, a range of 28.55 K). Composed of man-made materials, urban patches absorb a large amount of solar radiation, leading to high LST in the summer. For rural patches, evaporative cooling provided by vegetation prevents LST from rising to high values. These urban-rural contrasts in seasonality are much reduced in the MODIS data (range of variation 29.58 K for the urban patch and 29.22 K for the rural patch). The MODIS and Landsat observations of the urban patch produce a linear fit with the slope greater than 1 (slope = 1.32), indicating that the Landsat LST has a larger seasonality than the MODIS LST. In comparison, for the rural patch, the regression slope is slightly less than 1, indicating a relatively slower response of the Landsat LST to seasonal warming than the MODIS LST at the patch scale. The minimum temperature in both MODIS and Landsat LST is slightly higher for the urban patch than for the rural patch, possibly indicating additional heat source due to anthropogenic activities in urban lands. The urban patch is dominated by medium-intensity developed area (59%), with the rest area to be high-intensity (31%) and low-intensity development (10%). The rural path is mainly forested (72%), with 25% inland water. The blue square marks a patch of urban land mixed with rural land.
We use a linear stretching model to adjust the MODIS LST pixel-by-pixel (or patchby-patch) according to the LST relationship across time between MODIS and resampled Landsat. From the 26 image pairs, up to 26 LST observations can be found for a given image patch. For a target date, a linear relationship is developed for each patch using the observations on all other dates. In total, there are 22 × 29 linear regression equations (see the example for 29 July 2019 in Figure S1) for the study area. These equations are used to make patch-by-patch correction to produce the adjusted MODIS LST (Figure 4c).

Neural Network
A neural network is used to further reduce the inter-sensor biases. The adjusted MODIS image after the removal of scene average (Figure 4e) is concatenated with three data layers (DOY, IMP, and MODIS NDVI) to produce a 4-band image stack. The DOY band is the DOY value of the target date and is assigned to all the patches. IMP is an index The urban patch is dominated by medium-intensity developed area (59%), with the rest area to be high-intensity (31%) and low-intensity development (10%). The rural path is mainly forested (72%), with 25% inland water. The blue square marks a patch of urban land mixed with rural land.
We use a linear stretching model to adjust the MODIS LST pixel-by-pixel (or patchby-patch) according to the LST relationship across time between MODIS and resampled Landsat. From the 26 image pairs, up to 26 LST observations can be found for a given image patch. For a target date, a linear relationship is developed for each patch using the observations on all other dates. In total, there are 22 × 29 linear regression equations (see the example for 29 July 2019 in Figure S1) for the study area. These equations are used to make patch-by-patch correction to produce the adjusted MODIS LST (Figure 4c).  To evaluate model performance, we selected one image pair at a time for validation and train the model using the other 25 pairs. At the end of each epoch, the validation loss or error is the L1 norm between the predicted LST and the corresponding resampled Landsat LST (Figure 4f). The process is repeated 26 times. The optimal number of epochs (NOE) with the lowest validation error is recorded for each validation.  (Figure 4f). The process is repeated 26 times. The optimal number of epochs (NOE) with the lowest validation error is recorded for each validation.

Enrichment of Fine-Resolution Variations
The within-patch variations at the Landsat pixel scale change with time. The assumption behind the EOFRV is that the temporal change of within-patch variation is caused by the temporal change of local climate if the land cover is unchanged. To validate this, we selected the patch of urban-rural mixture in Figure 5a (blue square) and examined the temporal change of within-patch variations of Landsat LST. For an urban pixel in this patch, the within-patch variation (pixel LST minus patch mean LST) first increases and then decreases with DOY after peaking at DOY 170 (Figure 7, red dots). An opposite trend is observed for a rural pixel, who's within-patch variation first decreases and then rises after DOY 191 (Figure 7, green dots). The variation is positive for the urban pixel and negative for the rural pixel. In other words, the urban pixel is warmer than the patch mean and this deviation is stronger in warm seasons than in cold seasons. On the contrary, the rural pixel is cooler than the patch mean and the cooling tendency is stronger when the local climate becomes warmer. The within-patch variations at the Landsat pixel scale change with time. The assumption behind the EOFRV is that the temporal change of within-patch variation is caused by the temporal change of local climate if the land cover is unchanged. To validate this, we selected the patch of urban-rural mixture in Figure 5a (blue square) and examined the temporal change of within-patch variations of Landsat LST. For an urban pixel in this patch, the within-patch variation (pixel LST minus patch mean LST) first increases and then decreases with DOY after peaking at DOY 170 (Figure 7, red dots). An opposite trend is observed for a rural pixel, who's within-patch variation first decreases and then rises after DOY 191 (Figure 7, green dots). The variation is positive for the urban pixel and negative for the rural pixel. In other words, the urban pixel is warmer than the patch mean and this deviation is stronger in warm seasons than in cold seasons. On the contrary, the rural pixel is cooler than the patch mean and the cooling tendency is stronger when the local climate becomes warmer. The EOFRV predicts within-patch variations using DOY and the patch-level LST, an additional predictor, on the basis of a quadratic equation (Equation (1)), where is the Landsat with-in patch variation for a single pixel and is the mean LST of the patch containing that pixel. In this equation, DOY accounts for seasonal and the day-to-day changes of , and is a dynamic variable accounting for inter-annual changes. The coefficients a, b, c, d, and e were found by a least-square method using the patch and pixel-scale Landsat images on all dates excluding the target date (Figure 4d,k). In all, there are a total of 63,800 sets of coefficients, each corresponding to one Landsat pixel.
The modeled for the urban and the rural pixels in Figure 7 is shown as a function of DOY and patch mean (Figure 8). In this plot, is a convex surface for the urban pixel and a concave surface for the rural pixel. As the patch-level LST increases, the surface skews upward for the urban pixel (Figure 8a), and downward for the rural pixel ( Figure  8b). The model surfaces fit well with the observations (black point in Figure 8). The RMSE values are 0.55 K and 0.56 K for the urban and the rural pixels, respectively. Compared with the ranges (~7 K) of within-patch variations, such error is small. The EOFRV predicts within-patch variations using DOY and the patch-level LST, an additional predictor, on the basis of a quadratic equation (Equation (1)), where T var is the Landsat with-in patch variation for a single pixel and T patch is the mean LST of the patch containing that pixel. In this equation, DOY accounts for seasonal and the day-to-day changes of T var , and T patch is a dynamic variable accounting for inter-annual changes. The coefficients a, b, c, d, and e were found by a least-square method using the patch and pixel-scale Landsat images on all dates excluding the target date (Figure 4d,k). In all, there are a total of 63,800 sets of coefficients, each corresponding to one Landsat pixel. The modeled T var for the urban and the rural pixels in Figure 7 is shown as a function of DOY and patch mean (Figure 8). In this plot, T var is a convex surface for the urban pixel and a concave surface for the rural pixel. As the patch-level LST increases, the surface skews upward for the urban pixel (Figure 8a), and downward for the rural pixel (Figure 8b). The model surfaces fit well with the observations (black point in Figure 8). The RMSE values are 0.55 K and 0.56 K for the urban and the rural pixels, respectively. Compared with the ranges (~7 K) of within-patch variations, such error is small. To apply this model to a sharpening task, is replaced with the predicated patch-level LST (Figure 4h) added the scene-scale mean of adjusted MODIS (Figure 4c). In Figure S2, we show that the model performs similarly well to the results produced by Landsat as the predictor. Unlike the Landsat within-patch variations, the predicted variations of a patch do not strictly center at zero due to statistical noises.

Sharpening an Arbitrary MODIS Image
The workflow for sharpening the MODIS image on an arbitrary date (that is, image not in our archive of 26 image pairs) is similar to that shown in Figure 4, with two differences. First, the LSAT, NN, and the EOFRV models are trained on all the 26 image pairs. Second, steps shown as panels b, d and f are not implemented.

Training and Validation Loss
The training and validation results are summarized Figure S3, and the ensemble mean training and validation losses are shown in Figure S4 To apply this model to a sharpening task, T patch is replaced with the predicated patchlevel LST (Figure 4h) added the scene-scale mean of adjusted MODIS (Figure 4c). In Figure S2, we show that the model performs similarly well to the results produced by Landsat T patch as the predictor. Unlike the Landsat within-patch variations, the predicted variations of a patch do not strictly center at zero due to statistical noises.

Sharpening an Arbitrary MODIS Image
The workflow for sharpening the MODIS image on an arbitrary date (that is, image not in our archive of 26 image pairs) is similar to that shown in Figure 4, with two differences. First, the LSAT, NN, and the EOFRV models are trained on all the 26 image pairs. Second, steps shown as panels b, d and f are not implemented.

Training and Validation Loss
The training and validation results are summarized Figure S3, and the ensemble mean training and validation losses are shown in Figure S4

Accuracy Assessment
Accuracy was evaluated using RMSE against the resampled Landsat after the scenelevel mean value has been subtracted from both. The RMSE was calculated for the whole scene as well as for urban and rural areas. Of note, the scene-level mean was not corrected. In other words, these RMSE values provide accuracy assessment on the spatial variations in LST, not on the LST absolute accuracy. Determination of whether a patch belongs to the urban or rural class was made with the land cover map for 2017 provided by Copernicus Global Land Service (CGLS) (https://lcviewer.vito.be/about; accessed date: 15 July 2021). The map of 100-m resolution was down-sampled to the 1-km patch scale. A patch is classified as urban if half or more of the original pixels were in the urban class. Figure 10 is a summary of the accuracy statistics before (Figure 10a) and after the processing steps (Figure 10b,c). Prior to any processing step, the scene-scale RMSE ranges from 0.77 K to 2.94 K, with a mean of 1.68 K. The urban RMSE ranges from 1.15 K to 4.71 K, with a mean of 2.54 K. The rural RMSE is much lower, ranging from 0.66 to 2.45 K with a mean of 1.41 K. LSAT has reduced these errors, especially in the warm season ( Figure  10b). After LSAT, errors are similar among urban and rural patches. The NN processing has reduced the RMSE further, by an average of 0.12 K (Figure 10c). Although the RMSE change brought by NN is small, the rural and urban errors are nearly identical for most target dates. Furthermore, the NN-predicted LST shows better spatial correlation with the resampled Landsat than the LST with only the LSAT application (see below).

Accuracy Assessment
Accuracy was evaluated using RMSE against the resampled Landsat after the scenelevel mean value has been subtracted from both. The RMSE was calculated for the whole scene as well as for urban and rural areas. Of note, the scene-level mean was not corrected. In other words, these RMSE values provide accuracy assessment on the spatial variations in LST, not on the LST absolute accuracy. Determination of whether a patch belongs to the urban or rural class was made with the land cover map for 2017 provided by Copernicus Global Land Service (CGLS) (https://lcviewer.vito.be/about; accessed date: 15 July 2021). The map of 100-m resolution was down-sampled to the 1-km patch scale. A patch is classified as urban if half or more of the original pixels were in the urban class. Figure 10 is a summary of the accuracy statistics before (Figure 10a) and after the processing steps (Figure 10b,c). Prior to any processing step, the scene-scale RMSE ranges from 0.77 K to 2.94 K, with a mean of 1.68 K. The urban RMSE ranges from 1.15 K to 4.71 K, with a mean of 2.54 K. The rural RMSE is much lower, ranging from 0.66 to 2.45 K with a mean of 1.41 K. LSAT has reduced these errors, especially in the warm season (Figure 10b). After LSAT, errors are similar among urban and rural patches. The NN processing has reduced the RMSE further, by an average of 0.12 K (Figure 10c). Although the RMSE change brought by NN is small, the rural and urban errors are nearly identical for most target dates. Furthermore, the NN-predicted LST shows better spatial correlation with the resampled Landsat than the LST with only the LSAT application (see below).
The improvements brought by LSAT and NN are evident in Figure 11, which compares the original MODIS (panel a), the MODIS after LSAT (panel c) and that after NN (panel d) with the resampled Landsat (panel b) for 23 January 2015. The MODIS LST map without correction appears much smoother than that of the resampled Landsat LST. Their temperature ranges are 3.44 K (Figure 11a) and 6.59 K (Figure 11b), respectively. After LSAT, the temperature range is stretched to 4.56 K (Figure 11c). The resemblance to the resampled Landsat LST is improved with each processing step: the spatial R 2 of the The improvements brought by LSAT and NN are evident in Figure 11, which compares the original MODIS (panel a), the MODIS after LSAT (panel c) and that after NN (panel d) with the resampled Landsat (panel b) for 23 January 2015. The MODIS LST map without correction appears much smoother than that of the resampled Landsat LST. Their temperature ranges are 3.44 K (Figure 11a) and 6.59 K (Figure 11b), respectively. After LSAT, the temperature range is stretched to 4.56 K (Figure 11c). The resemblance to the resampled Landsat LST is improved with each processing step: the spatial of the resampled Landsat is 0.45 with the original MODIS, 0.55 after LSAT and finally 0.86 after both LSAT and NN.  The improvements brought by LSAT and NN are evident in Figure 11, which compares the original MODIS (panel a), the MODIS after LSAT (panel c) and that after NN (panel d) with the resampled Landsat (panel b) for 23 January 2015. The MODIS LST map without correction appears much smoother than that of the resampled Landsat LST. Their temperature ranges are 3.44 K (Figure 11a) and 6.59 K (Figure 11b), respectively. After LSAT, the temperature range is stretched to 4.56 K (Figure 11c). The resemblance to the resampled Landsat LST is improved with each processing step: the spatial of the resampled Landsat is 0.45 with the original MODIS, 0.55 after LSAT and finally 0.86 after both LSAT and NN.  Figure 12 shows the RMSE for the sharpened LST at the 100 m resolution (e.g., Figure 4j). The whole-scene RMSE ranges from 0.58 K to 1.51 K, with a mean of 0.91 K. The RMSE for rural areas shows a similar seasonal variation (0.55 to 1.54 K, mean of 0.91 K). The urban RMSE is also similar, except for the three marked dates. Figure 12 shows the RMSE for the sharpened LST at the 100 m resolution (e.g., Figure  4j). The whole-scene RMSE ranges from 0.58 K to 1.51 K, with a mean of 0.91 K. The RMSE for rural areas shows a similar seasonal variation (0.55 to 1.54 K, mean of 0.91 K). The urban RMSE is also similar, except for the three marked dates.   (Figure 14b). The accuracy is the highest in the spring ( Figure S6), and lowest in the fall (Figure 14a,b).   (Figure 14b). The accuracy is the highest in the spring ( Figure  S6), and lowest in the fall (Figure 14a,b).

Evaluation of Sharpened LST
in the fall (Figure 14a,b).

Comparison with Bilateral Filtering
Our scale-separating (SS) framework has achieved better results than bilateral filtering (BF; Figure 15). We chose to bench-mark our framework against the BF method [27] because the latter provides high-quality LST data for UHI studies. We used a moving window of size 10 × 10 in the bilateral filtering process. The spatial weight and similarity weight were determined with a geometric spread of 0.4 and a temperature spread of 2.5. The weights were found with minimized sharpening error to within a small range (0.1-2.5). The average RMSE for BF is 1.63 K for the entire scene, 1.93 K for urban areas, and 1.53 K for rural areas (Figure 15a). The corresponding RMSE values for our SS framework are 0.91 K, 0.94 K and 0.91 K.

Comparison with Bilateral Filtering
Our scale-separating (SS) framework has achieved better results than bilateral filtering (BF; Figure 15). We chose to bench-mark our framework against the BF method [27] because the latter provides high-quality LST data for UHI studies. We used a moving window of size 10 × 10 in the bilateral filtering process. The spatial weight and similarity weight were determined with a geometric spread of 0.4 and a temperature spread of 2.5. The weights were found with minimized sharpening error to within a small range (0.1-2.5). The average RMSE for BF is 1.63 K for the entire scene, 1.93 K for urban areas, and 1.53 K for rural areas (Figure 15a). The corresponding RMSE values for our SS framework are 0.91 K, 0.94 K and 0.91 K.

Error Analysis
Sharpened LST data are affected by errors at three separate scales (scene, patch and pixel). Inter-sensor biases affect not only the scene-scene mean but also the spatial distri-

Error Analysis
Sharpened LST data are affected by errors at three separate scales (scene, patch and pixel). Inter-sensor biases affect not only the scene-scene mean but also the spatial distribution of LST. In this study, we subtracted the scene-scale mean from the patch-scale temperature to remove the mean bias, but errors in the spatial distribution still remain (Figure 10a).
One established reason for inter-sensor biases is related to the surface emissivity used for MODIS and Landsat LST retrieval. MODIS emissivity is obtained with an NDVI classification method [35]. Landsat emissivity is obtained from the ASTER satellite. The emissivity discrepancy can lead to differences in the conversion from brightness temperature to the LST. However, emissivity parameterization alone cannot fully explain the patch-scale error shown in Figure 3b. A recent study [36] shows that the surface UHI intensity, which is a measure of spatial variations of LST, only changes by about 0.1 K when the Landsat emissivity is replaced with an NDVI-based parameterization used by MODIS.
We postulate that the difference in view angle between MODIS and Landsat sensors may have contributed to these errors. New Haven County is located at the fringe of MODIS swaths, with a view angle of 65 • from zenith and from the west of ground targets [37]. In comparison, the Landsat 8 sensor scans the surface from the nadir point (zenith angle about 0 • ). In other words, the MODIS observations are influenced more by cold shadows than the Landsat observations. As a result of this anisotropic heating, the off-nadir observation will detect weaker thermal radiation than the nadir scan [38,39]. That solar heating difference among ground objects is smaller in the winter and larger in the spring and summer may potentially explain the seasonal variation in the RMSE (Figure 10a).
On several summer dates (DOY 170,191, and 210), urban RMSE is greater than rural RMSE (Figure 10c). We infer from this that the degree of vertical heterogeneity has impacted the quality of the results from LSAT and NN. Natural surfaces are spatially more uniform and vertically more homogeneous than areas occupied buildings extending tens of meters in the vertical direction. These vertical structures will obstruct each other from the satellite view. This inter-occlusion effect in the urban area increases the LST difference between the two satellites, especially in warm seasons when vegetation in rural areas is most homogeneous. As a result, urban areas suffer from larger sharpening errors in the summer than rural areas.
A third type of sharpening error occurs at the pixel scale within the patches. It originates from the prediction of within-patch variations (Figure 4i) made by EOFRV, which are similar but not identical to the within-patch details of target Landsat (Figure 4b). The overall error has increased by 0.22 K after the EOFRV application, indicating that the pixel-scale error is secondary compared with the patch-scale error (0.70 K).
To analyze the source of sharpening error in detail, we now focus on three example dates marked in Figure 12 Figure S11c). This error has propagated to the sharpened MODIS ( Figure S11e), contributing 92% of the final sharpening error (1.48 K), with the rest 9% (0.12 K) explained by the pixel-scale error. The patch-scale error is also the main error source for the other two dates, accounting for 76% and 71% of the final sharpening error for 18 June 2016 and 30 September 2013, respectively. The patch-scale error is consistent because the predicted patch-level LST is biased high by~1 K in rural areas (Figure S11-S13c).
In generally, the pixel-scale error is small and is usually spread homogeneously over the scene. Among all the 1,354,300 pixels, most (93%) are predicted with pixel-scale error below 1 K, and only 130 pixels have significant error over 7 K. The significant errors occurred at fixed locations. For example, the black boxes in Figures S12f and S13f highlight a strip-shape area in rural land with errors up to −12.74 K. The biases in this area are found for nearly all the target dates, sometimes low for the entire shape ( Figure S14a) and sometimes high for parts but low for the others (Figure S14b,c). A satellite map indicates that this area (red dashed line in Figure S15) is a quarry near North Branford, CT. It is mostly bare land in low-lying topography. The quarry was waterlogged from time to time, leading to drastic land cover change in the daily scale. The EOFRV model failed to produce accurate within-patch variations for this area because EOFRV assumes that land cover is constant within the time span of the study.
In addition to quarrying-related land cover changes (labelled as Type A, black points in Figure 16), four other types led to large pixel-scale errors ( Figure S16). Type B is cultivated cropland under the influence of irrigation (yellow points in Figure 16). Type C is herbaceous wetland subject to varying levels of tidal water (blue points in Figure 16). Type D is buildings with white-roof whose reflectivity may have degraded over time (red points in Figure 16). Type E is ponds with occasional green algal blooms (green points in Figure 16). The land cover types of these pixels were variable over different years. Because EOFRV assumes constant land cover types, its predictions were inaccurate in those areas. The influences of such land cover changes are secondary in reference to the overall sharpening errors. According to the error histogram, 99% of the data are within the error bounds of ±1.97 K and 95% within ±1.16 K ( Figure S17).

Comparison with Air Temperature
The scale-separating framework opens new opportunities for UHI studies. To demonstrate this potential, we compared a sharpened MOIDS LST map (Figure 17c) for 12 August 2019 with air temperature (Ta) measured with smart sensors mounted on bicycles [40]. The bicycle measurement took place between 10: 37 and 13:14 local time, along transects that passed through urban core pixels with impervious surface fraction up to 100% as well as rural pixels with nearly zero impervious fraction. The air temperature from mobile sensors was relative to that measured by a stationary sensor. The LST spatial pattern along the bike routes (Figure 17a) bears strong similarity to the spatial variation of Ta (Figure 17b). The spatial correlation between LST and Ta is relatively strong (Figure  17d, p < 0.001). Both Ta and LST increase with increasing impervious surface fraction (Figure 18), conforming the role of impervious surface material in warming the surface and the near-surface atmosphere [41][42][43][44]. Impervious material limits the evaporative cooling

Comparison with Air Temperature
The scale-separating framework opens new opportunities for UHI studies. To demonstrate this potential, we compared a sharpened MOIDS LST map (Figure 17c) for 12 August 2019 with air temperature (Ta) measured with smart sensors mounted on bicycles [40]. The bicycle measurement took place between 10: 37 and 13:14 local time, along transects that passed through urban core pixels with impervious surface fraction up to 100% as well as rural pixels with nearly zero impervious fraction. The air temperature from mobile sensors was relative to that measured by a stationary sensor. The LST spatial pattern along the bike routes (Figure 17a) bears strong similarity to the spatial variation of Ta (Figure 17b). The spatial correlation between LST and Ta is relatively strong (Figure 17d, p < 0.001). Both Ta and LST increase with increasing impervious surface fraction (Figure 18), conforming the role of impervious surface material in warming the surface and the near-surface atmosphere [41][42][43][44]. Impervious material limits the evaporative cooling and leads to higher surface temperature in urban lands than in rural lands [45]. Similar urban-rural differences are formed for air temperature as land surface transports sensible heat to the near-surface atmosphere [46]. Such warming effect is stronger on the surface than on the air, again in agreement with previous modeling studies [41,42]. Compared with rural vegetations, impervious material in urban lands tends to redistribute more solar energy to sensible heat [47], leading to greater vertical temperature gradient (LST minus Ta) [41]. As a result, air temperature is less variable in space than surface temperature and the slope of the linear relationship between LST and Ta is less than unity. Such fine-scale intra-city variability in LST would be impossible to discern using the original MODIS LST because the bicycle transect only covered a small number of MOIDS pixels. Table S1 provides a summary of Ta and LST analysis for the 12 days when both clear sky MODIS and biking data were available. The mean slope of LST against impervious fraction was 0.11. In other words, the range of LST resulting from 0 to 100% imperious fraction is 11.30 K, implying a surface UHI intensity of similar magnitude [48]. The mean slope of Ta against impervious fraction is 0.015, suggesting an air UHI intensity of 1.54 K. Another notable feature is that the regression slope between Ta and LST becomes smaller with increasing wind speed ( Figure S18a), indicating the de-coupling of surface and air temperature in strong-wind conditions. Strong wind also weakens the surface and the air UHI intensity because the slope of LST or Ta versus impervious fraction decreases with increasing wind speed ( Figure S18b,c).  Table S1 provides a summary of Ta and LST analysis for the 12 days when both clear sky MODIS and biking data were available. The mean slope of LST against impervious fraction was 0.11. In other words, the range of LST resulting from 0 to 100% imperious fraction is 11.30 K, implying a surface UHI intensity of similar magnitude [48]. The mean slope of Ta against impervious fraction is 0.015, suggesting an air UHI intensity of 1.54 K. Another notable feature is that the regression slope between Ta and LST becomes smaller (c) (d) Figure 18. Dependence of air temperature (Ta) and LST on impervious surface fraction (IMP) for 12 August 2019.

Conclusions
In this article, we developed a fusion framework to generate new LST data with fine spatiotemporal resolution. The framework is architected on an in-depth understanding of the discrepancy between coarse and fine resolution LST at three separate scales (the scene scale, the patch scale, and the pixel scale). The inter-sensor biases at the patch scale are reduced by a model named Linear Stretching across Time (LSAT) and a neural network (NN) model. The patch-scale outputs of NN are further sharpened to pixel scale (100 m) by a model named Enrichment of Fine-resolution Variations (EOFRV). The framework is designed to improve the intra-scene variation of LST instead of the absolute temperature for advancing UHI studies.
We showed that at the patch (or MODIS pixel) scale, the spatial variations in the MODIS LST are only 36% of those in the resampled Landsat LST. The LSAT model is able to reduce these patch-level biases. The improvement is particularly noteworthy for urban areas, where the error has decreased by 1.63 K. The NN model has improved the accuracy slightly and has homogenized the error distribution over space. We showed that the LST variations within the patches are strongly controlled by the local climate. Based on this understanding, the EOFRV model sharpened the patch-level LST successfully. The overall accuracy of the sharpened MODIS LST is 0.91 K, which outperforms a bilateral filtering method especially in warm seasons. The robustness of the framework is supported by a comparison between the air temperature collected with mobile sensors and the sharpened MODIS LST along an urban street network.
Our scale-separating framework can be improved on several aspects. The remaining error mainly occurs at the patch scale. The error is dependent on season and land cover type. A model that accounts for viewing and illumination effects of ground objects may be able to further quantify and reduce the error. The EOFRV model assumes that no significant change in the landscape has taken place. However, drastic changes in landscape are possible due to human activities (e.g., quarrying, irrigation, and roof coating) and the morphological changes of aquatic organisms. Such land cover changes are found in a few fixed locations in our study domain which can be excluded from the sharpened LST product. Moreover, this assumption may not be accurate in places that have experienced rapid urbanization. Further improvements may be possible with searching for a break point in time when the landscape change took place. Finally, it is possible to refine the cloud-tolerance threshold by incorporating meteorological data so that more partly cloudy image pairs can be added to the image archive for model training.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/rs14040983/s1, Figure S1: Regression statistics for LSAT, Figure S2: EOFRV results using predicted patch-level LST, Figure S3: Training and validation losses for each target date, Figure S4: Ensemble mean training and validation losses, S5: Sharpening results for 4 May 2017, Figure S6: Correlation plot of sharpening results for 4 May 2017, Figure S7: Sharpening results for 29 July 2019, Figure S8: Correlation plot of sharpening results for 29 July 2019, Figure S9: Sharpening results for 28 November 2017, Figure S10: Correlation plot of sharpening results for 28 November 2017, Figure S11: Patch-scale and pixel-scale error for 4 May 2017, Figure S12: Patch-scale and pixel-scale error for 18 June 2016, Figure S13: Patch-scale and pixel-scale error for 30 September 2013, Figure S14: Sharpening residual of the area with significant pixel-scale error, Figure S15: Satellite map of the area with significant pixel-scale error, Figure S16: Locations where significant biases result from land cover changes, Figure S17: Residuals of pixel-scale variations in the form of histograms, Figure S18: Validation using mobile transect data on multiple dates. Table S1: Statistics of validation for each mobile transect.  Data Availability Statement: Publicly available satellite datasets were analyzed in this study. This data can be found here: https://earthexplorer.usgs.gov (accessed date: 20 August 2021). The mobile sensor data are available on request from the corresponding author. The data are not publicly available for safety of data collectors.