Gap Fill of Land Surface Temperature and Reflectance Products in Landsat Analysis Ready Data

The recently released Landsat analysis ready data (ARD) over the United States provides the opportunity to investigate landscape dynamics using dense time series observations at 30-m resolution. However, the dataset often contains data gaps (or missing data) because of cloud contamination or data acquisition strategy, which result in different capabilities for seasonality modeling. We present a new algorithm that focuses on data gap filling using clear observations from orbit overlap regions. Multiple linear regression models were established for each pixel time series to estimate stable predictions and uncertainties. The model’s training data came from stratified random samples based on the time series similarity between the pixel and data from the overlap regions. The algorithm was first evaluated using four tiles (5000 × 5000 30-m pixels for each tile) from 2018 land surface temperature data (LST) in Atlanta, Georgia. The accuracy was assessed using randomly masked clear observations with an average Root Mean Square Error (RMSE) of 3.88 and an average bias of −0.37, which were comparable to the product accuracy. We also applied the method on ARD surface reflectance bands at Fairbanks, Alaska. The accuracy assessment suggested a majority RMSE of less than 0.04 and a bias of less than 0.0023. The gap-filled time series can be of help for reliable seasonal modeling and reducing artifacts related to data availability. This approach can also be applied to other datasets, vegetation indexes, or spectral reflectance bands of other sensors.


Introduction
Satellite images provide valuable geospatial data for characterizing land cover and land cover dynamics. However, tradeoffs between spatial and temporal resolution often limit applications of different types of satellite images. For example, the Advanced Very-High-Resolution Radiometer (AVHRR) and Moderate Resolution Imaging Spectroradiometer (MODIS) sensors can provide daily global observations that are valuable for monitoring rapid land surface changes. But the coarse spatial resolution is often inadequate for highly heterogeneous areas like urban landscapes. In contrast, Landsat data provide sufficient spatial detail for monitoring land conditions [1,2], but the 16-day revisit cycle has limited use for studying land changes like land surface temperature and vegetation phenology; the revisit cycle is only eight days during times when two Landsat satellites have been operating simultaneously.
Previous efforts have attempted to fuse MODIS and Landsat images to complement the spatial resolution of Landsat with the temporal frequency of MODIS. For example, Gao et al. [3] proposed a spatial and temporal adaptive reflection fusion model (STARFM) to estimate Landsat-scale observations on MODIS acquisition dates. The enhanced STARFM ESTARFM was developed to handle

Surface Reflectance Data for Fairbanks
The Landsat ARD surface reflectance data (tile H07V05) in 2017 were acquired from USGS EarthExplorer [11]. The data were processed to the highest level of geometric and radiometric quality. The tile has a total of 175 scenes with clear observations, with as many as 37 in a single year. The acquired images were clipped to the study area, and only observations that were flagged as clear or water in the pixel QA band were used (Figure 1c). The second study area is Fairbanks, Alaska (Figure 1c), and comprises 689 × 472 30-m pixels. Fairbanks has a subarctic climate with severe winters, no dry season, cool, short summers, and strong seasonality. Fairbanks is the second most populous metropolitan area in Alaska. The monthly average temperature in 2017 ranged from −22.9 • C in January to 18.7 • C in July [29]. According to the NLCD 2016 [30,31], the primary land covers include woody wetland (23%), deciduous forest (17.9%), low density developed (15.7%), and evergreen forest (15.4%).

Landsat Land Surface Temperature (LST) for Atlanta
The Level-2 Provisional Surface Temperature in 2018 was acquired from USGS EarthExplorer [12,13]. The product was processed to a 30-meter spatial resolution in Albers Equal Area (AEA) projection and gridded to the ARD tiling scheme (5000 × 5000 30-m pixels). Four tiles were acquired: H23V13 and H23V14 each have 76 scenes, and H24V13 and H24V14 each have 80 scenes. A validation study suggested that the product had a mean bias (root mean square error) of 0.7 K (2.2 K) and 0.9 K (2.3 K) for the four surface radiation budget network sites, and −0.3 K (0.6 K) and 0.4 K (0.7 K) for inland water bodies (Salton Sea and Lake Tahoe, respectively, in the USA) [32]. The pixel quality assurance (QA) band was also acquired along with the LST product, which flagged cloud-contaminated observations at the pixel level. We used the pixel QA to extract clear observations or water observations that were considered cloud-free in this study. Figure 1b shows the number of clear LST observations in 2018 for the four tiles.

Surface Reflectance Data for Fairbanks
The Landsat ARD surface reflectance data (tile H07V05) in 2017 were acquired from USGS EarthExplorer [11]. The data were processed to the highest level of geometric and radiometric quality. The tile has a total of 175 scenes with clear observations, with as many as 37 in a single year. The acquired images were clipped to the study area, and only observations that were flagged as clear or water in the pixel QA band were used (Figure 1c).

Method
The orbit overlap region often offers better capability in annual land surface modeling because of higher data density [22]. We assume that a given pixel time series in the ARD tile can be modeled as a linear combination of a set of pixel time series from the orbit overlap region. Therefore, we can utilize the time series seasonality from the overlap region to model time series of the given pixel in the same tile [33]. The time series model contains predictions at each time stamp of the overlap region, which is used to fill data gaps in the given pixel. The single band LST was used to test and validate the method at broader and heterogeneous regions; thereafter, the method was applied to multi-bands of surface reflectance for Fairbanks. It is worth noting that this method aims to improve the capability of seasonal modeling. Though LST could change quickly in time and space, we assume that the observed LST from the same date can be used to estimate LST at another location with similar seasonal patterns. For example, the reconstructed LST is derived from LST under clear sky rather than the actual cloudy LST, which could lead to biased LST estimation at the daily scale but have a limited impact on the seasonal modeling.

Gap Filling
In this study, the data gap is defined as absent observations due to the orbit coverage or cloud-contaminated observations based on the pixel QA information. Figure 2 illustrates the overall conceptual approach used in the study. First, we used time series from the orbit overlap region to select the training data source. Second, the training data sources were clustered based on the Euclidean distance (time series similarity), so the training samples could be stratified based on their similarity to the target pixel. Third, we sampled training data from multiple clusters because some mixed pixel time series could have combined seasonality from different landscapes. After that, these training samples were used to create linear regression models to predict time series of any target pixel.  Figure 3 illustrates the workflow of the gap-filling algorithm. In the first step, the overlap region in each tile is extracted using the threshold that divides two peaks (overlap and non-overlap region) in the histogram of clear observations. Because time series images in overlap regions also have cloudcontaminated data, those training data need to be preprocessed. Roy and Yan [22] suggested that the 7-parameter linear harmonic model (Equation (1)) can well represent the annual Landsat time series variation when there are at least 21 annual clear observations. For this study, all overlap regions have more than 21 clear observations, while non-overlap regions often have insufficient data. Thus, the 7parameter linear harmonic model is used to replace those cloud-contaminated data, which maintains the details of seasonality in the training data (Equation (1)) [20,22].
where f(t) is the modeled time series for a single pixel location in the overlap region; a0 describes the mean of f(t) over the time series; am and bm are coefficients for harmonic component m; t is day of year;  Figure 3 illustrates the workflow of the gap-filling algorithm. In the first step, the overlap region in each tile is extracted using the threshold that divides two peaks (overlap and non-overlap region) in the histogram of clear observations. Because time series images in overlap regions also have cloud-contaminated data, those training data need to be preprocessed. Roy and Yan [22] suggested that the 7-parameter linear harmonic model (Equation (1)) can well represent the annual Landsat time series variation when there are at least 21 annual clear observations. For this study, all overlap regions have more than 21 clear observations, while non-overlap regions often have insufficient data. Thus, the 7-parameter linear harmonic model is used to replace those cloud-contaminated data, which maintains the details of seasonality in the training data (Equation (1)) [20,22].  After the training data were prepared, we used the ensemble method to estimate values at gaps in any target pixel. First, we built multiple independent prediction models from randomly selected training data. Then we used the median of multiple model outputs as the final prediction. The median was used to reduce the impact of anomalies in predictions. This model ensemble concept is robust to occasional outliers in the training data (i.e., cloud-contaminated data) [34]. Following is the detailed procedure for each target pixel: 1. Cluster the training data-We clustered the training data into 20 groups using the KMeans analysis [35] and estimated the averaged time series for each cluster as centers.
2. Stratified random selection for the target pixel-We calculated the Euclidean distances from cluster centers to the target pixel and used inverse distances as the weights to randomly sample a total of 100 pixels from the training data. The stratified sampling strategy emphasizes training data with similar seasonality but is not restricted by clusters. Therefore, the approach is robust when the target pixel has a similar distance to multiple cluster centers, which could be the case for After the training data were prepared, we used the ensemble method to estimate values at gaps in any target pixel. First, we built multiple independent prediction models from randomly selected training data. Then we used the median of multiple model outputs as the final prediction. The median was used to reduce the impact of anomalies in predictions. This model ensemble concept is robust to occasional outliers in the training data (i.e., cloud-contaminated data) [34]. Following is the detailed procedure for each target pixel:

1.
Cluster the training data-We clustered the training data into 20 groups using the KMeans analysis [35] and estimated the averaged time series for each cluster as centers.

2.
Stratified random selection for the target pixel-We calculated the Euclidean distances from cluster centers to the target pixel and used inverse distances as the weights to randomly sample a total of 100 pixels from the training data. The stratified sampling strategy emphasizes training data with similar seasonality but is not restricted by clusters. Therefore, the approach is robust when the target pixel has a similar distance to multiple cluster centers, which could be the case for mixed pixels.

3.
Predict the full time series via linear regression-We created a regression model in Equation (2) and predicted values at gaps based on clear observations from the target pixels. In this study, we used the LASSO regression [36] to build regression models, which automatically selects the most relevant training samples.
where β 0 and β i represent the linear parameter to be estimated and represents the error terms. X i and y are sampled training data and the target pixel time series at y are clear observations dates. We repeated the last two steps 10 times for each target pixel and used the median value at each time step as the final prediction. Moreover, we calculated the standard deviation (STD) of iterations for each prediction as an indicator of uncertainty. The results have the same structure as ARD tiles with all missing data filled.

Implementation of the Method on Surface Reflectance
We collected training data from the whole tile to fill in gaps in the study area. This tile has 6 orbit overlaps, but no distinctive overlap versus non-overlap regions. Thus, we used 21 as the region threshold, which is the majority of clear observations and the minimum of clear observations to build the harmonic model (Equation (1)) as suggested by Roy and Yan [22]. As a demonstration, we applied the method to the blue, green, red, near-infrared (NIR), Shortwave infrared 1 (SWIR1), and Shortwave infrared 2 (SWIR2) bands.

Validation Strategies
The algorithm was tested with randomly selected clear observations. In detail, we randomly masked 1,000 clear observations from each tile as reference data to validate model predictions of LST at Atlanta, Georgia. Fairbanks is a much smaller region, so we randomly masked 100 clear observations for validation and conducted the validation 10 times, which also resulted in 1000 validation observations. The root mean square error (RMSE) between the observed and filled pixel values was calculated with respect to the LST and each spectral band.

Gap-Filled LST
Based on the histogram of clear observations, the minimum value between the two peaks (overlap and non-overlap region) was selected as the threshold (H23V13 > 21, H23V14 > 23, H24V13 > 25, H24V14 > 27). In order to avoid possible boundary issues between tiles, we combined all overlap regions as the training data. Figure 4a shows an example of the model-predicted LST for June 18, 2018, in tile H23V13 for Atlanta. The original Landsat LST only covered the eastern part of the tile (Figure 4c) and was contaminated by scattered clouds (Figure 4d). By closely comparing the two maps, the predicted LST (left side of the dashed line in Figure 4f showed a similar value range to the Landsat LST observations (right side of the dashed line in Figure 4f. Both observations and predictions showed high LST at bare surface and low LST along the river (Figure 4f,g). The spatial distribution is smoother on the left side of Figure 4f than the right side. The uncertainty map (Figure 4b) shows 51.3% of total pixels have the prediction uncertainty less than 4 • Kelvin, and 47% of total pixels have valid observations. The spatial pattern of high uncertainty pixels (>4 • Kelvin, brown color in Figure Figure 5 represents the time series of observed and predicted land surface temperature at eight ground station locations in Atlanta. The air temperature is also displayed as a reference of the seasonal temperature pattern at the study area. The predicted values followed well with observations except for several scattered outliers in Figure 5d,f, while both observed and predicted summer LSTs are higher than the station air temperatures. The two outliers in Figure 5d with very low LST came from two Landsat 8 images (26 May 2018 and 29 July 2018) that were mostly covered by clouds, but pixel QA only flagged part of the images as cloud or cloud shadow. Similarly, the low LST at the peak of the summer in Figure 5f also came from a cloud-contaminated observation (Landsat 7 image from 12 July 2018).  Figure 5 represents the time series of observed and predicted land surface temperature at eight ground station locations in Atlanta. The air temperature is also displayed as a reference of the seasonal temperature pattern at the study area. The predicted values followed well with observations except for several scattered outliers in Figure 5d,f, while both observed and predicted summer LSTs are higher than the station air temperatures. The two outliers in Figure 5d with very low LST came from two Landsat 8 images (26 May 2018 and 29 July 2018) that were mostly covered by clouds, but pixel QA only flagged part of the images as cloud or cloud shadow. Similarly, the low LST at the peak of the summer in Figure 5f also came from a cloud-contaminated observation (Landsat 7 image from 12 July 2018).  Figure 6 shows the model-estimated LST in relation to 1000 randomly masked clear observations from each tile in Atlanta. All the data in the scatterplots fall close to the 1:1 line, indicating the algorithm can capture the diverse LST caused by seasonality and cover types. In order to assess the accuracy quantitatively, the RMSE and bias (average difference) of observed and predicted LST were calculated ( Table 1). The model-estimated LST has a similar range of RMSE and bias compared to the Landsat LST product accuracy, indicating the algorithms have successfully predicted the LST. The dot colors in Figure 6 represent the standard deviation of predictions. Those predictions with high uncertainty or observations distant from the linear relationship mostly occurred at low temperature  Figure 6 shows the model-estimated LST in relation to 1000 randomly masked clear observations from each tile in Atlanta. All the data in the scatterplots fall close to the 1:1 line, indicating the algorithm can capture the diverse LST caused by seasonality and cover types. In order to assess the accuracy quantitatively, the RMSE and bias (average difference) of observed and predicted LST were calculated ( Table 1). The model-estimated LST has a similar range of RMSE and bias compared to the Landsat LST product accuracy, indicating the algorithms have successfully predicted the LST. The dot colors in Figure 6 represent the standard deviation of predictions. Those predictions with high uncertainty or observations distant from the linear relationship mostly occurred at low temperature observations that might come from cloud-contaminated observations that were labeled as clear by pixel QA.

Validation Against the Landsat LST Observations
Remote Sens. 2020, 12, x FOR PEER REVIEW 11 of 17 observations that might come from cloud-contaminated observations that were labeled as clear by pixel QA.       The validation shows that the estimated surface reflectance closely matches the actual observations (1:1 line) (Figure 8). Some estimations show large variation while observations are consistent at low or high values in visible bands. As those validation data were randomly picked during the process, it is impossible to examine their condition. However, we went through all 175 gap-filled images and found cloud or cloud shadows were sometimes labeled as clear observation by the Landsat pixel QA band, which could lead to the consistently high or low values at the visible bands. The cloud impact also occurred at shortwave infrared (SWIR) bands as negative values. For all six bands, the majority of RMSEs are less than 0.04 and all biases was less than 0.003 (Table 2). We also conducted t-tests ( Table 2) that evaluated whether the predicted values were significantly The validation shows that the estimated surface reflectance closely matches the actual observations (1:1 line) (Figure 8). Some estimations show large variation while observations are consistent at low or high values in visible bands. As those validation data were randomly picked during the process, it is impossible to examine their condition. However, we went through all 175 gap-filled images and found cloud or cloud shadows were sometimes labeled as clear observation by the Landsat pixel QA band, which could lead to the consistently high or low values at the visible bands. The cloud impact also occurred at shortwave infrared (SWIR) bands as negative values. For all six bands, the majority of RMSEs are less than 0.04 and all biases was less than 0.003 (Table 2). We also conducted t-tests ( Table 2) that evaluated whether the predicted values were significantly different from the observations. As p-values suggested, the prediction and observations had no significant difference at the 1% level, though they tended to differ in bands with a small value range.

Gap-Filled Surface Reflectance
Remote Sens. 2020, 12, x FOR PEER REVIEW 13 of 17 different from the observations. As p-values suggested, the prediction and observations had no significant difference at the 1% level, though they tended to differ in bands with a small value range. Figure 8. Scatterplots of the observed (x-axis) and the predicted reflectance (y-axis) for the blue, green, red, and near-infrared, shortwave infrared 1, and shortwave infrared 2 bands. Figure 8. Scatterplots of the observed (x-axis) and the predicted reflectance (y-axis) for the blue, green, red, and near-infrared, shortwave infrared 1, and shortwave infrared 2 bands.

Discussion
Filling data gaps and reducing the variation of data availability is essential for common time series studies [18,22]. Many gap-filling approaches have been developed to fill missing data [22][23][24][25][26], but gap filling in large areas is often challenging using those approaches, especially in heterogeneous regions [8]. This paper presents an effective method to fill gaps in the Landsat ARD products. We used clustering analysis to find orbit overlap region pixels with similar time series to the given pixel and used LASSO regression to estimate their weights. More importantly, we iteratively predicted values using multiple random selections of pixels, which reduced the potential impacts of noise in training data and allowed us to estimate the uncertainty of model predictions. The model uncertainty revealed the stability of model predictions instead of the prediction error. Qualitatively, the results for predicted LST are promising when compared to the spatial patterns of the high-resolution image (Figure 4). The predicted LST showed more spatial detail and variation than LST observations because the raw LST product was resampled to 30-m from 60-m Landsat 7 and 100-m Landsat 8 while the gap-filling procedure was conducted at the 30-m scale. Temporally, the gap-filled data revealed more detailed seasonal variation ( Figure 5), which was also confirmed by Yan and Roy [38].
Quantitative validations were undertaken to gain insights into the LST prediction accuracy. Randomly masked 1000 LST observations were compared to model-estimated LST for each tile. The average −0.37 bias and 3.88 RMSE was comparable to the LST product bias (RMSE) of 0.7 K (2.2 K) and 0.9 K (2.3 K) for the four surface radiation budget network sites [32]. For all six spectral bands, the majority of RMSEs were less than 0.04 and all biases were less than 0.0023. In contrast, the MODIS fusion method spatial and temporal adaptive reflectance fusion model (STARFM) had prediction biases of 0.0026 (green band), 0.0040 (red band), and 0.0060 (NIR band) [4]. The enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM) had prediction biases of 0.0028 (green band), 0.0021 (red band), and 0.0022 (NIR band) [4]. Some cloud-contaminated observations were labeled as clear by pixel QA (Figure 8), which could lead to the overestimation of RMSE in both LST and surface reflectance bands in this study.
Some concerns need further investigation. The first concern is data gaps in the overlap region, which could also be considerable. For the current study areas and target time, the number of clear observations (>21) within a year is enough to generate seven-parameter harmonic models for all four Atlanta tiles. According to Egorov et al. [21], the average annual numbers of clear observations are 4.85 (Landsat 4 TM), 16.41 (Landsat 5 TM), 15.03 (Landsat 7 ETM+), and 21.22 (all three Landsat sensors) across the contiguous United States. Therefore, it could be challenging to model observations before 1999 when only one Landsat sensor was available, or for areas with frequent cloud cover or snow cover. An alternative solution is to use the five-parameter nonlinear harmonic model, which needs at least 15 clear observations [22]. Another concern is abrupt changes and short-term changes in data gaps. Conceptually, predicting those changes relies on whether they are captured by training data. But the temporal fitting in the training data may overlook abrupt changes [8]. Using auxiliary data such as climate information or MODIS LST may help avoid the issue. For example, Zhu et al. [5] attempted to predict the changes by fusing MODIS data.
With more and more sensors (e.g., Landsat, Sentinel-2) harmonized, characterizing landscape dynamics at high spatiotemporal resolution attracts groundbreaking attention. However, it also causes complicated data availability variance, while the most traditional time series analysis was designed for data with the same data availability. Our approach attempts to solve the issue, which could be a critical preprocessing step for many time series analyses. Our approach makes the most of original Landsat observations from the same acquisition date, and its cloud-contaminated training data is robust. The current approach used 20 clusters, 100 random samples for each of the 10 iterations, which can be customized according to the computation efficiency. The source code will be available at the GitHub. Finally, Landsat ARD were used in this study for availability and validation convenience, but the approach can be easily applied to another dataset and other vegetation indexes or spectral reflectance bands.

Conclusions
This paper presents the results of a new method of time series gap filling that is designed for multi-sensor and multi-time data harmonization. This method uses pixels from the orbit overlap region to fill data gaps based on time series similarity, which retains the observation variation. Model assembling procedures were used to estimate stable predictions, which is not only robust to occasionally cloud-contaminated training data, but also allowed us to estimate the uncertainty of the predictions.
The LST from four ARD tiles (5000 × 5000 30-m pixels for each tile) for Atlanta, Georgia, was tested and validated; the tiles contained heterogeneous land cover types and various temporal variations. The validation procedure suggested good linear relationships with 1000 randomly masked clear observations. More importantly, the application on spectral bands in Fairbanks, Alaska, showed that this approach can effectively fill data gaps related to cloud contamination and sensor orbit models. The gap-filled time series can be of help with reliable seasonal modeling and reducing artifacts related to data availability.