All-Sky 1 km MODIS Land Surface Temperature Reconstruction Considering Cloud Effects Based on Machine Learning

: A high spatio-temporal resolution land surface temperature (LST) is necessary for various research ﬁelds because LST plays a crucial role in the energy exchange between the atmosphere and the ground surface. The moderate-resolution imaging spectroradiometer (MODIS) LST has been widely used, but it is not available under cloudy conditions. This study proposed a novel approach for reconstructing all-sky 1 km MODIS LST in South Korea during the summer seasons using various data sources, considering the cloud effects on LST. In South Korea, a Local Data Assimilation and Prediction System (LDAPS) with a relatively high spatial resolution of 1.5 km has been operated since 2013. The LDAPS model’s analysis data, binary MODIS cloud cover, and auxiliary data were used as input variables, while MODIS LST and cloudy-sky in situ LST were used together as target variables based on the light gradient boosting machine (LightGBM) approach. As a result of spatial ﬁve-fold cross-validation using MODIS LST, the proposed model had a coefﬁcient of determination ( R 2 ) of 0.89–0.91 with a root mean square error ( RMSE ) of 1.11–1.39 ◦ C during the daytime, and an R 2 of 0.96–0.97 with an RMSE of 0.59–0.60 ◦ C at nighttime. In addition, the reconstructed LST under the cloud was evaluated using leave-one-station-out cross-validation (LOSOCV) using 22 weather stations. From the LOSOCV results under cloudy conditions, the proposed LightGBM model had an R 2 of 0.55–0.63 with an RMSE of 2.41–3.00 ◦ C during the daytime, and an R 2 of 0.70–0.74 with an RMSE of 1.31–1.36 ◦ C at nighttime. These results indicated that the reconstructed LST has higher accuracy than the LDAPS model. This study also demonstrated that cloud cover information improved the cloudy-sky LST estimation accuracy by adequately reﬂecting the heterogeneity of the relationship between LST and input variables under clear and cloudy skies. The reconstructed all-sky LST can be used in a variety of research applications including weather monitoring and forecasting.


Introduction
Land surface temperature (LST) is the radiative temperature of the Earth's surface. LST plays a key role in the energy balance between the atmosphere and the ground surface [1,2]. A high spatiotemporal resolution LST is widely required for various research fields such as heat flux monitoring, the biogeochemical cycle, and land surface process modeling [3][4][5][6][7]. Therefore, it is essential to produce spatiotemporally seamless LST in an accurate manner.
LST can be monitored with high temporal resolution at weather stations. However, LST measured at weather stations is point-scale, inherently aspatial. The number of weather stations is insufficient due to the considerable labor and cost associated with their management and maintenance [8]. So, there is a limitation on spatially monitoring LST based on weather stations in vast areas, particularly when topographical complexity is high. Owing to the exponential growth of satellite remote sensing fields over the past decades, Remote Sens. 2022, 14, 1815 3 of 20 This study aims to construct an all-sky 1 km MODIS LST over South Korea using the light gradient boosting machine (LightGBM) approach, considering the cloud effects on LST. LightGBM uses a gradient boosting framework with high computational speed and efficient memory usage without compromising performance. MODIS data, in situ observations, and the high-resolution Local Data Assimilation and Prediction System (LDAPS) output were used for MODIS LST reconstruction. The proposed approach is expected to produce reliable all-sky LST by considering the relationships between LST and predictand variables both under clear and cloudy skies. The key objectives of this study were to (1) develop an all-sky 1 km MODIS LST reconstruction model, (2) evaluate the developed model based on systematic validations using two different LST sources (i.e., MODIS LST and in situ observations), and (3) examine whether the model incorporating the cloud effect improves the estimation accuracy of cloudy-sky LST.

Study Area
The study area is South Korea (~99,840 km 2 ; 124 • -130 • E, 33 • -39 • N), which is located in Northeast Asia ( Figure 1). South Korea has a hot and humid summer owing to the North Pacific air mass and cold winters due to dry continental high pressure. The annual mean temperature in South Korea is approximately 10-15 • C, with a mean temperature of 23-26 • C in August, the hottest month. Additionally, due to the East Asian monsoon, it experiences extremely humid summers with concentrated precipitation (i.e., the annual rainfall was 1306.3 mm on average across 1991-2020 with a mean summer rainfall of 710.9 mm). This results in a high cloud cover rate in South Korea during the summer. Therefore, this research focused on summer seasons (i.e., June to August across 2013-2020).
found that the use of in situ cloudy-sky LST observations can be beneficial to the c eration of the cloud effects on LST in all-sky LST reconstruction models. However, the sparse distribution of in situ measurements, relying exclusively on in situ observ may limit spatial representation. Thus, it is necessary to develop an effective all-sky MODIS LST reconstruction approach that combines MODIS and in situ cloudy-sky and high-resolution numerical models.
This study aims to construct an all-sky 1 km MODIS LST over South Korea usi light gradient boosting machine (LightGBM) approach, considering the cloud effe LST. LightGBM uses a gradient boosting framework with high computational spee efficient memory usage without compromising performance. MODIS data, in situ vations, and the high-resolution Local Data Assimilation and Prediction System (LD output were used for MODIS LST reconstruction. The proposed approach is expec produce reliable all-sky LST by considering the relationships between LST and pred variables both under clear and cloudy skies. The key objectives of this study were develop an all-sky 1 km MODIS LST reconstruction model, (2) evaluate the deve model based on systematic validations using two different LST sources (i.e., MOD and in situ observations), and (3) examine whether the model incorporating the clo fect improves the estimation accuracy of cloudy-sky LST.

Study Area
The study area is South Korea (~99,840 km 2 ; 124°-130°E, 33°-39°N), which is lo in Northeast Asia ( Figure 1). South Korea has a hot and humid summer owing North Pacific air mass and cold winters due to dry continental high pressure. The a mean temperature in South Korea is approximately 10-15 °C, with a mean temperat 23-26 °C in August, the hottest month. Additionally, due to the East Asian mons experiences extremely humid summers with concentrated precipitation (i.e., the a rainfall was 1306.3 mm on average across 1991-2020 with a mean summer rainfall o mm). This results in a high cloud cover rate in South Korea during the summer. The this research focused on summer seasons (i.e., June to August across 2013-2020).

MODIS Data
This study used daily MODIS LST products (MOD11A1 for Terra and MYD11A1 for Aqua) with a spatial resolution of 1 km. The MODIS LST is retrieved from two different TIR bands (i.e., band 31 with 10.78-11.28 µm and band 32 with 11.77-12.27 µm) via a Remote Sens. 2022, 14, 1815 4 of 20 generalized split-window algorithm [41]. Terra MODIS provides LST observed at 10:30 a.m. and 10:30 p.m., and Aqua MODIS provides LST observed at 1:30 p.m. and 1:30 a.m. local time. The MODIS LST data from 2013 to 2020 were obtained through Earthdata Search (https://search.earthdata.nasa.gov/search; assessed on 1 August 2020). In addition, annual MODIS land cover (MCD12Q1) was employed for data processing and analysis.

In Situ LST Data
Hourly in situ LST observations (i.e., 1 a.m./p.m., 2 a.m./p.m., 10 a.m./p.m., and 11 a.m./p.m.) were acquired from automated surface observing systems (ASOS) managed by the Korea Meteorological Administration (KMA). ASOS provides LST data measured directly at 0 cm via a platinum resistance temperature sensor. There are 102 ASOSs in South Korea, but the stations that changed their location between 2013 and 2020 were excluded. In addition, we excluded the stations with a calibration error (root mean square error; RMSE) >3 • C from the bias correction described in Section 3.1 to minimize the spatial discrepancy between in situ measurements and satellite data. Finally, data collected from a total of 22 ASOS stations were used in this study.

LDAPS Model Data
The KMA is currently operating the LDAPS model based on UK Met Office's Unified Model (UM model), which uses non-hydrostatic dynamics, semi-Lagrangian advection, and semi-implicit time-stepping [42]. LDAPS uses the Global Data Assimilation and Prediction System (GDAPS) with a coarse spatial resolution of 10-25 km as initial and boundary conditions. LDAPS produces analysis data eight times a day with 3 h intervals (i.e., 0, 3, 6, 12, 15, 18, and 21 UTC). It has a spatial resolution of 1.5 km and 70 vertical levels in the Korean Peninsula and the surrounding seas. This study used several surface grid data from LDAPS: LST and four parameters-air temperature (Tair), relative humidity (RH), wind speed (WS), and precipitation (Ppt)-closely related to the spatio-temporal changes of LST [43].

Auxiliary Data
The elevation, slope, impervious area ratio, mean WS, day of the year (DOY), longitude, and latitude were used as auxiliary variables to consider the spatial and temporal variation of LST. The elevation, slope, and impervious area ratio provide information on topography and landcover that affect the spatial variability of LST [43,44]. Elevation was constructed from the Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) with a spatial resolution of 90 m (https://earthexplorer.usgs.gov; assessed on 1 August 2021). The slope was retrieved using DEM data with the "Slope" tool of Spatial Analyst Toolbox in ArcGIS. The Global Human Settlement (GHS) built-up dataset with a 250 m spatial resolution was employed as built-up area information (https://ghsl.jrc.ec.europa.eu/ghs_ bu2019.php; assessed on 1 August 2021). The mean WS was extracted from the Global Wind Atlas, which offers local wind climates with a 250 m spatial resolution. Longitude and latitude were also extracted from the grid of MODIS LST to account for clear-sky LST pixels in the vicinity of missing LST under the cloud. Lastly, DOY was used after conversion to a range between −1 and 1 through cosine transformation to represent the temporal variability of LST.

Data Processing
This study used the MODIS LST time series filtered to include only high-quality data via the quality flag. However, since MODIS LST is of lower quality in built-up areas than in other land covers [45], there are insufficient samples for a machine learning model to learn LST characteristics in built-up areas. To develop an LST reconstruction model that can reflect all land surface characteristics in the study area, this study used MODIS LST in the built-up areas without the quality-screening process. In addition, the MODIS quality flag contains information about the existence of clouds in each pixel. We extracted the cloud cover from the MODIS quality flag and used it to consider the varied relationship between LST and input variables on clear and cloudy skies.
The LDAPS data with 3 h intervals were linearly interpolated to the MODIS observation time and bilinearly resampled to 1 km spatial resolution. The hourly observed in situ LST was also linearly interpolated to match the MODIS view time. The point-scale in situ LST measurements showed a trend of increasing more rapidly than the 1 km MODIS LST at high temperatures in summer. To reduce the spatial scale discrepancy between MODIS LST and in situ observations, a polynomial regression-based bias correction was applied [15]. Considering the distribution range of daytime LST lowered by cloud in summer, MODIS and in situ LSTs from May to October were used for the bias correction. Figure 2 summarizes the entire procedure of this study. The LDAPS model's analysis data (i.e., LST, Tair, RH, WS, and Ppt), binary MODIS cloud cover, and seven auxiliary variables (i.e., elevation, slope, impervious area ratio, mean WS, DOY, longitude, and latitude) were used as input data, while the MODIS LST and bias-corrected in situ cloudysky LST were used as reference data. The MODIS LST was sampled over the entirety of South Korea, while the cloudy-sky LST was collected from 22 ASOS stations. Using LightGBM, an all-sky MODIS LST reconstruction model was developed for each MODIS view time.

Data Processing
This study used the MODIS LST time series filtered to include only high-quality data via the quality flag. However, since MODIS LST is of lower quality in built-up areas than in other land covers [45], there are insufficient samples for a machine learning model to learn LST characteristics in built-up areas. To develop an LST reconstruction model that can reflect all land surface characteristics in the study area, this study used MODIS LST in the built-up areas without the quality-screening process. In addition, the MODIS quality flag contains information about the existence of clouds in each pixel. We extracted the cloud cover from the MODIS quality flag and used it to consider the varied relationship between LST and input variables on clear and cloudy skies.
The LDAPS data with 3 h intervals were linearly interpolated to the MODIS observation time and bilinearly resampled to 1 km spatial resolution. The hourly observed in situ LST was also linearly interpolated to match the MODIS view time. The point-scale in situ LST measurements showed a trend of increasing more rapidly than the 1 km MODIS LST at high temperatures in summer. To reduce the spatial scale discrepancy between MODIS LST and in situ observations, a polynomial regression-based bias correction was applied [15]. Considering the distribution range of daytime LST lowered by cloud in summer, MODIS and in situ LSTs from May to October were used for the bias correction. Figure 2 summarizes the entire procedure of this study. The LDAPS model's analysis data (i.e., LST, Tair, RH, WS, and Ppt), binary MODIS cloud cover, and seven auxiliary variables (i.e., elevation, slope, impervious area ratio, mean WS, DOY, longitude, and latitude) were used as input data, while the MODIS LST and bias-corrected in situ cloudysky LST were used as reference data. The MODIS LST was sampled over the entirety of South Korea, while the cloudy-sky LST was collected from 22 ASOS stations. Using LightGBM, an all-sky MODIS LST reconstruction model was developed for each MODIS view time.

Light Gradient Boosting Machine
Ensemble machine learning approaches (e.g., random forest (RF) and extreme gradient boosting (XGBoost)) have been widely used for producing an all-sky LST in many previous studies [14,36,40]. Both RF and XGBoost grow multiple decision trees and then

Light Gradient Boosting Machine
Ensemble machine learning approaches (e.g., random forest (RF) and extreme gradient boosting (XGBoost)) have been widely used for producing an all-sky LST in many previous studies [14,36,40]. Both RF and XGBoost grow multiple decision trees and then average their outputs to obtain a final prediction for regression, generally yielding better performance than single decision trees. However, they are slow and ineffective when training with a large number of samples [46]. One solution to this problem is to reduce the number of samples, but there is no clear method for optimizing data sampling.
LightGBM was proposed to accelerate the training process without a reduction in accuracy [47,48]. LightGBM belongs to the Gradient Boosting Decision Tree (GBDT) algorithm, which approximates a gradient decent step in the direction of minimizing the loss function (residual errors). Existing GBDT methods (e.g., XGBoost) use a level-wise tree growth strategy that keeps the trees balanced, which often takes a great deal of time to optimize. On the other hand, LightGBM uses the leaf-wise tree growth strategy, which splits the leaf with the most loss and does not bother the remaining leaves on the same level, saving processing time. Additionally, LightGBM employs Gradient-Based One Side Sampling (GOSS), a sampling method based on gradients. The key concept of GOSS is that the instances with large gradients contribute much more to growing a decision tree than those with small gradients. Consequently, GOSS keeps the instances with large gradients, but it performs random sampling on those with small gradients. Through the leaf-wise tree growth strategy and GOSS, LightGBM has a fast training speed with low memory consumption and often performs better than other boosting algorithms [47,49]. Therefore, this study adopted LightGBM to produce all-sky LST for a relatively long-term period (i.e., 2013-2020), with more than 10 million samples per MODIS view time [50,51].
We implemented LightGBM using the 'lightgbm' package in Python. The hyperparameters of LightGBM include max_depth, min_data_in_leaf, num_leaves, and n_estimators. The max_depth parameter presents the maximum depth of a tree. The min_data_in_leaf and the num_leaves parameters determine the minimum number of samples in a leaf and the number of leaves, respectively. These three parameters control the overfitting of the LightGBM model. Lastly, the n_estimators parameter indicates the number of trees. Table 1 shows the combination of hyperparameters tested for model optimization in this study. The LightGBM optimization process in this study is described in Section 3.3.

Performance Evaluation of the Proposed Approach
Three different cross-validation approaches were conducted to evaluate the robustness and generalization of the proposed model. First, random five-fold cross-validation (RDCV) was performed using MODIS LST to evaluate the reconstructed clear-sky LST. To evaluate the spatial distribution of the simulated LST, spatial five-fold cross-validation (SPCV) was also conducted. Finally, leave-one-station-out cross-validation (LOSOCV) using 22 ASOSs was performed to evaluate the estimated cloudy-sky LST. In the cross-validation process, the training and validation sets were randomly divided by 8:2 from the total dataset, except for the independent test data. For example, RDCV randomly divided the total dataset into five subsets, where one subset was used for test data and the remaining four subsets were divided again by 8:2 for training and validation. The optimal hyperparameters were selected based on the lowest validation RMSE (Equation (1)) through the grid search method. This process was repeated five times, and then RDCV performance was obtained by averaging the test result for each fold. This hyperparameter tuning and evaluation approach based on cross-validation has been widely adopted in recent studies that used machine learning [52][53][54].
where y i andŷ i are measured and predicted values, respectively. In addition, this study analyzed the effects of the learning strategy using in situ cloudysky LST as a target variable and the cloud cover as the input variable for cloudy-sky LST estimation. Two scenarios-LightGBM without both in situ cloudy-sky LST and cloud cover data (scenario 1) and LightGBM without cloud cover data (scenario 2)-were further compared to the proposed LightGBM model with all input variables. We analyzed the spatial difference between the LSTs from the LDAPS and LightGBM models and MODIS LST (Figure 4). LDAPS showed a generally higher LST distribution than MODIS over most of the study area. In particular, LDAPS provided much higher LST We analyzed the spatial difference between the LSTs from the LDAPS and LightGBM models and MODIS LST (Figure 4). LDAPS showed a generally higher LST distribution than MODIS over most of the study area. In particular, LDAPS provided much higher LST than MODIS in urban areas during both day and night, but significantly lower LST than MODIS in coastal areas during the daytime. This might be because GDAPS with a 10-25 km coarse spatial resolution is used as the initial conditions of LDAPS, resulting in high uncertainty and variation in coastal areas. On the other hand, the proposed LightGBM had similar spatial distribution as MODIS LST, yielding little difference between the two.  The performances of LDAPS and LightGBM by land cover are depicted in Figure 5. The LightGBM outperformed LDAPS in the four different land covers for all MODIS view times, producing higher R 2 and lower RMSE. Except for built-up areas, the proposed LightGBM model had an R 2 ≥ 0.80 and RMSE ≤ 1.6 °C during the daytime, and an R 2 ≥ 0.95 and RMSE ≤ 0.7 °C at nighttime. Among four different land covers, the LightGBM model generally had the highest accuracy in forest areas but was less accurate in the built-up class than other land covers. The performances of LDAPS and LightGBM by land cover are depicted in Figure 5. The LightGBM outperformed LDAPS in the four different land covers for all MODIS view times, producing higher R 2 and lower RMSE. Except for built-up areas, the proposed LightGBM model had an R 2 ≥ 0.80 and RMSE ≤ 1.6 • C during the daytime, and an R 2 ≥ 0.95 and RMSE ≤ 0.7 • C at nighttime. Among four different land covers, the LightGBM model generally had the highest accuracy in forest areas but was less accurate in the built-up class than other land covers.

Results
times, producing higher R 2 and lower RMSE. Except for built-up areas, the proposed LightGBM model had an R 2 ≥ 0.80 and RMSE ≤ 1.6 °C during the daytime, and an R 2 ≥ 0.95 and RMSE ≤ 0.7 °C at nighttime. Among four different land covers, the LightGBM model generally had the highest accuracy in forest areas but was less accurate in the built-up class than other land covers.

Clear-Sky
Cloudy-Sky  Figure 6). LightGBM resulted in lower RMSE values than LDAPS at most stations under cloudy conditions for all four MODIS view times. Additionally, we analyzed the temporal variation of LST estimated by the LDAPS and LightGBM models at station 8 in August 2013, when it was very hot with a high cloud cover rate ( Figure S1). For all four MODIS view times, LDAPS tended to overestimate LST compared to the in situ measurements. On the other hand, LightGBM showed good agreement with the in situ LST without such a bias. Figures 7 and 8 depict the spatial maps of the average all-sky 1 km LST of the LDAPS and LightGBM models for daytime and nighttime, respectively, in June-August from 2013 to 2020. The daytime and nighttime LSTs of both models showed a negative spatial relationship with elevation (see Figure 1). While LDAPS showed a relatively smoothed spatial pattern of LST, LightGBM produced spatially detailed LST distribution. Figure 9 displays the original MODIS LSTs and the corresponding all-sky LSTs of LDAPS and LightGBM at both daytime and nighttime on 6 August 2016. As shown in Figure 4, the daily LST difference between MODIS and LDAPS was evident. Although the overall LST spatial patterns of LDAPS were similar to MODIS, they were spatially smoothed. On the contrary, the proposed model successfully filled the missing LST pixels without the smoothing effect.   Figure 1). While LDAPS showed a relatively smoothed spatial pattern of LST, LightGBM produced spatially detailed LST distribution. Figure 9 displays the original MODIS LSTs and the corresponding all-sky LSTs of LDAPS and LightGBM at both daytime and nighttime on 6 August 2016. As shown in Figure 4, the daily LST difference between MODIS and LDAPS was evident. Although the overall LST spatial patterns of LDAPS were similar to MODIS, they were spatially smoothed. On the contrary, the proposed model successfully filled the missing LST pixels without the smoothing effect.

Model Evaluation using MODIS LST and In Situ LSTs
The performances of the LDAPS and LightGBM models during the daytime were lower than at night (Figure 3). In particular, both models showed the highest rRMSE for Aqua daytime and the lowest rRMSE for Aqua nighttime. This result is because the daytime LST becomes more unstable than the nighttime LST due to downward solar radiation, which causes a more heterogeneous spatial thermal distribution on the land surface in the daytime [5,37,56]. Unlike the proposed model, LDAPS showed a significant performance difference between day and night of up to 2 °C RMSE. LDAPS has a spatially smoothed thermal distribution due to the surface properties roughly considered in the model [57][58][59], which may result in a high error during the daytime with a relatively high spatial variation in the thermal distribution.

Model Evaluation Using MODIS LST and In Situ LSTs
The performances of the LDAPS and LightGBM models during the daytime were lower than at night (Figure 3). In particular, both models showed the highest rRMSE for Aqua daytime and the lowest rRMSE for Aqua nighttime. This result is because the daytime LST becomes more unstable than the nighttime LST due to downward solar radiation, which causes a more heterogeneous spatial thermal distribution on the land surface in the daytime [5,37,56]. Unlike the proposed model, LDAPS showed a significant performance difference between day and night of up to 2 • C RMSE. LDAPS has a spatially smoothed thermal distribution due to the surface properties roughly considered in the model [57][58][59], which may result in a high error during the daytime with a relatively high spatial variation in the thermal distribution.
In addition, the LDAPS LST showed a high spatial difference compared to the MODIS LST (Figure 4), which could be because LDAPS considers land cover types but does not adequately account for environmental variables such as vegetation abundance and soil moisture [60]. On the other hand, the LightGBM-derived LST showed a spatial distribution very similar to that of MODIS. These results imply that the proposed model produced MODIS-like LST well under clear-sky conditions. Among several types of vegetation, the LightGBM model generally had the highest accuracy in forest areas, with relatively lower performance in grass and cropland areas ( Figure 5). This result is consistent with previous studies [15,37]. The built-up class showed a relatively low accuracy compared to other land covers, which may be due to the high uncertainty of MODIS LST over urban areas [19,45].
Clear-sky LSTs from MODIS, LDAPS, and LightGBM showed a relatively low correlation with the bias-corrected in situ observations during the daytime (Table 2). To analyze the low correlation between MODIS and in situ LSTs under clear-sky conditions during the daytime, we analyzed the correlation between the clear-sky LST observed at 22 ASOS stations for all MODIS view times ( Figure S2). The temporal correlation matrix between in situ clear-sky LSTs observed at 22 ASOS stations mostly showed R 2 > 0.6 at nighttime, while it tended to have low R 2 < 0.4 during the daytime. Since the spatial distribution of LSTs becomes unstable due to incoming solar radiation during the daytime [5,37,56], it is not surprising that the MODIS and in situ LSTs have a low correlation under clear-sky conditions during the daytime.
During the daytime under cloudy conditions, LDAPS and LightGBM showed relatively higher R 2 values than the clear-sky conditions, which is likely due to the decreased incoming solar radiation by cloud [3,61], causing LST to have a somewhat smaller dynamic range. The RMSE of LightGBM differed by station during the daytime (Figure 6). There was a significant, positive correlation (R 2~0 .46) between the standard deviation of the bias-corrected in situ cloudy-sky LST and the RMSE of LightGBM. One of the reasons may be that the proposed approach considered the cloud effects using an instantaneous binary cloud cover, which can make it difficult to represent the temporal variability of the cloudy-sky LST. In the future, the temporal variation of cloud cover should be considered to further improve the cloudy-sky LST estimation accuracy. Nevertheless, LightGBM showed higher cloudy-sky LST estimation accuracy than LDAPS overall. Therefore, the validation results using MODIS LST and in situ measurements demonstrated that the proposed model successfully reconstructed all-sky MODIS LSTs, outperforming LDAPS.

Variable Importance and Effects of Using a Cloud Cover for Estimating Cloud-Sky LST
This study additionally used in situ LST measurements under cloudy skies as the dependent variable, and binary cloud cover as input data to reflect the cloud effects on LST when reconstructing all-sky MODIS LST. The variable importance of the LightGBM was assessed to analyze which variables contributed to the all-sky LST reconstruction model ( Figure 10). The LDAPS model's meteorological variables (e.g., RH, WS, and LST) generally showed high variable importance, as they are closely related to the spatial and temporal variability of LST [43]. The topographical variables such as elevation, latitude, and longitude also had high variable importance, which may have contributed to enhancing the spatial details of the LDAPS data. However, the importance of cloud cover was low. This could be because the relative variable importance of LightGBM was calculated by counting the times a feature was used to split a node in training samples. There were much fewer cloudy-sky LSTs (approximately 14,000) sampled from 22 ASOS stations compared to the clear-sky MODIS LST samples (more than 10 million), which indicates the variable importance of the proposed LightGBM model was greatly affected by the clear-sky LST samples. Thus, the variable importance results (Figure 10) should be interpreted mainly for clear-sky conditions.
Under the cloudy conditions, we further investigated the impact of using the in situ cloudy-sky LST and cloud cover data in the LightGBM model by selectively excluding them (Table 3). Comparing the proposed model and two scenarios (i.e., scenario 1 and scenario 2) for clear-sky LST, they showed comparable performance, having slightly different R 2 (~0.02) and RMSE (~0.02 • C) for all MODIS view times.
For cloudy-sky LST, however, the three models showed a noticeable difference in the accuracy metrics. Among the three models (i.e., the proposed model, scenario 1, and scenario 2), scenario 1 had the lowest R 2 and highest RMSE for each MODIS view time, showing a warm bias ≥1 • C during the daytime and cold bias ≤−1 • C at nighttime. Scenario 2 yielded better LST estimation accuracy metrics under the cloud than scenario 1, which implies that it is beneficial to learn the LST information under the cloud from in situ observations for all-sky LST reconstruction. Compared to scenario 1, scenario 2 had a smaller bias, but still greater than 1 • C, except for Terra daytime. The warm bias during the daytime and the cold bias at nighttime in both scenarios 1 and 2 are due to the cloud effects on LST, with clouds cooling the LST during the daytime by reducing downwelling shortwave radiation and warming the LST at nighttime by increasing downwelling longwave radiation [22,23]. Although both MODIS LST and in situ cloudy-sky LST were used to train the model, the majority of training samples were clear-sky MODIS LST, which may have limited the model's learning of cloudy-sky LST characteristics. In comparison to the two scenarios, the proposed model had the highest accuracy metrics with a much lower bias. These results demonstrate that the binary cloud cover input data adequately reflected the cloud effects considering the heterogeneity of the relationship between LST and other input variables under both clear and cloudy skies. Under the cloudy conditions, we further investigated the impact of using the in situ cloudy-sky LST and cloud cover data in the LightGBM model by selectively excluding them (Table 3). Comparing the proposed model and two scenarios (i.e., scenario 1 and scenario 2) for clear-sky LST, they showed comparable performance, having slightly different R 2 (~0.02) and RMSE (~0.02 °C) for all MODIS view times.
For cloudy-sky LST, however, the three models showed a noticeable difference in the accuracy metrics. Among the three models (i.e., the proposed model, scenario 1, and scenario 2), scenario 1 had the lowest R 2 and highest RMSE for each MODIS view time, showing a warm bias ≥1 °C during the daytime and cold bias ≤−1 °C at nighttime. Scenario 2 yielded better LST estimation accuracy metrics under the cloud than scenario 1, which implies that it is beneficial to learn the LST information under the cloud from in situ observations for all-sky LST reconstruction. Compared to scenario 1, scenario 2 had a smaller bias, but still greater than 1 °C, except for Terra daytime. The warm bias during the daytime and the cold bias at nighttime in both scenarios 1 and 2 are due to the cloud effects  Table 3. LOSOCV results of the proposed LightGBM and two scenario models (i.e., scenario 1 and scenario 2) for clear and cloudy-sky LSTs.

MODIS Time
Proposed model Scenario 1 Scenario 2

Spatial Distribution Assessment of All-Sky LSTs
Both LDAPS and LightGBM had a similar dynamic range of LST during the daytime, whereas LDAPS showed significantly wider nighttime LST ranges than LightGBM (Figures 7 and 8). Under clear-sky conditions, LDAPS had a higher LST range than MODIS during both the daytime and nighttime. In addition, the comparison between the LDAPS LST and 22 in situ observations ( Figure S3) revealed that the LDAPS had a warm bias in cloudy-sky conditions as well as in clear skies at nighttime, but LightGBM did not show such a bias (see Table 2). Moreover, the cloud makes LST cooler during the daytime and warmer during the nighttime. That is, LDAPS had a higher LST than MODIS under clear skies during the daytime, but it was smoothed by the cooling effect of the clouds, which resulted in similar LST ranges for both LDAPS and LightGBM. However, due to the warm bias of LDAPS under both clear and cloudy skies and the warming effect of clouds at night, LDAPS has a wider LST range than LightGBM.
The daytime and nighttime LSTs of LDAPS and LightGBM over the forest areas were relatively lower than those of the other land covers because forests contain various types of vegetation capable of photosynthesis and transpiration that help reduce the heat over the regions [62,63]. Meanwhile, the built-up areas had a higher LST than other land covers due to the huge amount of impervious surfaces [5,64]. These LST spatial distributions are coincident with the results of Yoo et al. (2020) [15]. Compared to LDAPS, LightGBM produced LST spatial distribution maps reflecting the spatial variation of LST according to the land cover and topography in South Korea. These spatial pattern differences were clearly seen in the metropolitan areas (boxes 1 and 2 in Figures 7 and 8). LDAPS resulted in clustered hotspots that did not reflect the complex urban environment, whereas LightGBM not only yielded a higher LST in cities than their surrounding areas but also showed the spatial variability of LST within the cities. The LightGBM successfully reconstructed the all-sky MODIS LST (Figure 9), showing spatial patterns consistent with the original MODIS LST. These results indicate that the missing MODIS LST, mainly due to clouds, can be effectively filled seamlessly through the proposed reconstruction model.

Novelty and Limitations
Most previous LST gap-filling studies applied the relationship between the clear-sky LST and input variables to fill the missing MODIS LST by cloud. To consider the cloud effects on LST, this study used in situ LST measurements under cloud and cloud cover data along with high-resolution LDAPS meteorological data based on the LightGBM machine learning model. The proposed strategy in this study showed much higher cloudy-sky LST estimation accuracy metrics with RMSE of 2.41-3.00 • C during the daytime and 1.31-1.37 • C at nighttime than the approach proposed by , which is the same as scenario 2 in this study [15]. In addition, all-sky 1 km MODIS-like LST can be produced near real-time since this study used the analysis data of the operational LDAPS model. Thus, the generated all-sky LST data can be used for practical purposes such as initializing numerical weather forecasting models or assisting in emergency planning.
However, the proposed approach in this study has some limitations. Firstly, since LDAPS produces meteorological data at 3-h intervals, we linearly interpolated them for each MODIS view time and then used them as input data. Because LST has a relatively high diurnal variation, especially during the daytime, the time difference between LDAPS production and MODIS observation can introduce potential uncertainty in the LST reconstruction process. It is expected that the accuracy of LST reconstruction could be further improved when numerical model data with an hour interval or higher temporal resolution are used. Another limitation is the insufficient in situ observations, especially in urban areas where MODIS LST has relatively high uncertainty. Since the all-sky LST spatial distributions of LDAPS and LightGBM were qualitatively compared in two metropolitan cities in South Korea, additional quantitative validation across multiple cities is necessary using in situ observations.

Conclusions
This study developed an all-sky 1 km MODIS LST reconstruction model based on LightGBM to fill missing MODIS LST caused mainly by clouds in South Korea during summer seasons. The analysis data of LDAPS, seven auxiliary variables, and binary MODIS cloud cover were used as input variables, while MODIS LST and bias-corrected in situ cloudy-sky LST were used as dependent variables together. The RDCV and SPCV using MODIS LST showed that the proposed model produced a better performance than LDAPS, with an R 2~0 .9 and RMSE ≤ 1.4 • C and 0.6 • C during the daytime and nighttime, respectively. We evaluated the proposed model under cloudy conditions through LOSOCV using in situ data from 22 weather stations. Under cloudy conditions, LDAPS had an R 2 of 0.54-0.59 and RMSE of 2.84-3.34 • C during the daytime, and an R 2 of 0.72-0.75 and RMSE of 1.96-2.21 • C at nighttime. The proposed model had an R 2 comparable to LDAPS, but it had a lower RMSE of 2.41-3.00 • C during the daytime and 1.31-1.36 • C at nighttime. These results revealed that the proposed model successfully reconstructed the all-sky 1 km MODIS LST. The model performance was improved by incorporating the in situ cloudy-sky LST and MODIS cloud cover data, which adequately reflected the cloud effect on LST.
In this study, instantaneous cloud information was used to reconstruct the all-sky 1 km MODIS LST. In the future, if the temporal variations of cloud-covered conditions such as cumulative incoming solar radiations suggested by Zhao and Duan (2020) are considered using geostationary satellite data, the all-sky MODIS LST reconstruction can be further improved [61]. Although this study used the LDAPS model available for South Korea, we believe that the proposed model could be effectively applied in other countries using WRF or other high-resolution numerical models for all-sky MODIS LST reconstruction.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/rs14081815/s1, Figure S1: Time-series of the estimated LST by LDAPS and LightGBM models, and bias-corrected in situ LST at station 8 for four MODIS view times (i.e., Terra and Aqua for daytime and nighttime) in August 2013. Figure