Estimating Land Surface Temperature from Satellite Passive Microwave Observations with the Traditional Neural Network, Deep Belief Network, and Convolutional Neural Network

Neural networks, especially the latest deep learning, have exhibited good ability in estimating surface parameters from satellite remote sensing. However, thorough examinations of neural networks in the estimation of land surface temperature (LST) from satellite passive microwave (MW) observations are still lacking. Here, we examined the performances of the traditional neural network (NN), deep belief network (DBN), and convolutional neural network (CNN) in estimating LST from the AMSR-E and AMSR2 data over the Chinese landmass. The examinations were based on the same training set, validation set, and test set extracted from 2003, 2004, and 2009, respectively, for AMSR-E with a spatial resolution of 0.25◦. For AMSR2, the three sets were extracted from 2013, 2014, and 2016 with a spatial resolution of 0.1◦, respectively. MODIS LST played the role of “ground truth” in the training, validation, and testing. The examination results show that CNN is better than NN and DBN by 0.1–0.4 K. Different combinations of input parameters were examined to get the best combinations for the daytime and nighttime conditions. The best combinations are the brightness temperatures (BTs), NDVI, air temperature, and day of the year (DOY) for the daytime and BTs and air temperature for the nighttime. By adding three and one easily obtained parameters on the basis of BTs, the accuracies of LST estimates can be improved by 0.8 K and 0.3 K for the daytime and nighttime conditions, respectively. Compared with the MODIS LST, the CNN LST estimates yielded root-mean-square differences (RMSDs) of 2.19–3.58 K for the daytime and 1.43–2.14 K for the nighttime for diverse land cover types for AMSR-E. Validation against the in-situ LSTs showed that the CNN LSTs yielded root-mean-square errors of 2.10–4.72 K for forest and cropland sites. Further intercomparison indicated that ~50% of the CNN LSTs were closer to the MODIS LSTs than ESA’s GlobTemperature AMSR-E LSTs, and the average RMSDs of the CNN LSTs were less than 3 K over dense vegetation compared to NASA’s global land parameter data record air temperatures. This study helps better the understanding of the use of neural networks for estimating LST from satellite MW observations.


AMSR-E and AMSR2 Data
AMSR-E is a twelve-channel and six-frequency passive microwave radiometer that measures the BTs of the Earth's surface at 6.925, 10.65, 18.7, 23.8, 36.5, and 89.0 GHz. Vertically and horizontally polarized measurements are taken at all frequencies. AMSR-E was onboard the Aqua satellite from 2002 to 2011. The native spatial resolution is~5 km at 89.0 GHz and 60 km at 6.9 GHz. The AMSR-E BTs used here are from the AMSR-E level 3 product with a spatial resolution of 0.25 • . This product is the daily average of BT in level 1B product and projected by equi-rectangular. As the successor of AMSR-E, AMSR2 is onboard the Global Change Observation Mission 1st-Water (GCOM-W1) satellite and has seven frequencies (i.e., 6.925, 7.3, 10.65, 18.7, 23.8, 36.5, and 89.0 GHz) in both horizontal and vertical polarizations. Compared with AMSR-E, AMSR2 has higher spatial resolutions. The native spatial resolution of AMSR2 is 3 × 5 km at 89.0 GHz and 35 × 62 km at 6.925 GHz. The AMSR2 BTs used here are from the AMSR2 level 3 product with a spatial resolution of 0.1 • . This product is also daily average data of level 1B data and projected by equi-rectangular. The overpass time of AMSR-E and AMSR2 are both~13:30 (ascending) and~01:30 (descending) local solar time. The level 3 products of AMSR-E and AMSR2 were downloaded at https://gportal.jaxa.jp/gpr/.

MODIS Land Surface Products
Three MODIS land surface products were used to parameterize MW emissivity, including MODIS/Aqua Snow Cover Daily L3 (MYD10C1), MODIS/Terra+Aqua Land Cover Type Yearly L3 (MCD12C1), and MODIS/Aqua Vegetation Indices 16-Day L3 (MYD13C1). Spatially averaged MYD11C1 LST was used as the target LST of the samples. All products were in version 6. MODIS LST is currently the most accurate and widely used TIR satellite LST product and has been used as the target LST to establish the MW LST retrieval model [12,17,20,34] and to retrieve MW LSE [28,36,37]. Validation results over homogeneous surfaces showed that MODIS LST products generally have higher accuracies than 1.0 K [38,39]. The aforementioned MYD**C1 products have a spatial resolution of 0.05 • and were derived from Aqua observations. Thus, these products achieved better spatial and temporal matching with the AMSR-E and AMSR2 pixels. The snow cover (SC, from MYD10C1), Remote Sens. 2020, 12, 2691 4 of 27 NDVI (from MYD13C1), land cover type percent (LCTP, from MCD12C1, i.e., the percent cover of 17 IGBP classes at each 0.05 • pixel), and LST (from MYD11C1) were upscaled to the spatial resolution of AMSR-E and AMSR2 by spatial averaging. In addition, the latest MODIS LST and emissivity product MYD21A1 with a spatial resolution of 1 km was used to (i) derive the MODIS channel emissivities for the calculation of the broadband emissivity (BBE) of the ground stations and (ii) to quantify the thermal heterogeneity within the MW pixels. These MODIS land products were downloaded from EARTHDATA (https://search.earthdata.nasa.gov).

Reanalysis and Assimilation Datasets
To quantify the atmospheric conditions for the corresponding MW observations, we collected the 2-meter air temperature (AT-2m) and total precipitable water vapor (TPWV) from the second Modern-Era Retrospective analysis for Research and Applications (MERRA-2) from https://disc.gsfc. nasa.gov. In addition, the surface skin temperature was also extracted to synchronize the Aqua MODIS with AMSR-2 observations. The dataset used in this study was tavg1_2d_slv_Nx, which has a grid size of 0.5 • × 0.625 • and temporal resolution of 1 hour from 00:30 UTC.
The soil moisture (SM) values in four layers (i.e., 0-10 cm, 10-40 cm, 40-100 cm, and 100-200 cm) were extracted from the Global Land Data Assimilation System (GLDAS) NOAH025_3H (in version 2.1) with a grid size of 0.25 • × 0.25 • and temporal resolution of 3 hours. The GLDAS dataset was also downloaded from https://disc.gsfc.nasa.gov. The 1-hourly MERRA-2 product and 3-hourly GLDAS product were interpolated to the overpass time of the MW sensors and the spatial resolution of the MW BT products.

In-Situ Measurements
The in-situ measurements of five ground stations (CBS, TYU, DXI, SDQ, and HMO) with ground longwave radiation were collected to validate the LST estimates. Details about the five stations are shown in Table 1. Among these five stations, the measurements of CBS were from ChinaFlux (http://www.chinaflux.org/); the measurements of TYU were from the CEOP Asia-Australia Monsoon Project (CAMP) [40]; the measurements of DXI were from the Haihe experiments in the Hai River Basin, China [41]; and the measurements of SDQ and HMO were from the Heihe Watershed Allied Telemetry Experimental Research (HiWATER) program [42][43][44]. The elevations of these stations range from 20 m to 1050 m. The instruments for measuring the outgoing and incoming longwave radiation are the Kipp and Zonen CNR1 (spectral range: 5-50 µm; uncertainty in daily total: 10%) at CBS, DXI, and HMO; Kipp and Zonen CG4 pyrgeometer (spectral range: 4.5-42 µm; uncertainty in daily total: 3%) at TYU; and Kipp and Zonen CNR4 (spectral range: 4.5-42 µm; uncertainty in daily total: <10%) at SDQ. Although CNR1, CG4, and CNR4 have different uncertainties, the average errors of incoming and outgoing longwave radiation are~6 and 3 W/m 2 [45,46], respectively, leading to an in-situ LST uncertainty of~0.6 K [3]. According to Stefan-Boltzmann's law, in-situ LST can be calculated from the outgoing and incoming longwave radiation, and the BBE was calculated from the emissivities of MODIS channels 29, 31, and 32 [47]. Liang et al. [47] stated a theoretical residual standard error of 0.0289 for the BBE, leading to in-situ LST uncertainties of~0.7 K for CBS, TYU, DXI, and SDQ, and~0.9 K for HMO. The in-situ LST total uncertainties caused by measurements and BBE are~0.9 K for CBS, TYU, DXI, and SDQ and~1.1 K for HMO. Additionally, the 3σ (standard deviations, STDs) criterion was used to filter the matched-up CNN LST and in-situ LST and remove the outliers [45,48,49]. As shown in Table 1, the mounting heights of these instruments vary from 3 m to 28 m above the ground surface, resulting in field-of-view (FOV) diameters ranging from 22.39 m to 208.99 m. However, the FOV is very small when compared with the spatial resolutions of AMSR-E and AMSR2. To assess the homogeneities of the MW pixels containing the stations, the percentage of the IGBP classes within the MW pixels were determined (Table 1). Furthermore, the LST differences between the MW pixels (reprojected by spatially averaging from MYD21A1 LST) and the 1-km MODIS pixels containing the ground stations were determined to evaluate thermal heterogeneities within the MW pixels ( Figure 1). The statistical results were based on the MYD21A1 LST data of the time period of the in-situ measurements. For the daytime conditions, the LST of the MW spatial resolution has an underestimation for CBS, overestimation for DXI and SDQ, and no obvious systemic bias for TYU and HMO when compared with the 1-km MODIS LST. The underestimation of CBS may be induced by the elevation distribution, and the overestimations of DXI and SDQ are induced by over 30% of built-up surfaces and barren within the MW pixels, respectively. As for the nighttime conditions, only CBS has an underestimation induced by the elevation distribution. In addition, larger STDs can be seen during the daytime. emissivities of MODIS channels 29, 31, and 32 [47]. Liang et al. [47] stated a theoretical residual standard error of 0.0289 for the BBE, leading to in-situ LST uncertainties of ~0.7 K for CBS, TYU, DXI, and SDQ, and ~0.9 K for HMO. The in-situ LST total uncertainties caused by measurements and BBE are~0.9 K for CBS, TYU, DXI, and SDQ and ~1.1 K for HMO. Additionally, the 3σ (standard deviations, STDs) criterion was used to filter the matched-up CNN LST and in-situ LST and remove the outliers [45,48,49].
As shown in Table 1, the mounting heights of these instruments vary from 3 m to 28 m above the ground surface, resulting in field-of-view (FOV) diameters ranging from 22.39 m to 208.99 m. However, the FOV is very small when compared with the spatial resolutions of AMSR-E and AMSR2. To assess the homogeneities of the MW pixels containing the stations, the percentage of the IGBP classes within the MW pixels were determined (Table 1). Furthermore, the LST differences between the MW pixels (reprojected by spatially averaging from MYD21A1 LST) and the 1-km MODIS pixels containing the ground stations were determined to evaluate thermal heterogeneities within the MW pixels ( Figure 1). The statistical results were based on the MYD21A1 LST data of the time period of the in-situ measurements. For the daytime conditions, the LST of the MW spatial resolution has an underestimation for CBS, overestimation for DXI and SDQ, and no obvious systemic bias for TYU and HMO when compared with the 1-km MODIS LST. The underestimation of CBS may be induced by the elevation distribution, and the overestimations of DXI and SDQ are induced by over 30% of built-up surfaces and barren within the MW pixels, respectively. As for the nighttime conditions, only CBS has an underestimation induced by the elevation distribution. In addition, larger STDs can be seen during the daytime. Figure 1. The LST differences between the MW pixels and the 1-km MODIS pixels containing the ground stations for the daytime (a) and nighttime (b). For CBS, TYU, and DXI, the spatial resolutions of the MW pixels are 0.25°; for SDQ and HMO the spatial resolutions of the MW pixels are 0.1°. The symbol and error bar represent the mean value and STD of the LST difference, respectively. Figure 1. The LST differences between the MW pixels and the 1-km MODIS pixels containing the ground stations for the daytime (a) and nighttime (b). For CBS, TYU, and DXI, the spatial resolutions of the MW pixels are 0.25 • ; for SDQ and HMO the spatial resolutions of the MW pixels are 0.1 • . The symbol and error bar represent the mean value and STD of the LST difference, respectively.

Other Datasets
In addition to the aforementioned datasets, the ESA's GlobTemperature AMSR-E LST (14 × 8 km) [26,27] and the NASA's global land parameter data record (LPDR) air temperatures (ATs) (25 km) [50][51][52][53] are used for intercomparison. DEM data with a spatial resolution of 90 m were downloaded from the Geospatial Data Cloud (http://www.gscloud.cn/). The DEM data were used to filter the valid samples rather than directly as input parameters. More details can be found in Section 3.3.

Neural Networks
Three different neural networks, i.e., NN, DBN, and CNN, were selected for assessments of their performances in the estimation of LST from the aforementioned AMSR-E and AMSR2 observations. Their architectures are shown in Figure 2. NN consists of three layers, including an input layer, hidden layer, and output layer (see Figure 2a). In this study, NN was designed with 2 hidden layers, and the numbers of hidden neurons were set as 64 and 48, respectively. Discussions on this point can be found in Section 5. The output of the hidden layer h and the output layerŷ can be written as follows: where x is the input vector; W 1 is the weight matrix between the input vector and output vector of the hidden layer; b 1 is the bias matrix of the hidden layer; h is the output vector of the hidden layer; W 2 is the weight matrix between the hidden layer and the output layer; b 2 is the bias matrix of output layer; and σ is the activation function. The activation function is used to build a nonlinear model, and the activation function used in this study was the rectified linear unit (ReLU), which has the following formula: the RBM was set as 112 (see discussion). For CNN, the neurons between two adjacent layers are partly connected (see Figure 2c), which is different from NN. A CNN consists of five parts, including an input layer, convolutional layer, pooling layer, fully connected layer, and output layer. Convolutional layer is used to extract local features, and the pooling layer is used to reduce the number of parameters in the network. However, the pooling layer has little effect on the accuracy of CNN [54]. The fully connected layer is used to integrate the local features extracted by the previous convolution layer. In this study, CNN was designed with one convolutional layer and one fully connected layer. The kernel size was set as 1 × 7, the number of convolution kernels was set as 96, and the number of neurons in the fully connected layer was set as 96 (see discussion). The output of the convolutional layer y l and the output layer ŷ can be written as follows: where W l is the convolution kernel weight; x l is the input of the convolutional layer; b l is the bias of the convolutional layer; W 1 and W 2 are the weights of the fully connected layer and output layer, respectively; b 1 and b 2 are the biases of the fully connected layer and output layer, respectively; and σ is the activation function.

Determination of Inputs for the Networks
The MW brightness temperatures measured by satellite sensors can be written as follows [20]: DBN (see Figure 2b) is a generative model that generates training data according to the maximum probability by training the weights between its neurons. DBN is a stack of layers of restricted Boltzmann machine (RBM), which has only two layers of neurons, including the visible layer and hidden layer [29]. The characteristics of RBM are as follows: given the state of the visible units, the activation conditions of each hidden unit are independent; when determining the state of the hidden layer units, the activation conditions of the visible layer units are also independent. The significance of training an RBM is adjusting the parameters of the model to fit the given input data and ultimately making the probability distribution of the visible units consistent with the input data. In an RBM, the activation probabilities of hidden unit h j and visible unit v i can be written as follows: where x is the input data of RBM; W is the RBM weight matrix; b is the bias vector for the visible units; c is the bias vector for the hidden units; and σ is the activation function.
In contrast with NN, DBN uses a layer-by-layer training method to update the network parameters. The first RBM is trained, and then the output of the previous RBM is used as the input of the next RBM. After the training of the RBMs, a fully connected layer is set to obtain the final output. In this study, the number of RBMs was 1 and the number of hidden units in the RBM was set as 112 (see discussion).
For CNN, the neurons between two adjacent layers are partly connected (see Figure 2c), which is different from NN. A CNN consists of five parts, including an input layer, convolutional layer, pooling layer, fully connected layer, and output layer. Convolutional layer is used to extract local features, and the pooling layer is used to reduce the number of parameters in the network. However, the pooling layer has little effect on the accuracy of CNN [54]. The fully connected layer is used to integrate the local features extracted by the previous convolution layer. In this study, CNN was designed with one convolutional layer and one fully connected layer. The kernel size was set as 1 × 7, the number of convolution kernels was set as 96, and the number of neurons in the fully connected layer was set as 96 (see discussion). The output of the convolutional layer y l and the output layerŷ can be written as follows: where W l is the convolution kernel weight; x l is the input of the convolutional layer; b l is the bias of the convolutional layer; W 1 and W 2 are the weights of the fully connected layer and output layer, respectively; b 1 and b 2 are the biases of the fully connected layer and output layer, respectively; and σ is the activation function.

Determination of Inputs for the Networks
The MW brightness temperatures measured by satellite sensors can be written as follows [20]: where the subscript p is the polarization (p ∈ {v, h}); T p , T s , T ↑ a , and T ↓ a are the MW brightness temperature, LST, atmospheric upwelling brightness temperature, and atmospheric downwelling brightness temperature in K, respectively; τ is the atmospheric transmissivity; and ε p is the MW land surface emissivity (LSE).
Equation (5) demonstrates that LSE and three atmospheric parameters are important factors in the estimation of the MW LST. However, these four parameters are difficult to obtain directly. Fortunately, the LSE of MW is closely related to soil moisture, snow, and vegetation [23,55]. Therefore, here, we used surface parameters including NDVI, SC, LCTP, and SM to implicitly present the LSE in the subsequent networks. In addition, atmospheric parameters rely on some easily available meteorological variables, such as the TPWV [20]. Thus, we used the TPWV and AT-2m to implicitly parameterize T ↑ a , T ↓ a , and τ. The reason for including AT-2m is that it is better related to atmospheric radiances in conditions with high air temperature but low atmospheric humidity [38].
Soil temperatures at different depths are important parameters for solving the problem of TSD [23,28,56,57]. Currently, soil temperatures mainly come from reanalysis and assimilation datasets. For GLDAS, the first layer depth of soil temperature is 0-10 cm, which means that the soil temperature of the first layer may be very close to the LST. The accuracy of the LST retrieval model may greatly depend on the accuracy of the soil temperature, and microwave BTs will have small contributions when the soil temperature is used as the input parameter. Considering that MW BT products are more reliable than soil temperature products from reanalysis or assimilation datasets, we did not take soil temperatures as input parameters. In addition to the aforementioned parameters, DOY is used to characterize the changes in LST at the annual scale. Finally, the input parameter set includes BTs (all frequencies and all polarizations), NDVI, SC, LCTP, SM, AT-2m, TPWV, and DOY.

Extraction of the Samples
The overpass time of AMSR2 is not the same as that of Aqua MODIS. This difference in the overpass time prevents the MYD11C1 LST from representing the LST at the overpass time of AMSR2. The daytime and nighttime LST at the overpass time of AMSR2 and Aqua MODIS is close to the maximum and minimum LST of the day, respectively. Furthermore, we found that over 90% of the overpass time differences between Aqua MODIS and AMSR2 over China are less than 20 minutes. Therefore, it is reasonable to assume that MERRA-2 skin temperature changes linearly with time within the hour that is nearest to the overpass time of AMSR2. First, the MERRA-2 surface skin temperatures were interpolated over time to obtain the skin temperature at the overpass times of MODIS (T MERRA-2 (t MODIS )) and AMSR2 (T MERRA-2 (t AMSR2 )). Then, the difference ∆T(t MODIS ) between T MODIS (t MODIS ) and T MERRA-2 (t MODIS ) at the overpass time of MODIS can be obtained by: Assuming that ∆T(t AMSR2 ) is equal to ∆T(t MODIS ), T MODIS (t AMSR2 ) (the LST at the AMSR2 overpass time) can be obtained by: The accuracy of neural networks relies on the validity of the training samples. To ensure the validity of the samples, we used two criteria to filter the valid samples. The first criterion is that the pixel quality is flagged as "good" (LST error < 1 K) and the view zenith angle (VZA) is less than 40 • for all MYD11C1 pixels within the corresponding MW pixel. The second criterion is that the STD of the elevation in MW pixel is less than the threshold (i.e., 100 m for AMSR-E with a spatial resolution of 0.25 • and 10 m for AMSR2 with a spatial resolution of 0.1 • ). Compared with AMSR-E, AMSR2 was filtered more strictly because it has a higher resolution (which means more samples) and lower training sample accuracy (affected by synchronization of MODIS and AMSR2).
The extracted samples were categorized into 4 groups, including daytime AMSR-E (Group I), nighttime AMSR-E (Group II), daytime AMSR2 (Group III), and nighttime AMSR2 (Group IV). Each group contains a training set, a validation set, and a test set. For AMSR-E, the training set, validation set, and test set were the samples from 2003, 2004, and 2009, respectively; for AMSR2, the training set, validation set, and test set were the samples from 2013, 2014, and 2016, respectively. Details for the extracted four groups of samples are shown in Table 2.

Implementation of the Networks
In this study, the three networks performed based on Tensorflow using a GPU with a computation capability of 5.0. Figure 3 shows the flowchart of this study. The process can be divided into three stages. Stage I was to examine the performances of NN, DBN, and CNN and to determine the best performing neural network. The accuracy indices in the comparison included the mean bias deviation (MBD), root-mean-square differences (RMSD), and coefficient of determination (R 2 ) based on the training set, validation set, and test set. For neural networks, more relevant input parameters may lead to better accuracy. However, some parameters may have small contributions and are difficult to obtain directly. Thus, in stage II, different input parameter combinations were examined to get the best combination for the daytime and nighttime. In stage III, the LST estimated from the MW observations with the determined best network and the best combination were validated based on the in-situ LST. Additionally, the latest MYD21A1 LST product, ESA's GlobTemperature AMSR-E LST, and NASA's LPDR ATs were also investigated for intercomparison purposes. Note that the criteria for filtering valid pixels are only applicable to the extraction of samples with the purpose of obtaining an accurate BT-LST conversion relationship. During the comparison in stage III, CNN LST estimates were from all available samples. Before intercomparison, GT LST, LPDR ATs, and MODIS LST were reprojected to the MW spatial resolutions (namely, 0.25 • for AMSR-E and 0.1 • for AMSR2).

Performances of the Different Neural Networks
The statistical results of the three networks on the training set, validation set, and test set are shown in Table 3. Table 3 shows that the LSTs estimated by the three networks have no obvious systematic deviation when compared to the MYD11C1 LST. The differences between the validation RMSDs and test RMSDs were lower than 0.30 K (lower than 0.10 K for Groups I, II, and IV), indicating the stability of the three network models. However, CNN had the lowest RMSDs: the RMSDs on the validation set and test set for CNN were lower by ~0.1 K and ~0.3 K than NN and DBN, respectively, for the four groups. In addition, we found that the accuracies of NN and CNN were extremely close. Therefore, a statistical test was used to evaluate the differences between the NN estimates and CNN estimates. Here, equivalence testing was chosen to reduce the risk of Type I inferential error [58,59]. The null hypothesis H0 was |T1 − T2| ≥ ΔT, and the alternative hypothesis H1 was |T1 − T2| < ΔT. Note that the H0 would be rejected when both one-sided components would be rejected [58]. In this study, ΔT was set as 0.1 K and the test results are shown in Table 4. It can be seen that the CNN estimates were different (null hypothesis is accepted) from the NN estimates for both AMSR-E and AMSR2 during the daytime. In contrast, they were equivalent (null hypothesis is rejected) on the validation and test sets for AMSR-E and the training and validation sets for AMSR2 during the nighttime.

Performances of the Different Neural Networks
The statistical results of the three networks on the training set, validation set, and test set are shown in Table 3. Table 3 shows that the LSTs estimated by the three networks have no obvious systematic deviation when compared to the MYD11C1 LST. The differences between the validation RMSDs and test RMSDs were lower than 0.30 K (lower than 0.10 K for Groups I, II, and IV), indicating the stability of the three network models. However, CNN had the lowest RMSDs: the RMSDs on the validation set and test set for CNN were lower by~0.1 K and~0.3 K than NN and DBN, respectively, for the four groups. In addition, we found that the accuracies of NN and CNN were extremely close. Therefore, a statistical test was used to evaluate the differences between the NN estimates and CNN estimates. Here, equivalence testing was chosen to reduce the risk of Type I inferential error [58,59]. The null hypothesis H 0 was |T 1 − T 2 | ≥ ∆T, and the alternative hypothesis H 1 was |T 1 − T 2 | < ∆T. Note that the H 0 would be rejected when both one-sided components would be rejected [58]. In this study, ∆T was set as 0.1 K and the test results are shown in Table 4. It can be seen that the CNN estimates were different (null hypothesis is accepted) from the NN estimates for both AMSR-E and AMSR2 during the daytime. In contrast, they were equivalent (null hypothesis is rejected) on the validation and test sets for AMSR-E and the training and validation sets for AMSR2 during the nighttime.
Then, NN, DBN, and CNN were further compared for different NDVI-based land cover types and different seasons. We classified the microwave pixels of the test sets into three types based on NDVI [3]: (i) barren land (NDVI < 0.2); (ii) sparsely vegetated (0.2 ≤ NDVI ≤ 0.5); and (iii) densely vegetated (NDVI > 0.5). For the convenience of statistics, the spring represents March, April, and May; the summer represents June, July, and August; the autumn represents September, October, and November; and the winter represents December, January, and February. The accuracies of the three networks for different NDVI-based land cover types and different seasons are shown in Figures 4 and 5, respectively. It can be seen that CNN had the lowest RMSDs in almost all conditions and the RMSDs of NN and DBN were higher than those of CNN by 0.1 K-0.5 K. Then, NN, DBN, and CNN were further compared for different NDVI-based land cover types and different seasons. We classified the microwave pixels of the test sets into three types based on NDVI [3]: (i) barren land (NDVI < 0.2); (ii) sparsely vegetated (0.2 ≤ NDVI ≤ 0.5); and (iii) densely vegetated (NDVI > 0.5). For the convenience of statistics, the spring represents March, April, and May; the summer represents June, July, and August; the autumn represents September, October, and November; and the winter represents December, January, and February. The accuracies of the three networks for different NDVI-based land cover types and different seasons are shown in Figures 4 and 5, respectively. It can be seen that CNN had the lowest RMSDs in almost all conditions and the RMSDs of NN and DBN were higher than those of CNN by 0.1 K-0.5 K.      From the previous analysis, it is clear that CNN outperforms NN and DBN. The training times of NN, DBN, and CNN were related to the number of training samples and were 6, 8, and 12 min for Group I, respectively. Although CNN had the highest time cost among these three networks, it was still acceptable. Therefore, only CNN was employed in further analysis, including the analysis of input parameters, validation based on in-situ LST, and intercomparison with other products.

Determination of the Best Input Parameter Combinations
Considering that BTs of all frequencies are acquired at the same time, we only examined the following input parameters: NDVI, SC, LCTP, SM, TPWV, AT-2m, and DOY. Due to space limitations, only some representative input parameter combinations and their total numbers of parameters are listed in Table 5. In these combinations, combination C0 contains all parameters in the input parameter set; C1-C4 were used for examining the surface parameters; C5-C7 were used for examining AT-2m and TPWV; C8 is for DOY; and C9-C14 were designed based on the results from C1-C8 to determine the best combinations for the daytime and nighttime conditions. The RMSD differences (ΔRMSDs) between combination C1-C14 and C0 for the validation sets and test sets are shown in Figure 6. The best input parameter combinations were selected based on (i) ΔRMSD less than 0.3 K; (ii) fewer input parameters. From the previous analysis, it is clear that CNN outperforms NN and DBN. The training times of NN, DBN, and CNN were related to the number of training samples and were 6, 8, and 12 min for Group I, respectively. Although CNN had the highest time cost among these three networks, it was still acceptable. Therefore, only CNN was employed in further analysis, including the analysis of input parameters, validation based on in-situ LST, and intercomparison with other products.

Determination of the Best Input Parameter Combinations
Considering that BTs of all frequencies are acquired at the same time, we only examined the following input parameters: NDVI, SC, LCTP, SM, TPWV, AT-2m, and DOY. Due to space limitations, only some representative input parameter combinations and their total numbers of parameters are listed in Table 5. In these combinations, combination C0 contains all parameters in the input parameter set; C1-C4 were used for examining the surface parameters; C5-C7 were used for examining AT-2m and TPWV; C8 is for DOY; and C9-C14 were designed based on the results from C1-C8 to determine the best combinations for the daytime and nighttime conditions. The RMSD differences (∆RMSDs) between combination C1-C14 and C0 for the validation sets and test sets are shown in Figure 6. The best input parameter combinations were selected based on (i) ∆RMSD less than 0.3 K; (ii) fewer input parameters. Table 5. Some of the input parameter combinations and the corresponding number of parameters. √ represents the parameter contained in the input parameters, and × is the opposite. The numbers outside (within) the brackets represent the total number of input parameters for AMSR-E (AMSR2).

Combination BTs Surface Parameters Atmospheric Related Parameters DOY Number of Parameters NDVI LCTP SM* and SC
AT-2m TPWV (14) Note: SM* is soil moisture values in four layers. LCTP contains the percent of the 17 IGBP classes.
Remote Sens. 2019, 11, x FOR PEER REVIEW 14 of 28 Table 5. Some of the input parameter combinations and the corresponding number of parameters. √ represents the parameter contained in the input parameters, and × is the opposite. The numbers outside (within) the brackets represent the total number of input parameters for AMSR-E (AMSR2).

Surface Parameters Atmospheric Related Parameters DOY Number of Parameters NDVI LCTP SM* and SC
AT-2m TPWV (14) Note: SM* is soil moisture values in four layers. LCTP contains the percent of the 17 IGBP classes. For the daytime condition, SM and SC have small contributions to the improvement of model accuracy in surface parameters. When the input parameters did not contain SM and SC (i.e., combination C1), the ΔRMSDs were below 0.15 K for both the validation sets and the test sets. If NDVI and LCTP were not removed at the same time, the increase in ΔRMSD was small. For example, on the basis of C1, only LCTP (NDVI) was removed for C2 (C3), and the increases in ΔRMSD were below 0.10 K; however, the increases in ΔRMSD were above 0.25 K when NDVI and LCTP were removed at the same time (C4). AT-2m had a higher impact than TPWV. When For the daytime condition, SM and SC have small contributions to the improvement of model accuracy in surface parameters. When the input parameters did not contain SM and SC (i.e., combination C1), the ∆RMSDs were below 0.15 K for both the validation sets and the test sets. If NDVI and LCTP were not removed at the same time, the increase in ∆RMSD was small. For example, on the basis of C1, only LCTP (NDVI) was removed for C2 (C3), and the increases in ∆RMSD were below 0.10 K; however, the increases in ∆RMSD were above 0.25 K when NDVI and LCTP were removed at the same time (C4). AT-2m had a higher impact than TPWV. When AT-2m was not used as an input parameter (i.e., combination C6), the ∆RMSD was higher than 0.20 K. In addition, DOY (i.e., combination C8) can help improve the accuracies of the daytime models. In these combinations, ∆RMSEs of C10 and C11 were both below 0.30 K. Considering that NDVI is a quantitative parameter and C10 has fewer input parameters, we finally selected C10 as the best combination of input parameters for the daytime models.
For the nighttime conditions, SM, SC, and TPWV have small contributions, and AT-2m has a large contribution. This finding is similar to that under daytime conditions. However, the nighttime NDVI, LCTP, and DOY values have fewer contributions than those in the daytime. For example, the ∆RMSDs of C13 (only BTs and AT-2m) were all below 0.30 K. In these combinations, only C14 had fewer parameters than C13. However, it can be observed that C14 had the RMSDs of 0.5-0.8 K. Therefore, C13 was finally selected as the best combination of input parameters for the nighttime models.
The statistical results of the best input parameter combinations for different NDVI-based land cover types and seasons are shown in Figures 7 and 8. For the three NDVI-based land cover types, the highest accuracies occurred in the densely vegetated pixels and lowest accuracies occurred in the barren land pixels. During the daytime, the RMSDs of the best combinations were 2.19 K, 2.66 K, and 3.58 K for densely vegetated pixels, sparsely vegetated pixels, and barren land pixels for AMSR-E, respectively. The corresponding RMSDs were 1.43 K, 1.73 K, and 2.14 K during the nighttime. For AMSR2, there were higher RMSDs on all land cover types compared to AMSR-E (~0.6 K and~0.3 K for the daytime and nighttime conditions, respectively). The lowest accuracies for the barren land may be concluded by the difficulty in determining the LSE of barren land. Figure 8 shows that the RMSDs of spring and summer are higher than those of autumn and winter for the daytime conditions. Spring and summer are the seasons of vegetation growth, increasing the heterogeneity within the pixels, and, thus, the upscaling method of spatial averaging introduces more uncertainty than other seasons. In contrast, the RMSDs of all seasons are below 2.20 K and 2.80 K for AMSR-E and AMSR2 at night, respectively.
For the nighttime conditions, SM, SC, and TPWV have small contributions, and AT-2m has a large contribution. This finding is similar to that under daytime conditions. However, the nighttime NDVI, LCTP, and DOY values have fewer contributions than those in the daytime. For example, the ΔRMSDs of C13 (only BTs and AT-2m) were all below 0.30 K. In these combinations, only C14 had fewer parameters than C13. However, it can be observed that C14 had the RMSDs of 0.5-0.8 K. Therefore, C13 was finally selected as the best combination of input parameters for the nighttime models.
The statistical results of the best input parameter combinations for different NDVI-based land cover types and seasons are shown in Figures 7 and 8. For the three NDVI-based land cover types, the highest accuracies occurred in the densely vegetated pixels and lowest accuracies occurred in the barren land pixels. During the daytime, the RMSDs of the best combinations were 2.19 K, 2.66 K, and 3.58 K for densely vegetated pixels, sparsely vegetated pixels, and barren land pixels for AMSR-E, respectively. The corresponding RMSDs were 1.43 K, 1.73 K, and 2.14 K during the nighttime. For AMSR2, there were higher RMSDs on all land cover types compared to AMSR-E (~0.6 K and ~0.3 K for the daytime and nighttime conditions, respectively). The lowest accuracies for the barren land may be concluded by the difficulty in determining the LSE of barren land. Figure 8 shows that the RMSDs of spring and summer are higher than those of autumn and winter for the daytime conditions. Spring and summer are the seasons of vegetation growth, increasing the heterogeneity within the pixels, and, thus, the upscaling method of spatial averaging introduces more uncertainty than other seasons. In contrast, the RMSDs of all seasons are below 2.20 K and 2.80 K for AMSR-E and AMSR2 at night, respectively.

Validation Based on In-Situ LST
Based on the best combinations of input parameters determined in Section 4.2, the CNN LST was validated against the in-situ LST. In addition, the MODIS LST derived from the MYD21A1 products was also compared. Since the obtained in-situ measurements were from different years, the validations of the retrieved LST from AMSR-E were based on CBS, TYU, and DXI; the validations of the retrieved LST from the AMSR2 observations were based on SDQ and HMO. The results are shown in Figures 9 and 10.
For CBS, the MODIS LST had a slight overestimation and underestimation in the daytime and nighttime values, respectively. This could be related to the built-up surfaces around the station. Although only ~2% of MW pixels were covered by built-up surfaces, it can be seen from Google Earth that the built-up surfaces were concentrated near the station. The CNN LST is close to the in-situ LST during the daytime and had an evident underestimation during the nighttime: the MBE values were 0.75 K and -3.06 K during the daytime and nighttime, respectively; the corresponding RMSE values were 2.10 K and 3.53 K. The underestimations of the nighttime values are induced by the higher elevation (the highest elevation in MW pixel is 1210 m) at the south of the MW pixel.
For TYU, the MODIS LSTs and the CNN LSTs had no obvious systemic biases in either the daytime or nighttime. The absolute values of the MBEs were all below 0.50 K. The RMSEs of CNN LST were below 3.5 K and 3 K for the daytime and nighttime, respectively. Although the AMSR-E pixel containing the TYU station was not a pure pixel, the main land cover types, i.e., croplands and grasslands, had very similar thermal properties and LSTs. Thus, it is understandable that CNN LST was close to the in-situ LST and only a slight RMSD difference existed between the CNN LST and the MODIS LST.
For DXI, an overestimation of the daytime values and an underestimation of the nighttime values were observed for both the MODIS LST and the CNN LST. This may be induced by the built-up surfaces. Table 1 shows that over 30% of the AMSR-E pixel was covered by built-up surfaces. The MBE values were 1.95 K and -2.84 K for the CNN LST during the daytime and nighttime, respectively. The corresponding RMSE values were 3 K and 3.43 K.
For SDQ, the CNN LST had an RMSE of 4.72 K (MBE is 3.64 K) for the daytime conditions. Over 30% of the AMSR2 pixels were covered by barren, resulting in an overestimation of above 3 K. For the nighttime values, the MODIS LST had good performance: the MBE and RMSE were -0.56 K and 1.32 K, respectively. In contrast, the corresponding values were 1.84 K and 4.08 K for the CNN LST. From Figure 10d, evident overestimation can be observed when LST is less than

Validation Based on In-Situ LST
Based on the best combinations of input parameters determined in Section 4.2, the CNN LST was validated against the in-situ LST. In addition, the MODIS LST derived from the MYD21A1 products was also compared. Since the obtained in-situ measurements were from different years, the validations of the retrieved LST from AMSR-E were based on CBS, TYU, and DXI; the validations of the retrieved LST from the AMSR2 observations were based on SDQ and HMO. The results are shown in Figures 9 and 10.
For CBS, the MODIS LST had a slight overestimation and underestimation in the daytime and nighttime values, respectively. This could be related to the built-up surfaces around the station. Although only~2% of MW pixels were covered by built-up surfaces, it can be seen from Google Earth that the built-up surfaces were concentrated near the station. The CNN LST is close to the in-situ LST during the daytime and had an evident underestimation during the nighttime: the MBE values were 0.75 K and -3.06 K during the daytime and nighttime, respectively; the corresponding RMSE values were 2.10 K and 3.53 K. The underestimations of the nighttime values are induced by the higher elevation (the highest elevation in MW pixel is 1210 m) at the south of the MW pixel.
For TYU, the MODIS LSTs and the CNN LSTs had no obvious systemic biases in either the daytime or nighttime. The absolute values of the MBEs were all below 0.50 K. The RMSEs of CNN LST were below 3.5 K and 3 K for the daytime and nighttime, respectively. Although the AMSR-E pixel containing the TYU station was not a pure pixel, the main land cover types, i.e., croplands and grasslands, had very similar thermal properties and LSTs. Thus, it is understandable that CNN LST was close to the in-situ LST and only a slight RMSD difference existed between the CNN LST and the MODIS LST.
For DXI, an overestimation of the daytime values and an underestimation of the nighttime values were observed for both the MODIS LST and the CNN LST. This may be induced by the built-up surfaces. Table 1 shows that over 30% of the AMSR-E pixel was covered by built-up surfaces. The MBE values were 1.95 K and -2.84 K for the CNN LST during the daytime and nighttime, respectively. The corresponding RMSE values were 3 K and 3.43 K.
For SDQ, the CNN LST had an RMSE of 4.72 K (MBE is 3.64 K) for the daytime conditions. Over 30% of the AMSR2 pixels were covered by barren, resulting in an overestimation of above 3 K. For the nighttime values, the MODIS LST had good performance: the MBE and RMSE were -0.56 K and 1.32 K, respectively. In contrast, the corresponding values were 1.84 K and 4.08 K for the CNN LST. From Figure 10d, evident overestimation can be observed when LST is less than 270 K. Based on the discriminant function algorithm (DFA) proposed by Wang et al. [60], we found that this overestimation is due to the frozen soil. Frozen soil has higher emissivity than unfrozen soil. Thus, higher BTs can be observed when soil freezing occurs [61], resulting in an overestimation of the CNN LST. significant outliers compared with the CNN LSTs under clear-sky conditions in all cases, indicating that CNN models established with clear-sky samples have good ability when extended to cloudy conditions. Overall, the CNN LST agrees well with the in-situ LST, with R 2 values ranging from 0.94 to 0.98. Although the RMSE of CNN LST was above 4 K in a few cases, the RMSE was mainly due to the different land cover types and elevation changes within the MW pixels.  For HMO, both the MODIS LST and the CNN LST have high systematic overestimation (above 4 K) during the daytime and have better accuracy at nighttime (RMSE was 1.39 K for the MODIS LST and 2.63 K for the CNN LST). The significant overestimation of the CNN LST during the daytime may be induced by the overestimation of the MODIS LST, which was the basis for the training of the CNN models. In addition, an overestimation also occurred when the LST was less than 270 K at night, which is similar to that at SDQ.
The validation results of the CNN LST under clear-sky conditions and cloudy conditions are also shown in Figures 9 and 10. It can be seen that the RMSE differences of the clear-sky conditions and the cloudy conditions were lower than 0.5 K in most cases. Exceptions occurred during the daytime of HMO and the nighttime of SDQ. The reason was the uneven distribution of the clear-sky and the cloudy samples. However, the CNN LSTs under cloudy conditions do not show significant outliers compared with the CNN LSTs under clear-sky conditions in all cases, indicating that CNN models established with clear-sky samples have good ability when extended to cloudy conditions. Overall, the CNN LST agrees well with the in-situ LST, with R 2 values ranging from 0.94 to 0.98. Although the RMSE of CNN LST was above 4 K in a few cases, the RMSE was mainly due to the different land cover types and elevation changes within the MW pixels.

Intercomparison with GlobTemperature AMSR-E LST and LPDR Air Temperature
After the validations with the in-situ LST, the CNN LST was intercompared with the AMSR-E LST provided by the GlobTemperature (hereafter termed GT LST) [26,27]. Since GT LST was only available from 2008 to 2010, the CNN LST, GT LST, and MODIS LST were intercompared based on the data of 2009 for AMSR-E ( Figure 11). The pixel percentages of the RMSD difference between CNN LST and GT LST in different ranges over China are shown in Table 6. The statistical results show that ~50% of the pixels of the CNN LST have smaller RMSDs and only ~20% of the pixels of the CNN LST have larger RMSDs than GT LST. It can be observed that the pixels with significant underestimation for CNN LST are concentrated on the boundaries of the Tibetan Plateau. This phenomenon could be partly due to snow. Based on MYD10C1 product, we found that these pixels are frequently covered with snow, and snow pixels tend to have larger bias [27]. By analyzing the BTs of these pixels, we found an evident underestimation for the BTs, especially at high frequency (i.e., 36.5 and 89.0 GHz), resulting in the underestimation of the CNN LST for snow pixels. This corresponds to the strong volume scattering of snow [27,62].
In addition, the overpass times of AMSR-E and AMSR2 were both ~13:30 and 01:30 local solar time, which are close to the time when daily maximum and minimum AT appear [12]. Fily et al. [16] showed LST was close to AT at surface level over dense vegetation. Thus, a comparison between the CNN LST and the LPDR ATs was also performed. Figure 12 shows maps of RMSDs and R 2 between the CNN LST and the LPDR ATs. It can be observed that the CNN LST agrees well with the LPDR ATs for most pixels (over 80% of pixel R 2 values are above 0.8) and there are small RMSDs for pixels with high vegetation coverage (e.g., the average RMSDs of South China and Northeast China are less than 3 K). This phenomenon further confirms that the CNN LST and

Intercomparison with GlobTemperature AMSR-E LST and LPDR Air Temperature
After the validations with the in-situ LST, the CNN LST was intercompared with the AMSR-E LST provided by the GlobTemperature (hereafter termed GT LST) [26,27]. Since GT LST was only available from 2008 to 2010, the CNN LST, GT LST, and MODIS LST were intercompared based on the data of 2009 for AMSR-E ( Figure 11). The pixel percentages of the RMSD difference between CNN LST and GT LST in different ranges over China are shown in Table 6. The statistical results show that~50% of the pixels of the CNN LST have smaller RMSDs and only~20% of the pixels of the CNN LST have larger RMSDs than GT LST. It can be observed that the pixels with significant underestimation for CNN LST are concentrated on the boundaries of the Tibetan Plateau. This phenomenon could be partly due to snow. Based on MYD10C1 product, we found that these pixels are frequently covered with snow, and snow pixels tend to have larger bias [27]. By analyzing the BTs of these pixels, we found an evident underestimation for the BTs, especially at high frequency (i.e., 36.5 and 89.0 GHz), resulting in the underestimation of the CNN LST for snow pixels. This corresponds to the strong volume scattering of snow [27,62].
In addition, the overpass times of AMSR-E and AMSR2 were both~13:30 and 01:30 local solar time, which are close to the time when daily maximum and minimum AT appear [12]. Fily et al. [16] showed LST was close to AT at surface level over dense vegetation. Thus, a comparison between the CNN LST and the LPDR ATs was also performed. Figure 12 shows maps of RMSDs and R 2 between the CNN LST and the LPDR ATs. It can be observed that the CNN LST agrees well with the LPDR ATs for most pixels (over 80% of pixel R 2 values are above 0.8) and there are small RMSDs for pixels with high vegetation coverage (e.g., the average RMSDs of South China and Northeast China are less than 3 K). This phenomenon further confirms that the CNN LST and AT have similar annual variations and are close over dense vegetation. Therefore, it can be concluded that CNN LST estimates have high accuracy.

Discussion
Considering that the comparison of the three networks should be performed when they are optimal respectively, we examined the structural parameters of NN, DBN, and CNN. The  Table 6. The pixel percentages of the RMSD difference between the CNN LST and the GT LST in different ranges: (i) ∆RMSD < -0.5 K represents CNN LST has better accuracy; (ii) −0.5 K ≤ ∆RMSD ≤ 0.5 K represents CNN LST has similar accuracy to GT LST; (iii) ∆RMSD > 0.5 represents GT LST has better accuracy.

Discussion
Considering that the comparison of the three networks should be performed when they are optimal respectively, we examined the structural parameters of NN, DBN, and CNN. The

Discussion
Considering that the comparison of the three networks should be performed when they are optimal respectively, we examined the structural parameters of NN, DBN, and CNN. The structural parameters were examined with the samples from Group I. The kernel size of the CNN was first examined, and the examination results based on the validation set are shown in Figure 13. As the size of the convolution kernel increased, the RMSD first decreased significantly; when the kernel size was greater than 1 × 7, the RMSD remained relatively stable. The MBDs did not exhibit a clear pattern but were lower than 0.15 K. Therefore, 1 × 7 was designated as the eventual kernel size of the CNN model. structural parameters were examined with the samples from Group I. The kernel size of the CNN was first examined, and the examination results based on the validation set are shown in Figure  13. As the size of the convolution kernel increased, the RMSD first decreased significantly; when the kernel size was greater than 1 × 7, the RMSD remained relatively stable. The MBDs did not exhibit a clear pattern but were lower than 0.15 K. Therefore, 1 × 7 was designated as the eventual kernel size of the CNN model. The examination results for layers are shown in Figure 14. The DBN models with two or three RBMs had similar accuracies to one RBM with more than 112 hidden units. For NN and CNN, the accuracies of the two-layer models had an increase of more than 0.10 K compared to the one-layer models. However, the accuracies of three-layer models were almost the same as those of two-layer models and were sometimes even lower. This situation shows that two hidden layers for NN, one RBM for DBN, and one convolutional layer and one fully connected layer for CNN are sufficient to describe the nonlinear relationship between the LST and the input parameters. For NN, the numbers of hidden neurons were designated as 64 and 48 for the first hidden layer and the second hidden layer, respectively. For DBN, the number of hidden units was designated as 112 for the RBM. For CNN, the number of convolution kernels was designated as 112, and the number of neurons in the fully connected layer was designated as 64.
In general, the accuracies for AMSR2 were lower than those for AMSR-E. The possible reasons are: (i) AMSR2 observations had higher spatial resolution than AMSR-E, resulting in greater uncertainty for AMSR2 samples during the process of spatial matching with MERRA-2 product and GLDAS product, and (ii) there still existed deviation between the synchronized MYD11C1 LST and "true" LST at the overpass time of AMSR2. The nighttime models had better accuracies than the daytime models, and the nighttime RMSDs were more than 1 K lower than those of the daytime RMSDs. The reason is that the daytime data are noisier than the nighttime data.
The best combinations of input parameters are BTs, NDVI, AT-2m, and DOY for the daytime conditions and BTs and AT-2m for the nighttime conditions. The small contributions of SM, SC, and TPWV can be explained by the fact that these parameters can be expressed by the MW BTs [51,62]. NDVI can implicitly characterize the proportion of different land cover types within a pixel, and it is also the reason why NDVI and LCTP have similar contributions. Thus, taking NDVI as an input can improve the accuracies of the daytime models. In contrast, the different The examination results for layers are shown in Figure 14. The DBN models with two or three RBMs had similar accuracies to one RBM with more than 112 hidden units. For NN and CNN, the accuracies of the two-layer models had an increase of more than 0.10 K compared to the one-layer models. However, the accuracies of three-layer models were almost the same as those of two-layer models and were sometimes even lower. This situation shows that two hidden layers for NN, one RBM for DBN, and one convolutional layer and one fully connected layer for CNN are sufficient to describe the nonlinear relationship between the LST and the input parameters. For NN, the numbers of hidden neurons were designated as 64 and 48 for the first hidden layer and the second hidden layer, respectively. For DBN, the number of hidden units was designated as 112 for the RBM. For CNN, the number of convolution kernels was designated as 112, and the number of neurons in the fully connected layer was designated as 64.
In general, the accuracies for AMSR2 were lower than those for AMSR-E. The possible reasons are: (i) AMSR2 observations had higher spatial resolution than AMSR-E, resulting in greater uncertainty for AMSR2 samples during the process of spatial matching with MERRA-2 product and GLDAS product, and (ii) there still existed deviation between the synchronized MYD11C1 LST and "true" LST at the overpass time of AMSR2. The nighttime models had better accuracies than the daytime models, and the nighttime RMSDs were more than 1 K lower than those of the daytime RMSDs. The reason is that the daytime data are noisier than the nighttime data. possible reason is that the LST of a whole year can be divided into three temporal components (i.e., the annual temperature component-ATC, the diurnal temperature component-DTC, and the weather-change temperature component-WTC) [3,56], and ATC is a function of DOY. In contrast, the small contributions of DOY to the nighttime models can be concluded to the smaller residuals of the models themselves.  The "conv layer" and "fc layer" mean the convolutional layer and fully connected layer, respectively.
The best combinations of input parameters are BTs, NDVI, AT-2m, and DOY for the daytime conditions and BTs and AT-2m for the nighttime conditions. The small contributions of SM, SC, and TPWV can be explained by the fact that these parameters can be expressed by the MW BTs [51,62]. NDVI can implicitly characterize the proportion of different land cover types within a pixel, and it is also the reason why NDVI and LCTP have similar contributions. Thus, taking NDVI as an input can improve the accuracies of the daytime models. In contrast, the different land cover types within the pixels have similar thermal properties and LSTs at nighttime. Hence, it is understandable that NDVI has a small contribution to the nighttime models. As for DOY, it is interesting that we found DOY can improve the accuracies of the daytime models. However, DOY is irrelevant to Equation (5). The relationship between the MODIS LST (T MODIS ) and the CNN LST (f i (x i )) can be written as: where x i is the input parameters of combination C i ; δ i is the residual corresponding to C i and denotes the tiny fraction that cannot be interpreted by input parameters; and f i is the conversion relationship established by CNN based on C i . The purpose of inputting multiple parameters is to decompose the residuals as much as possible. Generally, the component temperature differences between barren land and vegetation within the pixels are larger in summer compared to those in winter. Therefore, it is understandable that the component temperature differences can be expressed as a function with DOY as the independent variable. In other words, DOY helps to characterize the component temperature differences within the pixels. This information has potential value for the LST retrieval, and, thus, can help decompose the residuals. Another possible reason is that the LST of a whole year can be divided into three temporal components (i.e., the annual temperature component-ATC, the diurnal temperature component-DTC, and the weather-change temperature component-WTC) [3,56], and ATC is a function of DOY. In contrast, the small contributions of DOY to the nighttime models can be concluded to the smaller residuals of the models themselves.
It is noticeable that this study was not the first time that CNN was applied to estimate the MW LST. As described in Section 1, Tan et al. [34] constructed the CNN model. However, only BTs were used as inputs (i.e., C14 in this study). Figure 6 shows that the accuracy of C14 is lower by 0.8 K during the daytime and 0.3 K during the nighttime than the best combination. Thus, it can be concluded that adding some auxiliary data (e.g., AT-2m) on the basis of BTs can significantly improve the accuracies of the retrieval models.
In this study, the ground station FOVs were greatly different from the spatial resolutions of the MW pixels. To quantify the representativeness errors introduced by the scale mismatching between the ground stations and the MW pixels, we calculated the errors based on the biases and STDs from Figure 1 and the validation results of the MODIS LST. Similar to Huang et al. [63], by assuming that MODIS LST is the "true LST" on the 1-km scales, the representativeness error for each station is given by: where MBE GtoMW , MBE Gto1KM , and MBE 1KMtoMW are systematic errors introduced by the scale mismatching between the ground station and the MW pixel, between the ground station and the 1-km MODIS pixel, and between the 1-km MODIS pixel and the MW pixel, respectively; STD GtoMW , STD Gto1KM , and STD 1KMtoMW are the corresponding STD values. The statistical results of the representativeness errors of the five stations and the corresponding validation results of the CNN LST are shown in Table 7. It can be seen that in most cases, the MBE CNN and STD CNN were close to the MBE GtoMW and STD GtoMW (MBE differences and STD differences were both lower than 1 K), respectively. Exceptions occurred in the nighttime of SDQ and HMO, and this phenomenon was caused by frozen soil (see Section 4.3). Therefore, Table 7 further confirms that the main reason for the errors of the CNN LST against in-situ LST is the scale mismatching between the ground stations and the MW pixels. For CNN, we only tried to convolve on the input vector of a single pixel. Another method to build the CNN model is to use the parameter images (e.g., BT images and NDVI images) of the entire study area as inputs. In this situation, the spatial relationships between the adjacent pixels may be helpful in improving model accuracy. Nevertheless, one should keep in mind that it is difficult to apply this method: MW images have coarse spatial resolutions, resulting in weak spatial connections between the adjacent pixels (which means a small contribution to the LSTs of the surrounding pixels). In addition, accurate and cloud-free LST images for large study areas are usually not available, which limits the extraction of LST for the training samples. Further studies to incorporate the spatial relationships in the estimation of LST from MW observations are needed.
A well-recognized feature of neural networks is the 'black box' problem, which means that it is difficult to understand the physical mechanisms for obtaining the output based on the inputs. The opacity of neural network makes it hard to further improve the accuracy of the neural network.
For the estimation of LST from MW observations, three approaches may contribute to the improvement of the accuracy. The first approach is to increase the sample size. Larger sample size would be helpful to reduce overfitting and, thus, improve the accuracy of the output. In this study, the upscaling method for the MODIS LST was spatial averaging, which may introduce uncertainty to the LST of the training samples. Therefore, the second approach is to improve the accuracy of the training samples. The third approach is to incorporate the physical models into the neural networks by using the physical models as the boundaries of the networks, to provide initial guesses, and/or to construct new input features for the networks. For example, the thermal sampling depth correction model developed by Zhou et al. [57] may be incorporated to help neural networks better estimate the LST over barren land.
Finally, there still exist some issues in this study. The first one is the training of the neural network was performed with the MODIS LST product. Although the MODIS LST has high accuracy, the uncertainty of MODIS LST will inevitably increase when reprojected to match with the MW pixels. Therefore, an accurate upscaling method related to land cover types and elevation distribution within the MW pixels is critical for reducing the uncertainty in the spatial matching process. The second one is the scale mismatch between the ground stations and the MW pixels. This issue is critical for the validation of low-resolution LST. Thus, it is essential to find an evaluation method of spatial representativeness of the ground station to realize the conversion of in-situ LST from station FOV to coarser resolution.

Conclusions
This study examined the performances of NN (i.e., the traditional neural network) and two deep learning methods (i.e., DBN and CNN) in the estimation of LST from satellite MW observations. The input parameter set for these networks included BTs, soil moistures, NDVI, snow cover, land cover type percent, air temperature at 2 m above the ground surface, total precipitable water vapor, and DOY. Based on the training, validation, and test sets derived from the MODIS LST, microwave BTs, and the aforementioned input parameters, the results demonstrate that CNN outperformed NN and DBN. The LSTs estimated by CNN from the AMSR-E and AMSR2 data were closer to MODIS LST: the RMSD values were 3 K during the daytime and 1.74 K at night for AMSR-E; the corresponding RMSD values were 3.48 K and 2.10 K for AMSR2. In contrast, the RMSDs of NN and DBN were higher by 0.1 K and 0.4 K, respectively. Additionally, CNN was more prominent than NN and DBN for different land cover types and seasons. Therefore, it should be concluded that CNN performs better than the NN and DBN in the estimation of LST from satellite MW observations.
More details for the impacts of the input parameters on the performances of CNN were obtained. Among the surface parameters used to implicitly parameterize the LSE, the NDVI and land cover type percent have greater impacts than the other parameters during the daytime, and NDVI and land cover type percent cannot be removed at the same time; nevertheless, their impacts decrease at nighttime. For the two parameters used to implicitly quantify the atmospheric effects, the air temperature plays a more important role than the total precipitable water vapor. In addition, DOY is helpful in improving the accuracy of the CNN model in the daytime. Therefore, the best combinations of input parameters are the BTs, NDVI, air temperature, and DOY for the daytime and BTs and air temperature for the nighttime. The CNN LST estimate from AMSR-E with the best combinations yields RMSDs of 2.19-3.58 K for daytime and 1.43-2.14 K for nighttime for diverse land cover types.
Thorough validation of the LST estimates of CNN was conducted based on the in-situ LST. The RMSEs range from 2.10 K to 5.34 K during the daytime and 2.63 K to 4.08 K during the nighttime. The CNN LST agrees well with the in-situ LST, with R 2 values ranging from 0.94 to 0.98. Although the accuracy of the CNN LST is not as satisfactory as the TIR LST in a few cases, the differences between the CNN LSTs and the in-situ LSTs are mainly due to the different land cover types and elevation distributions within the MW pixels containing the stations. Further intercomparison indicates that thẽ 50% CNN LST estimates are closer to MODIS LSTs than ESA's GlobTemperature AMSR-E LST and the average RMSDs are less than 3 K over dense vegetation compared to NASA's LPDR ATs. Findings