Towards a Unified and Coherent Land Surface Temperature Earth System Data Record from Geostationary Satellites

Our objective is to develop a framework for deriving long term, consistent Land Surface Temperatures (LSTs) from Geostationary (GEO) satellites that is able to account for satellite sensor updates. Specifically, we use the Radiative Transfer for TOVS (RTTOV) model driven with Modern-Era Retrospective Analysis for Research and Applications (MERRA-2) information and Combined ASTER and MODIS Emissivity over Land (CAMEL) products. We discuss the results from our comparison of the Geostationary Operational Environmental Satellite East (GOES-E) with the MODIS Land Surface Temperature and Emissivity (MOD11) products, as well as several independent sources of ground observations, for daytime and nighttime independently. Based on a six-year record at instantaneous time scale (2004–2009), most LST estimates are within one std from the mean observed value and the bias is under 1% of the mean. It was also shown that at several ground sites, the diurnal cycle of LST, as averaged over six years, is consistent with a similar record generated from satellite observations. Since the evaluation of the GOES-E LST estimates occurred at every hour, day and night, the data are well suited to address outstanding issues related to the temporal variability of LST, specifically, the diurnal cycle and the amplitude of the diurnal cycle, which are not well represented in LST retrievals form Low Earth Orbit (LEO) satellites.


Introduction
Land surface temperature is an important climate parameter due to its control of the components of the surface energy budget, such as turbulent heat and moisture fluxes, and upward terrestrial radiation [1]. For climate applications, information is needed on large scales, and ideally, the diurnal cycle needs to be resolved. In this study, we develop an approach to derive information on LST which is applicable to the GOES satellites across multiple missions and multiple satellite sensors. We report on results obtained during the period (2004)(2005)(2006)(2007)(2008)(2009) at hourly time intervals, at about 5-km spatial resolution.
Since surface ground observations are limited, shelter temperature has been widely used as a proxy to surface skin temperature to meet large-scale needs. Issues emerging from such an approach have been addressed previously [2]. While observations from satellites are considered useful for inferring LST, only a few satellite sensors observe all the necessary parameters needed to derive LST with high accuracy. Some lack sufficient number of channels to account simultaneously for atmospheric effects (as needed for implementing the "split window" approach) [3][4][5][6][7]. Others do not observe the Earth surface at high frequency to resolve the diurnal cycle, or at high spatial resolution, to minimize the presence of clouds.
Information on surface emissivity is also not readily available at sufficient spectral resolution [8,9]. Moreover, land surface emissivity is generally less than one, and therefore, part of the atmospheric downward radiation is reflected by the surface and has to be accounted for [10] when converting ground observations of radiative flux measurements to LST (which is not always done). The number of successful attempts to derive LST from satellites has been substantial (especially using the well-established split window approach). A full review of what was done is beyond the scope of this paper, but a comprehensive summary can be found in Li et al. [11], and is briefly presented below.
The early effort to retrieve LST from satellites over agricultural land made by Price [3] was done by adopting the Advanced Very High Resolution Radiometer (AVHRR) Sea Surface Temperature (SST) split window algorithm [12,13]. Becker and Li [14] extended the split window method of McMillin [15] for SST to LST and took into account the spectral variability in land surface emissivity. This so called "generalized split window" LST algorithm has been widely used. Additional efforts include the work of Prata, [16], Sobrino et al. [7], Wan and Dozier [5], Francois and Ottle [17], Coll and Caselles [18], Trigo et al. [19], and Wan [20]. The approach for accounting for emissivity has evolved from assignments based on land use to the use of the Normalized Difference Vegetation Index (NDVI) [21][22][23]; however, the NDVI concept is not applicable for every surface type. Currently, surface emissivity is derived from the Advanced Space Thermal Emission and Reflection Radiometer (ASTER), the Thermal Infrared (TIR) Multispectral Scanner (TIMS), and the Moderate-resolution Imaging Spectroradiometer (MODIS) [24][25][26][27][28][29] culminating in the Combined ASTER and MODIS Emissivity over Land (CAMEL) product to be used here [30,31].
Most of the above referenced studies focus on polar orbiting satellites such as the National Oceanic and Atmospheric Administration (NOAA)-AVHRR, the Along-Track Scanning Radiometer (ATSR) and the MODIS instrument aboard Terra and Aqua satellites; the temporal measurement frequency of these satellites is approximately twice per day. The Land Surface Diurnal Temperature Range (DTR) is an important element of the climate system and is not captured by the polar orbiting satellites. Geostationary satellites provide diurnal coverage and observe the surface continuously at a nadir pixel resolution of about 4 km [32] which led to the development of several algorithms for GEO satellites [33][34][35].
While the principles of retrieval methodologies have not changed drastically over time, the development in auxiliary information, quality of such information, and availability of long term records of satellite observations make it feasible to formulate a homogeneous approach across various satellite sensors that can culminate in climatic records of LST. The primary objective of this study is to present a methodology that can be implemented with different GOES observing systems, using consistent auxiliary information of highest possible quality and utilizing radiative transfer models that account for the vertical profiles of atmospheric states for each retrieval. From mid-2004 to 2017, only one thermal channel is available on the GOES series; the focus of this study is on retrievals using such a single channel in order that a consistent, long term record can be generated from all the GOES satellites (including those that allow the implementation of the split window approach). In Section 2, materials used are described; retrieval algorithm development is presented in Section 3; evaluation of GOES-E based LST estimates is presented in Section 4; a discussion is provided in Section 5; and a summary is presented in Section 6.

GOES Satellite Data
The GOES system is operated by the National Oceanic and Atmospheric Administration, National Environmental Satellite, Data and Information Service (NESDIS). The GOES system is based on the use of satellites designed to operate at an orbit of 35,790 km above the earth, remaining stationary to a given point on the ground. The GOES provides data at high temporal frequency (15 min) with continental-scale coverage (N. and S. America). In this study observations from GOES-12 (4/1/2003-4/14/2010) will be utilized (Table 1). Typically, the GOES imager includes five spectral channels (one visible, four infrared). For GOES 8-10 the channels are located at 3.9, 6.75, 10.7, and 12.2 µm whereas for GOES 11-15, the 6.75 µm channel was moved to 6.5 µm and the 12 µm channel was moved to 13.3 µm. The visible, mid-infrared and 11 µm band are typically used for cloud screening while the two thermal infra-red (TIR) bands (10.2-11.2 µm and 11.5-12.5 µm) are used in what is known as a "split-window" approach to retrieve LST. Note: GOES-8 information is provided since cloud algorithm was originally developed for GOES-8. Channel 2 is separated into the reflected solar radiation component (R2) and the emitted infrared radiation component (T2) [52]; Channel 3 is not used in any of the cloud screening tests.
Aspects of GOES satellite systems that need to be addressed before deriving LST include the spectral characteristics of the GOES sensors and their filter functions, calibration of visible and IR channels, cloud screening methodology (that requires snow analysis information) and the selection of cloud screening tests as appropriate for each satellite configuration, with special distinction between night-time and day-time conditions. Over the period of this study, two separate operational GOES Imagers, one located at a longitude of −75 • (referred to as GOES-East) and one located at a longitude of −135 • (referred to as GOES-West) continuously provided imagery over North and South America. In this study, observation from GOES-East only will be utilized for years [2004][2005][2006][2007][2008][2009]. The temporal sampling of the GOES Imager is every 30-min over North America and every 3-h over the full disk. Spectral distribution of the GOES 8-15 series are provided in NOAA NESDIS STAR GOES Imager LST ATBD (Version 3.0).

Visible and Thermal Channel Calibration
Visible channels are used in the cloud screening part of data processing. Their calibration is done in two stages. First, applied are the pre-launch calibration coefficients to get the first step "nominal reflectance" (Apre), and then, the post-lunch calibration coefficients are applied to obtain the final calibrated nominal reflectance (Apost).
The nominal reflectance is defined as the ratio of reflected radiance to nominal solar irradiance given as: where pi = 3.141593, R is satellite observed upwelling radiance, and F 0 is the solar irradiance at local zenith and mean Sun-Earth distance. The pre-launch nominal reflectance is: where k is the pre-launch calibration constant to convert satellite observed digital counts to nominal reflectance. X is the instrument raw digital count, Xspace is the raw count of the space scene (has been adjusted to 29 for all GOES imagers at NOAA). The Post-lunch calibration is applied by multiplying the Apre by a coefficient: The pre-launch coefficient for GOES-12 for radiance calibration was 0.5771 and for nominal reflectance it was 0.001141.
NOAA/STAR monitors the GOES imager and updates the post-launch coefficients every month. Coefficients are considered optimal for the days on or after the second Tuesday of the month following the coefficient generation month. For more details, see: http://www.star.nesdis.noaa.gov/smcd/spb/ fwu/homepage/GOES_Imager_Vis_OpCal.php.
Inter-calibration of the infrared channels on the GOES series of satellites has been performed using under-passes of the well calibrated NASA Atmospheric Infrared Sounder (AIRS) sensor on the Aqua platform [36]. Infrared imager data from GOES are stored in GOES Variable Format (GVAR) counts and radiances can be derived from GVAR counts by applying the calibration and scaling coefficients using a procedure described in [37].
Calibration of IR channels of GOES imager data is also done in two stages: (1) converting the imager GVAR raw count to scene radiance; (2) converting the radiance to temperature. To convert a 10-bit GVAR count to scene radiance we use: where X is the GVAR raw count (10-bit, range from 0 to 1023), m and b are calibration coefficients. The values for m and b depend on channel selected, but are constant for a given channel. The obtained radiance can be convert to effective temperature (K) by inverse of the Planck function: T eff = C 2 *n/ln (1 + C 1 *n 3 /R) ], C 2 = 1.438833 (K/cm −1 ), "n" is the central wave-number of the channel and varies from instrument to instrument. The effective temperature T eff is further converted to actual temperature by: where "a", "b" and "g" are coefficients and their values and central wave-numbers can be found at: https://www.ospo.noaa.gov/Operations/GOES/calibration/gvar-conversion.html#radiance. Evaluation of calibration was done by comparison of radiances in each channel with NOAA values. CAMEL ESDR is produced globally at 5-km resolution at mean monthly time-steps and for 13 bands from 3.6-14.3 micron and extended to 417 bands using a Principal Component (PC) regression approach. This product has been used in our retrievals of LST.

MERRA-2 Data
The 6 hourly MERRA-2 re-analysis data (MERRA-2 inst6_3d_ana_Np version 5.12.4) (https: //disc.gsfc.nasa.gov/datasets/M2I6NPANA_V5.12.4/summary) are used to specify the atmospheric conditions [38]. The re-analysis data are available at [00, 06, 12, 18] hours with a resolution of 0.5 • × 0.625 • , for 42 pressure levels. Properties include sea-level pressure, surface pressure, geopotential height, air temperature, wind components, and specific humidity. Since the LST is retrieved at hourly time scale, the MERRA-2 data are linearly interpolated in time to give the atmospheric state in between the four analysis times (as needed for input to RTTOV) and linearly interpolated to 0.05 • × 0.05 • in space to match the formulation of this study.

MOD11
The GOES based LST estimates will be evaluated against LST retrievals from MOD11 Version 6 Land Surface Temperature and Emissivity product [20] (we use both MOD11_L2 data and MOD11C3 data). The MODIS LST data products are produced as a series of nine products. The sequence begins as a swath at a nominal pixel spatial resolution of 1 km at nadir and a nominal swath coverage of 2030 or 2040 lines along track by 1354 pixels per line in the daily LST product [39]. There are two algorithms used in the daily MODIS LST processing: the generalized split-window LST algorithm [5] and the day/night LST algorithm. New refinements made to these two algorithms are described by Wan [20]. The MOD11_L2 version 6 swath product provides per-pixel land surface temperature (LST) and emissivity. It is produced daily in 5-min temporal increments of satellite acquisition and has a pixel size of 1 km. The MOD11C3 Version 6 product provides monthly land surface temperature (LST) and emissivity values in a 0.05 (5600 m × 5600 m) degree latitude/longitude climate modeling grid (CMG), which has a Geographic grid with 7200 columns and 3600 rows representing the entire globe. The MOD11C3 granule consists of day and night LST and their corresponding quality indicator (QC) layers.

Ground Observations
• SURFRAD/BSRN NOAA established the Surface Radiation Budget Network (SURFRAD) in 1993 [40] to support climate research by providing accurate, continuous, long-term measurements of the surface radiation budget over the United States. These became the continental U.S. contingent of the International Baseline Surface Radiation Network (BSRN) [41] as described in [42]. . BSRN sites that provide data at 1 or 3-min frequency, which makes them suitable for generating information to match the satellite observations. The upwelling and downwelling longwave radiative fluxes are measured with a precision infrared radiometer, which is sensitive in the spectral range from 3000 to 50,000 nm. The instrument used to observe the skin temperature is the Infrared Thermometer (IRT). It is a ground-based radiation pyrometer that measures the equivalent blackbody brightness temperature of the scene in its field of view. The downwelling version has a narrow field of view for measuring sky temperature and detecting clouds.
The upwelling version has a wide field of view for measuring the narrowband radiating temperature of the ground surface (https://www.arm.gov/capabilities/instruments/irt). Time series and scatter plots are produced and inspected to compare surface temperature measured by the IRT and a precision infrared radiometer (PIR). The temperature measuring range is from 173 to 473 K. The accuracy is the greater value of a) ±0.5 K + 0.7% of the temperature difference between the internal reference temperature and the object measured or b) the temperature resolution. The spectral sensitivity is from 9.6 to 11.5 µm [43]. •

Oklahoma MESONET
The Oklahoma MESONET is an automated network of over 110 remote, meteorological stations across Oklahoma (http://www.mesonet.org). The surface types are predominantly grassland/wooded, grassland/cropland. In 1999, infrared temperature (IRT) sensors (Apogee Instruments, Inc.) were installed at 89 of the MESONET sites. A combination of automated and manual tests was applied using simultaneous soil and atmospheric measurements to inter-compare observations and ensure that the skin temperature observations are of research quality [44]. The measurements collected by the MESONET provided a unique opportunity to inter-compare observations. Fiebrich et al. [45] provide an evaluation of 5-min-resolution field measurements collected using the sensors. This sensor was chosen for use because it is water resistant and was designed for continuous outdoor use. Sensor The sensor is installed at a height of 1.5 m and has a field of view of a diameter circle of 0.5 m. The energy detected by the sensor is converted to a temperature using the Stefan-Boltzmann law and an assumed surface emissivity of 1.0. Slight underestimation is caused because the true emissivity of the land surface is less than 1.0. In addition, slight overestimation is caused by reflected longwave radiation from the target [46]. While surface reflection of downwelling longwave radiation is ignored, Sun et al. [32] discussed the effects of these two factors and found that the total effect may be a slight underestimation of the skin temperature. Generally, estimated impact of uncertainty in relevant parameters on in situ LST are as follows: radiometric calibration uncertainty of ±0.2 to 0.5 K can impact LST as much as 0.2 K; emissivity uncertainty of ±1% can impact LST by as much as 0.3 K; downwelling atmospheric radiance uncertainty of ±10% can impact LST by as much as 0.1 K [47].

Cloud Detection
For cloud masking, various ancillary data are needed such as surface type, snow-free channel-1 radiance, and a pixel position information, all in the same dimension and location as the satellite images. The land cover data used in this study for cloud screening implementation, are generated at 1-km resolution [48]. This product includes 14 International Geosphere-Biosphere Programme (IGBP) classes and the underlying surface types are aggregated according to the IGBP classification.
A Coupled Cloud and Snow Detection Algorithm (CCSDA) that was developed initially for use with GOES-8 satellite is adjusted as appropriate for each GOES satellite is used. The algorithm is described in [49,50]. Variants of the approach were tested and evaluated in several publications [51]. In the case of the GOES-8 imager four channels were used to detect clouds, snow, and to perform background analysis for each hour of the diurnal cycle. Beginning with GOES-12, Channel 5 is no longer available (Table 1) [52].
The CCSDA algorithm is capable of producing its own snow analysis using an algorithm that applies three tests using three GOES channels. Alternatively, there is a switch to allow the use of a snow analysis from a different source. The advantage of using the snow analysis generated by the CCSDA algorithm is that it is updated hourly, which provides a more accurate analysis of the expected Remote Sens. 2019, 11, 1399 7 of 23 background when applying the cloud tests. If a daily snow analysis is used, the snow conditions cannot change for each hour of the cloud analysis, and this may introduce error.
In Table 2 we present a description of the cloud screening tests used for GOES-8, along with an explanation of how the tests are assembled to determine a clear probability. The ultimate clear probability (P) can be assembled in various ways on the basis of individual test results. In this method: P i is the clear probability from each individual cloud screening test and n is the total number of cloud screening tests. This assembly method guarantees that the target pixel is cloudy (P = 0) if any individual test identifies it as cloudy (P i = 0). Otherwise, it compiles the confidence levels from all of the individual tests to obtain an overall clear probability.
Referring to Table 2, there was only one cloud screening test that required Channel 5, namely, the FMFT test. In the cloud screening algorithm for GOES-12 and beyond, the FMFT test will not be used, and the test assembly method described for GOES-8 is implemented with one less cloud test.
For each cloud test, threshold levels are used to differentiate between clear and cloudy pixels. For each new satellite, it is necessary to test the thresholds and modify as needed. Here, the cloud mask method applies two spatial tests and one threshold test on an 11-3.7 µm difference image. This fourth test compares the temperature from the 11 µm channel to a 20-day clear-sky composite of 11 µm temperatures, and labels the pixel as cloudy if the difference is greater than the threshold. The pixel level data were gridded to 0.05 • and compared to the Pathfinder Atmospheres-Extended (PA TMOS-X) product [34]. Agreement above 95% for various times of the day was found. The only region which showed slight disagreement between the two independent cloud masks is over areas of complex terrain in the Western US, but even over these regions the two cloud masks agreed over 85% of the time.
The major difference between the day time and night time algorithms (Table 2) is that there are no Reflectance Gross Contrast Test (RGCT, when visible channel is missing) and Channel-2 Albedo Test (C2AT) during the night time. This will mostly affect the detection of reflective clouds. For GOES-12, the FMFT test could not be used. We have tested the cloud screening algorithm for use at nighttime. We applied the nighttime algorithm on daytime data and compared to results when the full daytime algorithm is used. Evaluation of LST estimates in each case is presented in Figure 1. As this figure shows, the daytime algorithm provides better agreement with ground observations than the nighttime one yet, the differences are small as illustrated in Figure 1.
Test (C2AT) during the night time. This will mostly affect the detection of reflective clouds. For GOES-12, the FMFT test could not be used. We have tested the cloud screening algorithm for use at nighttime. We applied the nighttime algorithm on daytime data and compared to results when the full daytime algorithm is used. Evaluation of LST estimates in each case is presented in Figure 1. As this figure shows, the daytime algorithm provides better agreement with ground observations than the nighttime one yet, the differences are small as illustrated in Figure 1.

LST Retrieval Algorithm Development for GOES Satellites
Satellite observed radiance R ↑ o , can be expressed as where is surface emissivity, B(T s ) is blackbody emission at surface temperature T s , X denotes the atmospheric transmittance, R ↑ a and R ↓ a are atmospheric emission to space and surface, respectively. With known surface emissivity and simulated atmospheric emission and transmittance, the surface temperature can be retrieved where B −1 denote the inverse of Planck function for GOES-12 channel 4.
Here, the approach is based on the Radiative Transfer for TOVS (RTTOV) model v11.2 [53][54][55] adjusted for the GEO characteristics and driven with MERRA-2 reanalysis fields. The CAMEL data are also implemented in the method. The advantage of this approach is that it is consistent with the retrieval approach used at JPL to generate the MOD21 product [56]. The processing sequence is described in Figure 2.

LST Retrieval Algorithm Development for GOES Satellites
Satellite observed radiance ↑ , can be expressed as where is surface emissivity, ( ) is blackbody emission at surface temperature , denotes the atmospheric transmittance, ↑ and ↓ are atmospheric emission to space and surface, respectively. With known surface emissivity and simulated atmospheric emission and transmittance, the surface temperature can be retrieved where denote the inverse of Planck function for GOES-12 channel 4. Here, the approach is based on the Radiative Transfer for TOVS (RTTOV) model v11.2 [53][54][55] adjusted for the GEO characteristics and driven with MERRA-2 reanalysis fields. The CAMEL data are also implemented in the method. The advantage of this approach is that it is consistent with the retrieval approach used at JPL to generate the MOD21 product [56]. The processing sequence is described in Figure 2.
Data processing sequence starts from raw digital counts from GOES satellites. Calibration is applied to all channels. Channel 4 (10.2 to 11.2 um) was used for LST retrieval. All channels except channel 3 (6.7 um) were used in the cloud detection algorithm. After cloud screening, GOES observations were resampled to a uniform grid of 0.05° resolution. The atmospheric radiation and transmittance were simulated with the RTTOV model using MERRA-2 fields as input. The MERRA-2 fields were first temporally interpolated to satellite observation time and then collocated to satellite locations. RTTOV calculated upwelling, downwelling radiances and atmospheric transmittance combined with CAMEL surface emissivity to retrieval the LST according to equation 11.

Evaluation of GOES-E Based LST Estimates
We will present results of evaluation for UMD LST retrievals against MOD11 products, the BSRN/SURFRAD network over the USA, the ARM/SGP C1 site over the Southern Great Plains, and the MESONET network over Oklahoma. The issue of evaluation of satellite products of LST against ground measurements is complex, primarily, due to scale issues and known large spatial variability of LST. A comprehensive discussion on all aspects of validation issues are described by Guillevic et Data processing sequence starts from raw digital counts from GOES satellites. Calibration is applied to all channels. Channel 4 (10.2 to 11.2 um) was used for LST retrieval. All channels Remote Sens. 2019, 11, 1399 9 of 23 except channel 3 (6.7 um) were used in the cloud detection algorithm. After cloud screening, GOES observations were resampled to a uniform grid of 0.05 • resolution. The atmospheric radiation and transmittance were simulated with the RTTOV model using MERRA-2 fields as input. The MERRA-2 fields were first temporally interpolated to satellite observation time and then collocated to satellite locations. RTTOV calculated upwelling, downwelling radiances and atmospheric transmittance combined with CAMEL surface emissivity to retrieve the LST according to Equation (11).

Evaluation of GOES-E Based LST Estimates
We will present results of evaluation for UMD LST retrievals against MOD11 products, the BSRN/SURFRAD network over the USA, the ARM/SGP C1 site over the Southern Great Plains, and the MESONET network over Oklahoma. The issue of evaluation of satellite products of LST against ground measurements is complex, primarily, due to scale issues and known large spatial variability of LST. A comprehensive discussion on all aspects of validation issues are described by Guillevic et al. [43] and Göttsche et al. [57].

Scale Issues Related to Satellite and Ground Observations
The ground observations are point observations while the satellite LST product is at pixel level gridded to 0.05 • . To assess the homogeneity of each site, we use the ASTER Global Emissivity Dataset at 1-km V003 (DOI: 10.5067/Community/ASTER_GED/AG1km.003) available for the period of 2000-2008. It is based on observations from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Emissivity Dataset (GED) land surface temperature and emissivity data products using the ASTER Temperature Emissivity Separation (TES) algorithm with a Water Vapor Scaling (WVS) atmospheric correction method with MODIS MOD07 atmospheric profiles and the MODTRAN 5.2 radiative transfer model. The spatial distribution of the emissivity values is illustrated in Figure 3a and their frequency distribution is shown in Figure 3b. As shown, except for the DRA site, the 0.05 • boxes show a high degree of homogeneity at the 1-km scale. As seen from Figure 3b, the emissivity values range between 0.965-0.980 with two distinct peaks of 0.965 and 0.975 with some lower values (0.948) at the DRA site. As also shown in the study of Hulley and Hook [58], who compared ASTER emissivity band 11 (8.6 µm) at 90 m spatial resolution to the same at 1 km, the agreement was very good. The spatial matching of ground and satellite observations is done by taking the weighted average of the pixels that fall in the cell box (0.05 • × 0.05 • ) around the target location of the station. The time matching is done by taking the averages of ±15 min around the start scanning time of GOES12 (this interval is selected based on the duration of the satellite scan).

Evaluation against MOD11
We have conducted an inter-comparison between MOD11_L2 and our LST retrievals. Before comparison, the MOD11_L2 data are rescaled to 0.05 × 0.05 degree latitude/longitude grids. Figure 4 shows an example of comparison between the GOES_RTTOV LST and MOD11_L2 LST on 11 June 2004 UTC 17:15.

Evaluation against MOD11
We have conducted an inter-comparison between MOD11_L2 and our LST retrievals. Before comparison, the MOD11_L2 data are rescaled to 0.05 × 0.05 degree latitude/longitude grids. Figure 4 shows an example of comparison between the GOES_RTTOV LST and MOD11_L2 LST on 11 June 2004 UTC 17:15.
We compared 25 match-up cases for which the two products have overlap and both start scan at 15 min after same hour in 2004. For most cases, the total number of points in the overlap area is more than 40,000. Figure 5 presents the correlation coefficients (corr), the mean bias (bias), the standard deviation (std) and the root mean square error (rmse) between the two products. As seen, in most cases, the two products yield close correlation. Only in one case the coefficient is less than 0.8. The averaged corr of all cases is 0.91. More than 50% cases have mean bias less than 2 K, and the averaged value is 1.7 K. The averaged std and rmse are 2.7 and 3.3 K respectively.

Evaluation against ARM SGP Site at Instantaneous Time Scale
The IRT data are available from two levels of a tower; one instrument was located at 25 m and one at 10 m above ground. The probability distribution of differences between GOES_RTTOV_LST and ARM IRT is shown in Figure 6 using all available retrievals. Obviously, most of the differences fall within the interval of 1 std. Less than 20% of the cases are beyond 1 std. The correlation between the two data sets is high for both levels, (>than 0.80 for all cases). The mean differences at daytime are smaller than at nighttime at both levels and the same applies to std and rmse. Numerical values for the cases of Figure 6 are shown in Table 3. Differences due to the height exposure of the instrument can be caused by differences in the field of view of the instrument, and as such, different shading effects We compared 25 match-up cases for which the two products have overlap and both start scan at 15 min after same hour in 2004. For most cases, the total number of points in the overlap area is more than 40,000. Figure 5 presents the correlation coefficients (corr), the mean bias (bias), the standard deviation (std) and the root mean square error (rmse) between the two products. As seen, in most cases, the two products yield close correlation. Only in one case the coefficient is less than 0.8. The averaged corr of all cases is 0.91. More than 50% cases have mean bias less than 2 K, and the averaged value is 1.7 K. The averaged std and rmse are 2.7 and 3.3 K respectively.

Evaluation against ARM SGP Site at Instantaneous Time Scale
The IRT data are available from two levels of a tower; one instrument was located at 25 m and one at 10 m above ground. The probability distribution of differences between GOES_RTTOV_LST and ARM IRT is shown in Figure 6 using all available retrievals. Obviously, most of the differences fall within the interval of 1 std. Less than 20% of the cases are beyond 1 std. The correlation between the two data sets is high for both levels, (>than 0.80 for all cases). The mean differences at daytime are smaller than at nighttime at both levels and the same applies to std and rmse. Numerical values for the cases of Figure 6 are shown in Table 3. Differences due to the height exposure of the instrument can be caused by differences in the field of view of the instrument, and as such, different shading effects. Remote Sens. 2019, 11, x FOR PEER REVIEW 12 of 23           the ARM IRT observations, yet, the difference between them is less than 1% and the std and rmse are also around 1%; the performance at daytime is better than at nighttime, most likely, due to better cloud detection during the daytime when observations from the visible channel are available. underestimates the ARM IRT observations, yet, the difference between them is less than 1% and the std and rmse are also around 1%; the performance at daytime is better than at nighttime, most likely, due to better cloud detection during the daytime when observations from the visible channel are available.

Evaluation against SURFRAD/BSRN
The SURFRAD/BSRN network observes upwelling (F↑) and down-welling ( ↓) radiative fluxes which are converted to temperature as follows: where is the surface broadband emissivity assigned by surface type, is the Stefan-Boltzmann constant and is equal to 5.669 × 10 −8 J m −2 s −1 K −4 . Then The approach we use was also applied by others. The main issue in the conversion is the value of emissivity. Heidinger et al. [34] use a broadband longwave emissivity assumed to be 0.97. They indicate that a 0.1 error in emissivity equates to an error in the SURFRAD LST not exceeding 0.25 K. Yu et al. [59] also used the SURFRAD data to evaluate their LST retrievals following the same procedures. In their approach, the emissivity is estimated by mapping surface type classification of Snyder et al. [60] to emissivity (an approach that was popular for some time when direct information on emissivity was not available). They assume that the mean broadband emissivity of the satellite sensor is applicable. We use the CAMEL emissivity which is derived spectrally and integrated to the window spectral interval of the satellite used, and variable at monthly time scale; namely, for each month and for each location the spectral values are integrated to give a new broadband value. This is the most advanced use of surface emissivity in such retrievals.

Evaluation against SURFRAD/BSRN
The SURFRAD/BSRN network observes upwelling (F↑) and down-welling ( F ↓ ) radiative fluxes which are converted to temperature as follows: where ε IR is the surface broadband emissivity assigned by surface type, σ is the Stefan-Boltzmann constant and is equal to 5.669 × 10 −8 J m −2 s −1 K −4 . Then The approach we use was also applied by others. The main issue in the conversion is the value of emissivity. Heidinger et al. [34] use a broadband longwave emissivity assumed to be 0.97. They indicate that a 0.1 error in emissivity equates to an error in the SURFRAD LST not exceeding 0.25 K. Yu et al. [59] also used the SURFRAD data to evaluate their LST retrievals following the same procedures. In their approach, the emissivity is estimated by mapping surface type classification of Snyder et al. [60] to emissivity (an approach that was popular for some time when direct information on emissivity was not available). They assume that the mean broadband emissivity of the satellite sensor is applicable. We use the CAMEL emissivity which is derived spectrally and integrated to the window spectral interval of the satellite used, and variable at monthly time scale; namely, for each month and for each location the spectral values are integrated to give a new broadband value. This is the most advanced use of surface emissivity in such retrievals.
The scatter plots of the instantaneous GOES_RTTOV LST against SURFRAD sites for both daytime and nighttime are shown in Figure 8. As seen, the satellite estimates and the ground observations have very high correlation, mostly above 0.98. For daytime (left panel Figure 8) the differences ranged between 0.4 (0.2%) to 1.16 (0.4%) while the std ranged between 1.88 (0.6% to 2.53 (0.9%) respectively. For nighttime (right panel Figure 8)  The scatter plots of the instantaneous GOES_RTTOV LST against SURFRAD sites for both daytime and nighttime are shown in Figure 8. As seen, the satellite estimates and the ground observations have very high correlation, mostly above 0.98. For daytime (left panel Figure 8) the differences ranged between 0.4 (0.2%) to 1.16 (0.4%) while the std ranged between 1.88 (0.6% to 2.53 (0.9%) respectively. For nighttime (right panel Figure 8) the results are comparable to daytime.

Evaluation against the Oklahoma MESONET Sites
The distribution of sites used in current evaluation is illustrated in Figure 3a. The evaluations are carried out against all the stations for both daytime and nighttime in January and July during [2004][2005][2006][2007]. Results are presented in Figure 9 where outliers outside 1 std were removed. Red color designates results that have bias smaller than 1 std, ranging from 1.74 to 2.47 K. The retrieved data have high correlation with the in-situ data (>than 0.9). Results of daytime and nighttime for January and July are comparable (unlike the results of Figure 6 where all outliers were used).

Evaluation against the Oklahoma MESONET Sites
The distribution of sites used in current evaluation is illustrated in Figure 3a. The evaluations are carried out against all the stations for both daytime and nighttime in January and July during [2004][2005][2006][2007]. Results are presented in Figure 9 where outliers outside 1 std were removed. Red color designates results that have bias smaller than 1 std, ranging from 1.74 to 2.47 K. The retrieved data have high correlation with the in-situ data (>than 0.9). Results of daytime and nighttime for January and July are comparable (unlike the results of Figure 6 where all outliers were used).

Applications
• Seasonal distribution of LST at monthly scale Since many users of LST data are interested in monthly mean values, we have conducted a comprehensive comparison at such scale. This was possible due to the availability of both ground observations and satellite estimates for a period of six years.
The evaluation was expanded to include ground observations at the USA BSRN/SURFRAD sites and we also used information from MOD11C3, Version: 006 at 0.05° (https://lpdaac.usgs.gov/dataset_discovery/modis/modis_products_table/mod11c3_v006). Derived statistics includes mean values, standard deviation, maximum/minimum, and medium values for each month are shown in Figure 10. It is clearly that the GOES_RTTOV LST has very close distribution pattern as it of SURFRAD/BSRN. It has the ability to describe the annual variability of the LSTs at SURFRAN/BSRN sites. At DRA, MOD11C3v6 LST yields higher estimations against SURFRAD/BSRN for most seasons, the annual mean LST for all study years is 306.6 K, which is 3.1 K higher than the value of SURFRAD/BSRN. While the GOES_RTTOV estimations are much closer to the site value. The annul mean LST of GOES_RTTOV is 301.6 K. For BON, both of the MOD11C3v6

Applications
• Seasonal distribution of LST at monthly scale Since many users of LST data are interested in monthly mean values, we have conducted a comprehensive comparison at such scale. This was possible due to the availability of both ground observations and satellite estimates for a period of six years. The evaluation was expanded to include ground observations at the USA BSRN/SURFRAD sites and we also used information from MOD11C3, Version: 006 at 0.05 • (https://lpdaac.usgs.gov/dataset_ discovery/modis/modis_products_table/mod11c3_v006). Derived statistics includes mean values, standard deviation, maximum/minimum, and medium values for each month are shown in Figure 10. It is clearly that the GOES_RTTOV LST has very close distribution pattern as it of SURFRAD/BSRN. It has the ability to describe the annual variability of the LSTs at SURFRAN/BSRN sites. At DRA, MOD11C3v6 LST yields higher estimations against SURFRAD/BSRN for most seasons, the annual mean LST for all study years is 306.6 K, which is 3.1 K higher than the value of SURFRAD/BSRN. While the GOES_RTTOV estimations are much closer to the site value. The annul mean LST of GOES_RTTOV is 301.6 K. For BON, both of the MOD11C3v6 and GOES_RTTOV estimations are close to the site values, except April, May and June. The annul mean LSTs of SUFRAD/BON, GOES_RTTOV and MOD11 are 287.9 K, 288.0 K, 289.8 K respectively. And for GWN, the GOES_RTTOV has relatively lower estimation of annual mean LST which is 293.6 K. The site value is 295.1 K. The MOD11 is 295.7 K. The performance of the satellite estimations at FPK is similar as DRA. The MOD11 annual mean LST is 288.4 K, the GOES_RTTOV is 283.8 K, and the site value is 285.2 K.   Figure 11 for illustration of the product.
As shown, for July, the differences in surface temperature during these two hours (close to representing daily max and min), are large. During the daytime, the western part of the US is dominated by clear sky conditions (the 100th W longitude is known to separate between the humid and dry parts of the US). During the nighttime, the clear conditions contribute to cooling by emitted LW radiation especially, over high elevations. Noticeable is also the pronounced latitudinal variability in the LST during January, dominated by solar zenith angle dependence of heating by SW radiation. The high spatial and temporal resolution of this product makes it useful for addressing hydrological issues such as modeling of evapotranspiration, snow-melt, or soil moisture estimation (utilizing morning heating rates) [61].  Figure 11 for illustration of the product. In Figure 12 we depict the diurnal variation of LST as observed at four SURFRAD/BSRN stations and from GOES-12. Notable is the large amplitude at the dry site of Desert Rock (DRA) (characterized as desert, gravel, flat, rural) as compared to the more vegetated regions at the other sites (BON is grass, flat, rural; FPK is grass, flat while GWN is grass, hilly, rural). The effect of latitude is also evident. The amplitude at GWN which is at ~34°N is much smaller than the amplitudes at the higher latitude stations (BON at ~40° and FPK at ~48°). Of interest are the differences between satellite estimates and ground observations which are more noticeable at DRA and FPK than at the other sites. A possible explanation for DRA is the lower homogeneity of the site compared to the others. The FPK is at higher elevation than BND and GCM and also at higher latitude so possibly, the cooling of the ground at the observational site may not represent the grid domain. Additional investigation is needed to better understand the behavior at these four sites during the earlier part of the day. The full agreement between the satellite and ground observations from about noon to late afternoon can possibly be due to more even heating of the ground than at the earlier hours of the day when the higher moisture content can differentially affect the emissivity. While ground observations are very sparse, the findings shown in Figure 12 indicate that satellites alone can be used to characterize the diurnal cycle over the domain of the GOES satellites (a comprehensive analysis over the entire US is needed). This information is of considerable interest since most satellite based estimates of LST use polar orbiters unable to depict the true diurnal cycle. As shown, for July, the differences in surface temperature during these two hours (close to representing daily max and min), are large. During the daytime, the western part of the US is dominated by clear sky conditions (the 100th W longitude is known to separate between the humid and dry parts of the US). During the nighttime, the clear conditions contribute to cooling by emitted LW radiation especially, over high elevations. Noticeable is also the pronounced latitudinal variability in the LST during January, dominated by solar zenith angle dependence of heating by SW radiation. The high spatial and temporal resolution of this product makes it useful for addressing hydrological issues such as modeling of evapotranspiration, snow-melt, or soil moisture estimation (utilizing morning heating rates) [61].
In Figure 12 we depict the diurnal variation of LST as observed at four SURFRAD/BSRN stations and from GOES-12. Notable is the large amplitude at the dry site of Desert Rock (DRA) (characterized as desert, gravel, flat, rural) as compared to the more vegetated regions at the other sites (BON is grass, flat, rural; FPK is grass, flat while GWN is grass, hilly, rural). The effect of latitude is also evident. The amplitude at GWN which is at~34 • N is much smaller than the amplitudes at the higher latitude stations (BON at~40 • and FPK at~48 • ). Of interest are the differences between satellite estimates and ground observations which are more noticeable at DRA and FPK than at the other sites. A possible explanation for DRA is the lower homogeneity of the site compared to the others. The FPK is at higher elevation than BND and GCM and also at higher latitude so possibly, the cooling of the ground at the observational site may not represent the grid domain. Additional investigation is needed to better understand the behavior at these four sites during the earlier part of the day. The full agreement between the satellite and ground observations from about noon to late afternoon can possibly be due to more even heating of the ground than at the earlier hours of the day when the higher moisture content can differentially affect the emissivity. While ground observations are very sparse, the findings shown in Figure 12 indicate that satellites alone can be used to characterize the diurnal cycle over the domain of the GOES satellites (a comprehensive analysis over the entire US is needed). This information is of considerable interest since most satellite based estimates of LST use polar orbiters unable to depict the true diurnal cycle.

Discussion
Available information on LST and DTR from remotely sensed data is deficient. Discrepancies and inconsistencies arise due to the quality of the satellite and the ground observations, differences in their spatial, spectral and temporal resolution, as well as differences in the inference methods and auxiliary data used. In principle, the well-established split window approach is known to perform better than the use of a single channel for deriving LST however, the 12 µm channel is not available any more during the operational period of GOES 12-15. To homogenize satellite observations to obtain a consistent long term record requires the utilization of observations from a single channel only.
With the advancement in archiving of satellite data, their maintenance in terms of calibration, geolocation, improved inference schemes and auxiliary information, it is timely to formulate an approach for deriving long-term, consistent, and calibrated data across multiple satellite sensors, as demanded by the user community. Progress has also been made in ground observations in terms of instrument characterization, guidelines for high quality maintenance and calibration. The issue of optimal coupling between satellite and ground observations is still widely debated.
LST is known to have large spatial variability at different temporal scale (diurnal, annual, interannual) and this variability has an informative value. For instance, the importance of the diurnal cycle of LST has been widely recognized [62][63][64] and numerous attempts have been made to estimate it. In an early attempt [65], used were the International Satellite Cloud Climatology Project (ISCCP) data (at 280 km resolution C-2 product) [66] in combination with ground observations to derive the monthly mean diurnal cycle in surface temperature over land (suitable for Global Climate studies). Duan et al. [64] tried to determine it using High Spatial Resolution Clear-Sky MODIS Data while Inamdar et al. [33] dis-aggregated the diurnal cycle of LST at the GOES pixel scale to that of the MODIS pixel scale. Yet, the daytime and nighttime products from polar orbiting satellites (e.g., MODIS) do not fully represent the daily amplitude as feasible from GEO satellites. Our effort represents a contribution to the development of a framework for obtaining long term records of consistent LST and DTR from the entire record of GOES satellites, using a physically based approach and utilizing the best currently available auxiliary information and the best available ground observations to evaluate the proposed approach.
In the evaluation process, factors that play a role include differences in ground instrumentation, their location above the surface, method of estimating LST from the measured outgoing LW

Discussion
Available information on LST and DTR from remotely sensed data is deficient. Discrepancies and inconsistencies arise due to the quality of the satellite and the ground observations, differences in their spatial, spectral and temporal resolution, as well as differences in the inference methods and auxiliary data used. In principle, the well-established split window approach is known to perform better than the use of a single channel for deriving LST however, the 12 µm channel is not available any more during the operational period of GOES 12-15. To homogenize satellite observations to obtain a consistent long term record requires the utilization of observations from a single channel only.
With the advancement in archiving of satellite data, their maintenance in terms of calibration, geolocation, improved inference schemes and auxiliary information, it is timely to formulate an approach for deriving long-term, consistent, and calibrated data across multiple satellite sensors, as demanded by the user community. Progress has also been made in ground observations in terms of instrument characterization, guidelines for high quality maintenance and calibration. The issue of optimal coupling between satellite and ground observations is still widely debated.
LST is known to have large spatial variability at different temporal scale (diurnal, annual, inter-annual) and this variability has an informative value. For instance, the importance of the diurnal cycle of LST has been widely recognized [62][63][64] and numerous attempts have been made to estimate it. In an early attempt [65], used were the International Satellite Cloud Climatology Project (ISCCP) data (at 280 km resolution C-2 product) [66] in combination with ground observations to derive the monthly mean diurnal cycle in surface temperature over land (suitable for Global Climate studies). Duan et al. [64] tried to determine it using High Spatial Resolution Clear-Sky MODIS Data while Inamdar et al. [33] dis-aggregated the diurnal cycle of LST at the GOES pixel scale to that of the MODIS pixel scale. Yet, the daytime and nighttime products from polar orbiting satellites (e.g., MODIS) do not fully represent the daily amplitude as feasible from GEO satellites. Our effort represents a contribution to the development of a framework for obtaining long term records of consistent LST and DTR from the entire record of GOES satellites, using a physically based approach and utilizing the best currently available auxiliary information and the best available ground observations to evaluate the proposed approach.
In the evaluation process, factors that play a role include differences in ground instrumentation, their location above the surface, method of estimating LST from the measured outgoing LW radiation, calibration and maintenance of the instruments and scale issues between ground observations and satellite footprints. There is a need to ensure that the satellite observations used represent clear sky condition. Detailed information on each of these factors is needed for a full assessment of errors in the retrieved LST products. While under controlled short term experiments the uncertainty of many of these factors can be minimized, the results obtained are not representative for extended areas and all seasons.
In this paper, the quality of the new product is evaluated against extensive record of best available observations and products that are accepted by the scientific community. Specifically, evaluation of a six-year record of instantaneous LST as well as monthly averages was performed against the DOE Atmospheric Radiation Measurement (ARM) site at the Southern Great Plains central facility, the BSRN/SURFRAD stations, MOD11 products and the Oklahoma MESONET sites.
While the quality of the instrumentation used at each site can be traced to factory specifications, it is not possible to establish how much differences in daily maintenance at each site contribute to the quality of the observations. The hypothesis of our approach is that by using long term observations at numerous sites and seasons, the evaluation results do provide an indication on the robustness of the approach. One of the major factors affecting the evaluation results is related to cloud screening which vary among methodologies as recently discussed in Ermida et al. [67]. Spatial and temporal variability in emissivity are also a contributing factors. As reported in [6] a brightness temperature error due to emissivity error in the 11 µm band is about 3% for a 0.5% error in emissivity and up to 5% for an emissivity error of 2.0%; these estimates are based on global simulations over a wide temperature range. To fully understand discrepancies between products, there is a need in controlled experiments to evaluate independently factors that can cause differences. Till now, available retrievals are based on different satellite observation, different retrieval methodology, atmospheric inputs and time periods. An early attempt to compare the performance of several well-known algorithms was presented in {6]. To make such algorithm comparison consistent the individual methodologies need to be modified; it is necessary to rederive relevant coefficients of the algorithms used in a systematic manner using the same inputs. The need for controlled experiments to facilitate discussion on sources of discrepancies between methods has been recognized by the scientific community and is conducted frequently. Examples are numerical model evaluation as conducted at Lawrence Livermore National Laboratory (https://pcmdi.llnl.gov/?projects/amip/0), while controlled experiments to estimate errors due to aerosols is described in Randles et al. [68]. The objective of the current study is to present a credible methodology to generate long term time series of LST at best available spatial and temporal resolution (that currently are possible with a long term outlook), and evaluate it against best available satellite products and ground observations. Used were long term observations that represent different climatic regions and seasons that provided statistically robust indication on the soundness of the propose approach. The need for further work to investigate the sources of discrepancies is also recognized. Limitations and advantages of each methods and their trade-offs need also to be fully understood.

Summary
In principle, the split window approach is known to perform better than a single channel to derive LSTs. But the 12 µm channel is not available any more during the operation periods of GOES 12-15. To homogenize satellite observations to a consistent long term record requires the use of a single channel observation.
We have implemented the RTTOV radiative transfer approach adjusted for GEO channel 4 to derive LST at the high resolution of about 5-km. The model is driven with the MERRA-2 reanalysis profiles for water vapor and temperature and the CAMEL product. A homogeneous six year record of LST at 0.05 • spatial resolution at hourly time scale was produced from GOES observations and evaluated for the period of 2004-2009. A six year climatology at monthly time scales was also derived and used to construct representative diurnal cycles for selected surface type.