Validation of AVHRR Land Surface Temperature with MODIS and In Situ LST—A TIMELINE Thematic Processor

: Land Surface Temperature (LST) is an important parameter for tracing the impact of changing climatic conditions on our environment. Describing the interface between long- and shortwave radiation ﬂuxes, as well as between turbulent heat ﬂuxes and the ground heat ﬂux, LST plays a crucial role in the global heat balance. Satellite-derived LST is an indispensable tool for monitoring these changes consistently over large areas and for long time periods. Data from the AVHRR (Advanced Very High-Resolution Radiometer) sensors have been available since the early 1980s. In the TIMELINE project, LST is derived for the entire operating period of AVHRR sensors over Europe at a 1 km spatial resolution. In this study, we present the validation results for the TIMELINE AVHRR daytime LST. The validation approach consists of an assessment of the temporal consistency of the AVHRR LST time series, an inter-comparison between AVHRR LST and in situ LST, and a comparison of the AVHRR LST product with concurrent MODIS (Moderate Resolution Imaging Spectroradiometer) LST. The results indicate the successful derivation of stable LST time series from multi-decadal AVHRR data. The validation results were investigated regarding different LST, TCWV and VA, as well as land cover classes. The comparisons between the TIMELINE LST product and the reference datasets show seasonal and land cover-related patterns. The LST level was found to be the most determinative factor of the error. On average, an absolute deviation of the AVHRR LST by 1.83 K from in situ LST, as well as a difference of 2.34 K from the MODIS product, was observed.


Introduction
LST is an important quantity for tracing the impact of changing climatic conditions on our environment from the local to the global scale. As a key parameter in the energy exchange at the Earth's surface, LST describes the interface between long-and shortwave radiation fluxes on one side, and turbulent heat fluxes and the ground heat flux on the other side. Furthermore, LST is recognized as one of the Essential Climate Variables (ECVs) by the World Meteorological Organization [1]. As it represents the temperature of the surface and is strongly linked to the near-surface air temperature, it can also be directly used for monitoring the global warming taking place on our planet in the last few decades.  [42]. Example of a MODIS LST product by [43].
One goal of the TIMELINE project is to retrieve accurate and consistent LST products for the entire operating period of AVHRR sensors over Europe. The basis is the 1.1 km High-Resolution Picture Transmission (HRPT) and Local Area Coverage (LAC) data. An enhanced preprocessing to Level 1B data accounts for geometric distortions due to rotation and satellite clock errors, varying spectral responses of different AVHRR sensors, calibration drift, orbit drift, sensor degradation, and atmospheric influences [44,45]. Moreover, enhanced cloud, water, and snow masks are developed [46,47]. For the subsequent LST calculation, Frey et al. [4] created a new method to achieve the best performance on AVHRR data using the Qin et al. [48] and the Becker and Li [5] algorithms. However, to make these data useful for climate change studies, the accuracy of the remotely sensed LST must be known. Guillevic et al. [10] stated that an uncertainty and precision of less than 1 K is required for LST data to be useful for climate applications. Frey et al. [4] analyzed the accuracy of their LST algorithm mathematically during development, as well as using 17 tiles of LST derived from MODIS data. The mathematical assessment resulted in a very good fit, with a mean absolute deviation (MAD) of under 0.5 K, while the comparison with MODIS LST resulted in MADs between 1 and 3 K [4]. However, the accuracy of this LST product has not been validated against in situ measurements, nor have the effects of factors such as land surface type, water vapor content in the atmosphere, or viewing geometry on the uncertainty of LST retrievals been assessed. Additionally, there has been no broad validation, covering all seasons, climatological conditions and land cover types.
The aim of this study is thus to fill this gap by assessing the accuracy and consistency of the new TIMELINE LST product derived from AVHRR data. Thus, the study assesses the effects of different total columnar water vapor (TCWV) levels and sensor view angles (VA) on the performance of the TIMELINE LST algorithm. On the other hand, possible seasonal effects and differences for land cover types and LST ranges are also considered. This results in the following research questions: (a) How accurate is the TIMELINE AVHRR LST product? (b) How robust is it to variances in TCWV, VA and land cover? (c) How consistent is the TIMELINE AVHRR LST over different LST ranges and over time?
Only daytime LST is validated because the reflectance values in the first two AVHRR bands are needed for the emissivity estimation. While the emissivity is normally constant over 24 h, the reflectance of the corresponding daytime scenes cannot be used because of the scene-based TIMELINE Level 2 framework. The validation approach consists of an assessment of the consistency of the AVHRR LST time series, an inter-comparison between AVHRR LST and in situ LST obtained at 10 stations located in Europe, North America and Southern Africa, and a comparison of the AVHRR LST product to the concurrent MODIS LST product MOD11_L2. The dataset for comparison with the validation data comprises the years 2003 to 2014. To assess the earlier years (1981)(1982)(1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999) of TIMELINE LST, the full time series is extracted at three sites. The time series is normalized to a standard observation time using a diurnal temperature cycle (DTC) model, and analyzed for consistency.

Study Area and Period
The area and period of the TIMELINE LST validation are determined by the study area of the TIMELINE project and the availability of in situ data. The validation between TIMELINE and MODIS LST was conducted for the years 2003 to 2014, where the most overlaps between TIMELINE and MODIS LST over the TIMELINE study area were found. Only in situ LST data from two sites in the TIMELINE study area were available for the period 2010-2013. Eight more stations were available in North America and South Africa (see Figure 2). Therefore, the validation of TIMELINE LST was extended to these regions for this period. The in situ measurement stations are described in Section 2.4. The study area of the TIMELINE project is Europe and northern Africa, hawithving the same extent as the European Environmental Agency (EEA) reference grid: 900,000 m east, 900,000 m north to 7,400,000 m east, 5,500,000 m north. According to the Köppen-Geiger classification [49] [49]. The African sites Heimat/Rust mijn Ziel (HE) experience an arid, hot desert climate (Bsh), according to Köppen-Geiger classification [49]. The validation study is therefore representative of a wide range of climatic conditions. The land cover in DR is defined as arid shrubland. BND and PEN are located in an agricultural area, and the land cover at these sites is cropland. The land cover at the remaining stations is grassland [15,42,50]. The measurement sites of BO, PEN and GC are relatively close to the forest edge, which is therefore also covered by the corresponding AVHRR pixels.

LST Derivation Algorithm
LST is retrieved from the brightness temperatures of cloud-free pixels over land using the TIMELINE Level 1B product, the Level 2 cloud mask, and the Level 2 water mask. When developing the TIMELINE LST processor, Frey et al. [4] analyzed the accuracy and error robustness of several split-window and mono-window algorithms for AVHRR data. They concluded that among the split-window algorithms, the algorithm by Becker and Li [5] is preferred, and the algorithm by Qin et al. [48] performed the best from the monowindow algorithms. Additionally, they developed the algorithm to make it applicable to all NOAA platforms in order to generate a consistent LST product for the whole AVHRR period. To enhance the error robustness of the algorithms, coefficients were added that should compensate for the impacts of TCWV, VA, and high LST values. A different set of coefficients is used for each sensor to consider the different spectral response curves of the NOAA sensors. Equation (1) shows the resulting split-window algorithm: where t 4 and t 5 are the brightness temperatures, is the mean emissivity of both TIR channels, d is the emissivity difference in both TIR channels and p 0 -p 5 are the coefficients added by Frey et al. [4]. Equation (2) shows the resulting mono-window algorithm: where t 4 is the brightness temperature, is the emissivity, T atm is the mean atmospheric temperature, TCWV is the total columnar water vapor, and again p 0 -p 5 are the coefficients added by Frey et al. [4].

AVHRR Data
The AVHRR sensors on board the National Oceanic and Atmospheric Administration's (NOAA) polar orbiting satellites (NOAA-1 to NOAA-19) as well as on the Metereological Operational (MetOp) satellites (MetOp-A, -B and -C) have delivered daily observations in visible, near-infrared, and thermal wavelengths from the early 1980s until today. Three different generations of AVHRR sensors have been developed over the past decades, providing observations in four, five, and six spectral channels, for each generation [44]. The resolution is around 1.1 km at the nadir, but this widens up to 6.5 km with the observation angle, which spans to a maximum of ±55.4 degrees.
LST is derived from AVHRR Level 1B data, which are calibrated, corrected for geolocation errors, and quality-checked. For LST over the TIMELINE study area, TIMELINE Level 1B data from the DIMS archive at DLR are used. The comprehensive preprocessing chain, which also comprises the calculation of the correspondent sun and satellite zenith angles, is described in [45]. For LST over North America and southern Africa, NOAA Level 1B data, which were downloaded from the NOAA Comprehensive Large Array Data Stewardship System (CLASS), are used. Their preprocessing, which also generates the correspondent sun and satellite zenith angles, is described in [51].

Auxiliary Data: TCWV, T atm and Land Cover Data
TCWV (kg/m 2 ) is defined as the total gaseous water contained in a vertical column of atmosphere. High TCWV levels often lead to a decrease in the accuracy of LST derived with split-window algorithms, because of the different and non-linear absorption properties of water vapor between two infrared channels [52]. In this work, TCWV data from the ERA-Interim Archive at the European Centre for Medium-Range Weather Forecasts (ECMWF) are used. The data comprise a meteorological reanalysis with a temporal resolution of 3 h and a spatial resolution of about 79 km [53]. For the derivation of TIMELINE LST, it is projected to the orbit of the respective input scene.
T atm is a crucial parameter for the mono-window algorithm used by Qin et al. [48]. It is a proxy for the atmospheric upwelling radiance, which adds up to the land surface radiance received by the sensor. Frey et al. [4] derived T atm by using the mean of all temperatures in the vertical column for each atmospheric profile from the atmospheric profile database Seebor 5.0 [54]. This definition of T atm differs from the definition by [55], who defined T atm as the temperature in the vertical column, integrated over the water vapor content at the different altitudes. However, the first validation results derived during the revision of the algorithm by Qin et al. [48] have shown that applying the definition of T atm by [55] leads to substantially larger errors of TIMELINE LST, mainly at high temperatures and VAs. Therefore, it was decided to retain the definition of T atm in [4].
To reduce the number of input datasets for operational LST processing, T atm is proxied with the air temperature at a two-meter altitude (T 2m ). Using one half of the Seebor 5.0 dataset, Equation (3) was derived: Using the other half of the Seebor 5.0 dataset, the modeled T atm was compared to the real T atm , leading to an r 2 of 0.89. For the calculation of TIMELINE LST, T atm is derived from the ERA 5 T 2m dataset [56], which comprises reanalysis data with a temporal resolution of 3 h and a spatial resolution of about 31 km, using Equation (3).
As can be seen from Equations 1 and 2, the derivation of LST from TIR data is influenced by the emissivity of the surface. The emissivity is defined as the ratio between the TIR radiation emitted by the surface to the TIR radiation emitted by a black body. As the emissivity depends on the surface material and structure, it varies with land cover. The vegetation cover method by Caselles et al. [57] is used to estimate the emissivity. It combines spectral information about different land cover types with the seasonal fluctuation of vegetation cover, by estimating the fraction of vegetation cover (FVC) of each scene from the AVHRR reflective bands. The land cover type of each pixel is obtained from land cover classifications and assigned to an emissivity class. For the derivation of TIMELINE LST for the period from 2008, the GlobCover classification for 2009 [58] was used. For AVHRR 2 and 3 data from the years before 2008, the GlobCover classification for 2005 [59] was used. For AVHRR 1 data, the CCI classification for 1992 was used [60], adapting the vegetation cover method to the CCI classification system, as shown in Table S1.

LST Quality and Uncertainty
Additional to LST estimates, a quality and an uncertainty layer produced. The uncertainty value combines the uncertainty of the LST algorithm, the uncertainty from the emissivity estimation, the noise equivalent differential temperature and the uncertainty from geolocation. The uncertainty of the LST algorithm and the uncertainty from the emissivity estimation were simulated by [4]. For the noise equivalent differential temperature, a constant uncertainty of 0.12 K is assumed, derived from [61]. The uncertainty due to geolocation is the standard deviation of LST within an n × n window, with n either equal to 3, 5 or 7 depending on the quality of the Level 1B input data. The calibration uncertainty is simulated for an error of one percent in the brightness temperature. The uncertainties are combined with the following equation: where alg is the uncertainty of the algorithm, emi is the uncertainty of emissivity, nedt is the noise equivalent differential temperature, geo is the uncertainty due to geolocation errors and cal is the calibration uncertainty. The quality flags carry information about the quality of the emissivity estimation, the LST algorithm and the Level 1B input data. However, to assess TIMELINE LST for all land covers and climatic conditions, these flags were ignored.

Daytime Normalization
The assessment of the consistency of the whole TIMELINE LST time series is affected by the different observation times of the individual satellites, as well as by orbit drift, experienced by most of the NOAA satellites, leading to different observation times even within the time series of one satellite.
Daytime normalization can solve both problems. Daytime normalization is often performed by applying a diurnal temperature cycle (DTC) model. Göttsche and Olesen [62] proposed a DTC model developed for METEOSAT data. The daytime part of their model can be expressed as: where T 0 is the LST at sunrise, T a is the temperature amplitude of the day, ω is the daytime length, t is the observation time and t m is the time of the maximum temperature. This model has already been used by Liu et al. [63] to normalize AVHRR LST to a certain observation time by transforming Equation (5) into the following equation: where t 1 and t 2 are two different times in the DTC model. The daytime length ω can be calculated via the geolocation and the day of the year. T 0 , T a , and t m are unknown in this study, and have to be modeled.

MODIS LST
In this study, the MYD11_L2 Collection-6 data [64] were used for comparison. The MODIS sensors, carried on the Terra and Aqua satellites, have provided daily data since the years 2000 (Terra) and 2002 (Aqua) with a spatial resolution of about 1 km at the nadir in the TIR channels 31 and 32. Each product contains LST from 5 minutes of MODIS acquisition. The MODIS Level 2 LST products are generated using the generalized splitwindow algorithm [11]. They are widely used and repeatedly validated [13,65]. The inputs for the algorithm are, besides the MODIS cloud and snow masks, the TIR bands from the radiance data product (MOD021KM), the water vapor data from MOD07_L2, and the MODIS land cover MCDLC1KM. Similar to the TIMELINE LST algorithm, the generalized split-window algorithm uses different coefficients for day-and nighttime LST. Atmospheric corrections were obtained from MODTRAN 4 simulations. View angle errors are corrected with a quadratic term of the difference between brightness temperatures in the thermal bands. Emissivity is calculated via the classification-based method by [66]. This method provides emissivity values for various land cover classes, which are identified based on land cover type and seasonal dynamic, also taking into account angular effects through a bidirectional reflectance distribution function model.
Quality flags for every pixel are calculated during the production. The 16-bit flags contain information about the quality of the input data, cloud contamination, the quality of the emissivity and the quality of the split-window algorithm application [67].

In Situ LST
In situ radiometers and the AVHRR sensor measure on totally different spatial scales. Therefore, in situ measurement sites have to be representative for the surrounding pixel area, and have to be in an homogeneous environment [15]. Only a few TIR measurement stations exist that match these criteria. Ten measurement sites have been chosen, located in Europe (EV and DN), southern Africa (HE), and North America (seven surface radiation network (SURFRAD) stations). As only in situ data for the years 2010 and 2013 are available for HE, the investigation period of the validation with in situ data is 2010-2013. Table 1 provides an overview of the used validation data, while Figure 3 shows the surroundings of the sites on the AVHRR scale. Table 1. List of in situ measurement stations specifying the cover at and around the sites according to [15,42,50,68]. The measurement settings and calculation of the in situ LST for HE are described in [15]. The measurement settings and calculation of the in situ LST for EV are described in [68]. The in situ LST for DN was measured and calculated as described in [50]. SURFRAD radiometer measurements were conducted as described in [42], and were processed into LST using the Stefan-Boltzmann law. The surface emissivity at the SURFRAD sites has been derived via the broadband emissivity method, as suggested by [69], based on the global 5 km monthly MODIS LST/emissivity product MOD11C3 [70].

Validation Approach
The validation approach consists of the following aspects. (a) Inter-comparison between AVHRR LST and in situ LST obtained at 10 stations located in Europe, North America and southern Africa. The in situ measurements at DN were performed every 10 s, but the final value was stored as a 5 min average. At the other stations, the measurements have a temporal resolution of one minute. All in situ measurements were filtered using the quality flags, generated by the respective data provider. The AVHRR LST was filtered using the quality flags from the Level-1B input data and with the uncertainty value (<2 K), which was generated during the TIMELINE LST production. For comparison, the average LST in a window of three by three pixels over the station was calculated, but only if all values in that window were valid and only if the standard deviation in this window was less than 1 K. This was to ensure a valid match between the in situ point measurement and the averaged raster data. On the one hand, this takes account of possible geolocation errors, which are common in AVHRR data. On the other hand, cloud borders are filtered out, which were responsible for most of the outliers. In total, 2402 individual in situ measurements were compared to the AVHRR LST.
(b) Comparison of the AVHRR LST product with concurrent MODIS LST (MYD11_L2 product) [64]. The comparisons were conducted over the TIMELINE study area and limited by the MODIS lifetime to the years 2003-2014. Again, only MODIS LST and AVHRR LST with good quality input data and an uncertainty smaller than 2 K were compared. The uncertainty threshold is a compromise, ensuring the high quality of LST, but also keeping enough data points for the validation. Only pixels satisfying the following criteria were selected: the VA is not more than 40 • , the VA difference between AVHRR and MODIS is not more than 20 • , and the maximum sun zenith angle (SZA) is 90 • . The maximum slope allowed was 10 • using the 1 km slope GMTEDmd dataset [71]. A maximum time difference of 10 minutes between AVHRR and MODIS acquisitions was permitted, as suggested in [10]. Differences greater than 10 K were filtered out as outliers. After applying the previously mentioned filters, these made up 0.2% of the data points. To reduce the total number of scene comparisons, only AVHRR/MODIS pairs with more than 100,000 valid observations were selected. In total, 905 MODIS scenes were compared to the AVHRR LST, resulting in more than 187,000,000 data points.
(c) Since LST is derived from the TIR radiation emitted by the surface, as seen within the sensor field of view, it is a directional variable [72]. As a directional variable, it is strongly affected by differences in viewing and illumination geometry [73]. As already mentioned, the estimation of LST is influenced by the atmosphere and the emissivity of the surface. To better quantify these influences on the achieved LST accuracy and therefore identify reliable LST ranges and acquisition conditions, all validation results were investigated for different VA and LST classes, as well as regarding different land cover types and TCWV classes.
(d) In order to use the AVHRR LST time series as long-term measurements, the consistency between the different AVHRR sensors has to be assessed. The validation with in situ and MODIS LST covers the period from 2003 to 2014, and thus only parts of the multi-decadal time series and only the later NOAA missions. The comparison to other LST data sources at earlier parts of the time series is not possible as AVHRR was the only hightemporal frequency remote sensing acquisition system, and measurement sites suitable for the validation of remotely sensed LST did not exist. Therefore, a time series analysis of TIMELINE LST at three sites was conducted. The whole time series over the in situ sites DN and EV and the pseudo-invariant calibration site Algeria3 (Lat:30.185/Lon:7.59) was extracted for detailed analysis. Algeria3 was chosen because it is recommended by the CEOS/WGCV/IVOS subgroup for its temporal stability and spatial homogeneity [74]. Due to its arid conditions, only minor cloud contamination is expected, and it is a desert site, meaning variation due to vegetation or soil moisture is minimal. According to Jin and Treadon [75], the diurnal LST cycle depends on geolocation, soil or vegetation properties, and season. To reflect its seasonal variability, the parameters of the DTC model were derived for each month separately. The relatively low monthly resolution was chosen to account for the limited amount of data points in the time series. The DTC parameters were derived using the following procedure: One half of the respective time series was selected using the numpy random choice method [76]. These training samples were then aggregated for every month. For each month, LSTs from almost every NOAA platform were available, which covered large sections of the respective DTCs. For these monthly aggregated LSTs, the parameters of the DTC model described in 2.2.5 were determined by fitting Equation (5) via least squares to the data. Figures S1-S3 show the monthly aggregated LSTs and the resulting monthly DTCs for each site. Only data points before 18 h true solar time were used, so that the attenuation effect of the LST cooling in the evening [62] can be neglected. The other half of the LSTs was used to estimate the monthly error associated with the DTC modeling. To improve the temporal resolution of the parameters, they were linearly interpolated to every day of the year. Equation (6) was then used to normalize the TIMELINE LST time series at all three sites to 14.30 h true solar time. At days on which LSTs from multiple NOAA sensors were recorded, a cross-sensor comparison was conducted.
(e) For characterizing the differences, several validation metrics were used: the root mean square error (RMSE), the mean absolute deviation (MAD), the mean deviation (MD) and the standard deviation of the error (σ). While RMSE and MAD measure the magnitude of the error, MD indicates the direction of the error, often referred to as bias. σ measures the dispersion of the error and is therefore an indicator for the precision. Although RMSE and MAD are quite similar metrics, both were investigated to make this study comparable to studies that use either one.

Comparison to In Situ LST
The comparison with in situ LST comprises TIMELINE LST from the years 2010-2013 derived from NOAA-15, 16, 18 and 19 data. Figure 4 shows TIMELINE LST plotted against in situ LST at the 10 stations. The sites with the most data points are GC and DR. The sites with the fewest data points are EV, for which only in situ data for 2010 were available, as well as FP, PEN and BO. The overall MAD is 1.8 K, the MD is 0.5 K, the RMSE is 2.43 K and σ is 2.38 K. A seasonal pattern is visible: In summer (April-September), the agreement is lower, with a MAD of 2.07 K, an MD of 0.88 K, an RMSE of 2.71 K and a σ of 2.57 K. In winter (October-March), the MAD is 1.54 K, the MD is 0.14 K, the RMSE is 2.12 K and σ is 2.12 K. The highest agreement was reached at DR and GC. The lowest agreement was found at BND. Clear positive MDs were observed at FP, BND, DN and SF. Light positive MDs were observed at EV, PEN and HE, and a fairly balanced MD at DR. Light negative MDs were observed at GC and BO. Figure 4 shows a considerable positive MD of TIMELINE LST at BND, FP and EV at LSTs over 300 K.

Comparison to MODIS LST
The comparison between TIMELINE and MODIS LST was conducted for 905 spatial and temporal matches of MODIS and AVHRR scenes. The scenes were filtered as described in 2.   Figure 6 shows boxplots for the difference between TIMELINE and MODIS LST for each overlap. The plot reveals a general positive bias of TIMELINE LST towards MODIS LST. It shows the seasonal pattern, with lower accordance and higher bias in summer and higher accordance and lower bias in winter. This is supported by the substantially lower RMSE, MAD, MD and σ in winter, as displayed in Figure 7b. Figure 7a-c show the spatial distribution of the difference between TIMELINE and MODIS LST. Generally, the difference increases from north to south, i.e., with higher temperatures. TIMELINE LST is especially higher than MODIS LST in the desert regions of northern Africa and east of the Caspian Sea. As visible in Figure 7a, in summer, the positive bias of TIMELINE LST can also be observed in the steppe and agricultural areas of west, middle and Eastern Europe. Figure 7c shows that the lowest error is found in the deciduous and evergreen forests of central Europe and the boreal forests of northern Europe. Some spatial patterns of the LST difference show direct links to the observed emissivity difference, displayed in Figure 7d. For example, the desert regions in northern Africa and east of Caspian Sea, where a high emissivity and LST difference can be observed. In general, the TIMELINE emissivity is lower than the MODIS emissivity. However, this applies especially to the bare ground areas. Over the boreal forest of Siberia, the emissivity difference is quite low, leading to substantially lower MDs in these regions.

Robustness of the LST Derivation Approach
In the following subchapters the difference between TIMELINE LST and in situ and MODIS LST is analyzed concerning the influencing factors TCWV content, LST level, VA, emissivity and land cover. The influence of the LST and TCWV level on the LST accuracy has been determined in several studies, e.g., [9,52]. Directional effects represented by the VA are described in [4,73,77,78]. The emissivity and land cover are important factors in the LST estimation, which have to be known a priori. For the analysis, the matches between TIMELINE LST and the validation LST were classified for the respective variable. The classes reflect the range and distribution of the respective variable. The boxplots in Figures 8 and 9 show the behavior of the difference for each class. To get a more isolated picture of the impact of the respective variable on the error, the matches with values in the highest classes of the other influencing variables have been filtered out. For example, in Figures 8b and 9b, only matches with an LST smaller than 315 K and a VA smaller than 50 • were used. The boxplots in Figures S5, S6 and Figure 10 show the difference between TIMELINE and MODIS LST for each emissivity class classified for LST, VA and VA difference between AVHRR and MODIS, and for summer and winter. Only the difference between TIMELINE and MODIS LST from 2013 was analyzed in the following. Figure S4 shows TIMELINE LST against MODIS LST in 2013. It shows a very high accordance with an r-value of 0.99, but also the tendency of TIMELINE LST to overestimate at higher LST levels.

Robustness to Variances in LST, TCWV, and VA
The comparisons to MODIS LST and to in situ LST show a higher MAD, RMSE, MD and σ with increasing LST level. As is visible in Figure 8a, high MADs, RMSEs and σ occur at high LST levels at BND, DR, FP and EV. While the MD is stable over the LST levels at GC, PEN and DN, there is a strong positive connection between the LST level and MD visible at BND, BO, SF, FP, EV and DR. Notable high errors occur at LSTs over 315 K. Figure 8b shows no systematic impact of TCWV on the difference between TIMELINE and in situ LST. At BND, PEN and SF, the MD drops at TCWV levels over 30 kg/m 2 . At BO, DR, FP, HE, EV and DN, high TCWV levels only occur at high LST levels, and have therefore been filtered out. Figure 8c shows directional effects at HE, EV and BO, especially at VAs higher than 50 • , with considerably lower MDs at these angles. This effect is more moderate at DR, GC, FP and DN. Because of the positive bias of TIMELINE LST at DN and FP, these sites experience a lower MAD and RMSE at higher VAs. At BND and SF, no systematic impact of the VA is visible. At PEN, there is a positive connection between the VA level and the MD.
In Figure 9a, a clear positive relationship between the difference between TIMELINE and MODIS LST and the LST level can be observed. As with the comparison with in situ LST, MAD, RMSE, MD and σ increase with higher LST levels. Figure S5 shows that this relation is valid for all emissivity classes except class 8 (bare ground), which shows high errors and a high bias at all LST levels. Class 2 is an exception because of known errors during the TIMELINE emissivity estimation. Figure 9b shows a weak positive relationship between the TIMELINE and MODIS LST difference and TCWV. Figure 9c shows almost no relationship between the sensor angle and the difference between TIMELINE and MODIS LST. Figure S6 shows that MAD, RMSE, MD and σ are stable for different VAs and VA differences at all emissivity classes, even at classes with a complex surface, such as forests and shrublands.

Land Cover and Emissivity
As mentioned in Section 2, the emissivity for the TIMELINE LST product is derived from the vegetation cover method [57], while the emissivity for the MODIS LST product and in situ LST is derived from the classification-based method created by [66]. Figure 7d shows that the different methods can lead to absolute differences in emissivity up to 0.04. An especially high difference can be seen in the arid regions with bare ground. Smaller seasonal differences are experienced within the vegetation classes. Figure 9d shows the strong negative dependence of the differences between TIMELINE and MODIS LST and their corresponding emissivity differences, which can be justified by the split-window algorithm: an underestimation of the emissivity automatically leads to an overestimation of LST. Figure 9d shows that the MD for LSTs calculated with an emissivity difference between −0.06 and −0.03 is about 4.4 K, while the MD for LSTs calculated with no emissivity difference is about 1.7 K. Figure 10 shows that MAD, RMSE, MD and σ are higher in summer than in winter for all emissivity classes. The lowest accordance is found with class 8, with a MAD and MD of almost 3 K in summer and around 2.3 K in winter. The best accordance is reached in both forest classes 5 and 6, with MADs of 1.59 K and 1.7 K and MDs of 1.35 K and 1.39 K in summer, and MADs of 1.35 K and 1.41 K and MDs of 1.25 and 1.26 K in winter. It is notable that these classes show the highest seasonal stability. High seasonal differences can be seen for the classes 1, 3, 4 and 7. Figure S7 illustrates the difference between the MODIS emissivity used for the in situ LST calculation at the SURFRAD sites and the TIMELINE emissivity. The stable pattern of the MODIS emissivity is created by the method derived from [66], for which-contrary to the vegetation cover method-vegetation indices are not part of the emissivity derivation formula, but only used to distinguish emissivity classes. However, the absolute difference is very small in the vegetation classes (<0.01). To estimate the propotion of the error between TIMELINE and in situ LST due to the emissivity difference between TIMELINE and MODIS emissvity, in situ LST at the SURFRAD stations was recalculated with TIMELINE emissivity. However, using the alternative procedure would not have led to significantly different results, as displayed in Figure S8. The MAD between TIMELINE and in situ LST would only have changed in the order of 0.04 K. Figure 11 shows the parameters of the DTC models for Algeria3, DN and EV. As expected, the seasonal development of most of the parameters at DN and EV is quite similar because of their geograpical proximity and the similar land cover at these sites. The T 0 (LST at sunrise) is around 280 K at all three sites in winter, which rises to 300 K at Algeria3 and 290 K at DN and EV in summer. The temperature amplitude T a increases from 10 K in winter to 30 K in summer at EV and DN, and from 20 K in winter to 30 K in summer at Algeria3. t m (time of maximum temperature) shows no seasonal pattern, varying around 13 h true solar time. The highest fluctuations at DN are between 13.30 h in February and 12.30 h in the second half of the year. The DTC models show their highest uncertainties in spring and autumn, with RMSEs up to 4.5 K. Lower RMSEs between 2.5 and 3 K are recorded in the winter and summer months. Figure 12 shows the original TIMELINE LST time series at Algeria3, EV and DN. Figure 13 shows these time series normalized to 14.30 h true solar time with Equation (6) and the parameters of Figure 11. It is visible that the across-platform offsets and the orbit drift effects are successfully minimized in the time series. The LST values from NOAA-10, which have been calculated with the mono-window algorithm constructed by [48], fit consistently into the time series. The LST values from the other platforms were calculated with the split-window algorithm devised by [4].

Assessment of the TIMELINE LST Consistency
The LST normalized to 14:30 allows cross-sensor comparisons at days when LST values from multiple platforms were recorded. Figure 14 shows these comparisons for NOAA-9, 10, 11, 12, 14, 15, 16, 17, 18 and 19. The highest accordance is achieved between NOAA-18 and 19, with an RMSE of 1.

Comparison between TIMELINE LST and In Situ LST
Considering the previous accuracy analysis of TIMELINE LST by [4], the error lies in the expected range of 1-3 K. Similar accuracies were reached in a recent study by [79], comparing a global 0.05 • × 0.05 • AVHRR LST product to in situ LST from SURFRAD sites, resulting in RMSEs between 2.25 K and 3.86 K. In studies for different sensors, errors of this range are common, for example: the LST from MODIS data was derived with an average RMSE of 2.65 K compared to in situ measurements by [12], or compared to modeled brightness temperatures with less than 1 K on average by [13]. The LST product for SERVIRI was validated with in situ data from the Karlsruhe Institute of Technology (KIT), resulting in RMSEs between 1.2 and 4.1 K [15]. LST from AASTR was derived with an average RMSE of 3.03 K in comparison with in situ measurements [19].
The MADs at the in situ sites partly reflect their homogeneity at the AVHRR pixel scale, which is illustrated in Figure 3: DR, which is surrounded by arid shrublands, has the lowest MAD, while BND, which is surrounded by a mix of grassland, cropland and trees, has the highest MAD. Ma et al. [79] and Martin et al. [77] observed large differences between the growing and dormancy season around the BND site, leading to high errors during the comparison with remote sensing LST at this site. A negative bias at BO and GC was also observed by [79]. This can be explained by the presence of forest and vegetation near the measurement site, which is in the AVHRR field of view.

Comparison between TIMELINE LST and MODIS LST
The comparison between TIMELINE and MODIS LST resulted in a MAD of 2.34 K, an RMSE of 2.67 K and an MD of 2.21 K. This compares with a relatively small σ of 1.45 K, which indicates a systematic positive bias of TIMELINE LST towards MODIS LST. The previous studies comparing AVHRR LST to MODIS LST resulted in slightly better results, with MADs < 2 K in [4] and an overall MAD of 2.2 K in [78]. However, the observed number of data points by [4] was far smaller, and Frey et al. [78] applied stricter filtering before the validation (maximum observation time difference of 5 minutes, only one land cover in 5 × 5 pixel window, slope < 2 • ). The less strict criteria in our study were chosen in order to achieve a higher general validity.
According to [78], the factors that lead to different LST estimations between LST remote sensing products in general can be divided into two groups. The first group comprises differences in the acquisition conditions, such as observation time or view angle. These factors should be filtered out before the validation. The second group contains differences in the quality of the instruments and differences of the algorithms: this includes the sensor calibration and atmospheric correction, which is, in the case of the MODIS and TIMELINE LST product, the split-window or mono-window algorithm, respectively. Furthermore, it includes cloud detection and emissivity estimation.
As probable factors explaining the positive bias of TIMELINE LST, inconsistencies over the LST range were identified. The reasons for this will be further assessed in future versions of the product. On the other hand, a strong negative relationship between the differences in LST and the differences in emissivity were identified. The reasons for this bias lie in the emissivity algorithms and the respective assumptions about the land cover. Another factor could be the tendency of the MODIS LST product to generally underestimate LST. This is indicated by the study [12], who observed an MD of −0.93 K when comparing ten years of Collection 5 Aqua/MODIS LST to in situ measurements. For Collection 6, the MODIS LST product was improved mainly over bare soil. The validation of Collection 6 MODIS LST over the SURFRAD sites and EV conducted by [14] for the years 2004 and 2005 showed that, for most stations, the MD is still negative. The direct comparison at the respective sites in Table 2 shows that, except for BO, the MD between TIMELINE and in situ LST is around 2 K higher than the MD between MODIS and in situ LST. This fits very well with the observed positive MD between TIMELINE and MODIS LST in the emissivity class 3 (croplands and grasslands, see Figure 10), which is also in the order of 2 K. Despite the fact that the MODIS LST product experiences the same uncertainties as every remote sensing LST product, because of its long history of development and validation, it is expected to produce a very high quality of LST.

Robustness to Variances in LST, TCWV and VA
One aim of the TIMELINE LST algorithms is to minimize the LST error due to LST level, TCWV and VA. During the assessment of the revised algorithms by [4], some differences regarding these variables were still observed. For the split-window algorithm, the MAD was 0.7 K higher at the TCWV level of 50 kg/m 2 than at 10 kg/m 2 . Especially, at high TCWV levels in combination with high Vas, high MADs occurred. Contrary to our results, the MAD over different LST levels was quite stable. Table 2. MDs between MODIS and in situ LST experienced by [14] and between TIMELINE and in situ LST.

Station Name
Mean MODIS LST-In Situ LST (K) At least for the validation with MODIS LST, the LST level seems to be a determining factor in the resulting error and uncertainty (MAD at >315 K is 1.7 K higher than at 270-285 K). Some in situ site-dependent effects could be explained by seasonal differences in land cover, e.g., at BND. Leaving this special case aside, the higher uncertainties and errors at high LSTs seen in both validations are probably the reason for the seasonal pattern of the differences. The tendency for higher errors with higher LST levels and a positive bias of split-window algorithms for AVHHR is well known, and has been observed, among others, by [4,9]. This behavior was not observed during the evaluation of the split-window algorithm for MODIS LST [12,13,80]. The reasons for the remaining inconsistency over the LST range possibly lie in the coefficient retrieval process, specifically in the underlying Seebor database.

Mean TIMELINE LST-In
The comparison with MODIS LST showed only a weak positive dependence of MAD, RMSE, MD and σ on TCWV levels (the MAD at 30-50 kg/m 2 is only 0.4 K higher than at 0-10 kg/m 2 ). The split-window algorithms for TIMELINE and for MODIS LST apply different coefficients depending on the TCWV levels. This similar approach may also be the reason why TCWV content has only a weak impact on the difference between TIMELINE and MODIS LST. The comparison with in situ LST showed site-dependent behavior. Noticeable is the drop in the MD at BND, PEN and SF, which is still to be examined. At sites with warmer and more arid climates (BO, DR, FP, HE, EV and DN), the TCWV level is connected to the LST level, which makes an isolated analysis of the impact of these variables challenging. The Era Interim TCWV product has a resolution of 79 km, which may lead to inaccuracies in TCWV on the AVHRR scale. However, the sensitivity analysis by [4] showed that only TCWV errors in the order of 15 kg/m 2 have an impact on the LST error. This variance in TCWV is not common at the 79 km scale. Therefore, an impact of the different spatial scales of the TCWV and AVHRR data on the LST error is not expected.
Frey et al. [78] stressed that allowing higher sensor view angles and view angle differences leads to a visible increase in difference between AVHRR and MODIS LST. Therefore, the comparison between TIMELINE and MODIS LST was only conducted over a quite narrow range of VAs (0-40 • ) and VA differences (0-20 • ). The difference between TIMELINE and MODIS LST showed a high stability over these ranges for all emissivity classes. The comparison with in situ LST showed medium directional effects at HE, EV and BO, and light directional effects at DN and FP. The highest impact was visible at VAs over 50 • . Martin et al. [77] observed similar effects at EV, which are explained by tree shadows at this site. Furthermore, they observed directional effects at BO and DR because of the high slopes around the site, and at FP, GC and DN because of vegetation. Ermida et al. [73] stated that viewing geometry effects have the greatest impacts on surfaces with high contrasts in the temperatures of the various surface elements, such as savannalike landscapes, which applies to HE, DN and EV. The surprising positive connection between the VA level and the MD at PEN is still to be examined.

Land Surface Emissivity
The vegetation cover method was developed for the AATSR 11 µm and 12 µm bands, which have a similar measurement spectrum as the AVHRR thermal bands (10.3-11.3 µm, 11.5-12.5 µm). The method by [66] was developed for the MODIS thermal bands 31 and 32, which have a similar but narrower measurement spectrum (10.8-11.3 µm, 11.8-12.3 µm). Both the vegetation cover method devised by [57] and the method by [66] define emissivity values based on spectral measurements conducted for the vegetation and ground portion of the respective land cover class. However, while the formula devised by [57] integrates the seasonal variation in vegetation through the FVC, the method by [66] uses a priori information about the seasonal variation to distinguish emissivity classes. Directional effects on emissivity are only considered by [66].
The analysis showed a high impact of the emissivity difference on the difference between TIMELINE and MODIS LST. Frey et al. [4] analyzed the sensitivity of the Becker and Li split-window algorithm to emissivity, and observed that a deviation of 0.015 can lead to LST deviations of more than 1 K. This result can be confirmed by our study.
The emissivity estimated for TIMELINE LST is generally lower than the emissivity for MODIS LST. However, the difference is especially extreme in the emissivity class 8 (bare rock), which has the lowest emissivity level, whereas in the forest classes 5 and 6, the TIMELINE emissivity in summer is even higher than MODIS emissivity. This leads on the one hand to a strong positive MD and high MAD in arid regions, and on the other hand to a quite balanced MD and a low MAD over the forests of central and north Europe. Frey et al. [78] assumed an overestimation of emissivity by the MODIS product in these regions. Wan et al. [13] observed a serious underestimation of LST by the previous version of the MODIS LST product over arid regions, which led to further refinements in the split-window algorithm and emissivity estimation, resulting in an emissivity increase. Besides the dependence of the error on the LST level, the emissivity difference is the main driver of the distinct spatial distribution of the error, with low errors in the north of the TIMELINE study area and higher errors in the south.
Caselles et al. [57] based their method on measurements from the ASTER spectral library [81], while [66] based their method on measurements by [82]. The results for the respective emissivity class are quite different: according to [57], the mean emissivity in both TIR bands of class 8 is 0.94, while according to [66] it is 0.969. Smaller differences can be found in other classes. These differences have a direct impact on the LST differences of the two datasets. This illustrates the problem that emissivity can hardly be measured directly from space, but is a very sensitive input parameter for the split-window algorithms.
Besides the different methods of emissivity estimation, the different input datasets used for the estimation should be noted. While the MODIS emissivity algorithm uses the daily land cover information of the MCDLC1KM product, TIMELINE LST uses land cover classifications from 1992, 2005 and 2009. In return, the FVC, as part of the vegetation cover method, provides land cover information for the respective AVHRR scene. In addition, there are different spatial scales of AVHRR and MODIS. For the validation with in situ LST, the LST error caused by emissivity difference is superimposed by different LSTs within the AVHRR pixel.

Time Series Consistency
While the comparison to MODIS and in situ LST showed no inconsistencies between the platforms NOAA-15, 16, 18 and 19, earlier platforms, and especially the mono-window algorithm by [48], are not included in the validation. Therefore, the consistency of the TIMELINE LST time series was analyzed at one desert and two grassland sites. The orbit drift was corrected with a DTC model based on [62], whose parameters where derived from the multi-temporal LSTs in the time series itself. The modeled diurnal temperature ranges show good correspondence to the physical derived ranges used by [83]. However, the overall RMSE of the DTC models of 3.4 K is quite high, which is probably due to the low monthly resolution of the DTC parameters. After normalizing LST to 14:30 h true solar time, a consistent time series from 1981 to 2018, comprising LSTs from NOAA 7,9,10,11,12,14,15,16,17,18 and 19, could be produced at the three sites. The time series indicates that the combination of the mono-window algorithm by [48] and the split-window algorithm by [4] is suitable for building a consistent LST time series for AVHRR/1, 2 and 3. The following cross-sensor comparison resulted in RMSEs between 1.72 K and 5.39 K, and a mean RMSE of 2.38 K. As there is a positive relationship between the RMSEs and the observation time difference, the errors are mainly introduced by the DTC model. Positive MDs of NOAA-11 and NOAA-16 observations compared to contemporaneous observations were observed. However, it cannot be ruled out that these biases are also introduced by the DTC model.
The biggest advantage of the pixel-based application of the DTC model is its simplicity: the assessment is not complicated by atmospheric, emissivity or vegetation differences. The downside is that, because the parameters are retrieved by the time series itself, not only spurious but also real trends and transitions are partly eliminated. The pixel-based orbit drift corrections of AVHRR were among others conducted by [84][85][86], using the correlation between SZA and LST anomalies for linear or polynomial regression models. A more general approach is given by [75,83], creating a lookup table for typical DTCs for different land covers. However, these approaches require external data sources, which introduce new inconsistency and uncertainties. A promising method is proposed by [63] using neighboring pixels to retrieve the parameters of the DTC model, resulting in an increase in RMSE of only 0.1 K compared to the original validation RMSE.

Future Directions
This study provides an extensive validation of the TIMELINE LST product. However, additional studies are necessary to further improve the product. The validation with in situ LST should be extended to more sites and should cover all important land cover types, including forest and bare ground. Furthermore, the observed dependence of the error on LST and TCWV level should be addressed, which is challenging because calibration and atmospheric correction are done simultaneously through coefficients in the split-window algorithm. Additionally, a consistent emissivity estimation is still a big issue for the harmonization of multiple LST products. A validation of TIMELINE LST before the year 2000 and, therefore, a validation of the mono-window algorithm by [48] is not possible due to the lack of suitable earth observation-based and in situ LST datasets for this period. The analysis of the time series itself and the comparison of trends with historical data, such as near-surface air temperature, offers the possibility of assessing the quality and uncertainty of earlier parts of the time series. A comprehensive collection of historical temperature data is provided in the ERA5 dataset [56], for example. For this study, the TIMELINE Level 2 LST product was used. A comparison of the TIMELINE Level 3 product to other LST maps derived from AVHRR is planned. This product consists of daily, 8-day and monthly composites of quality-assured LST. The composites contain the respective maximum, minimum, median and mean LSTs and their corresponding observation times expressed in true solar time. The updated quality flags contain additional information about slope, VA and snow cover. Building on that, a Level-3B product is planned, which will contain composites of daytime-normalized LST with its associated uncertainties. For that, stable and accurate daytime normalizations of orbit drift correction models are necessary, which should be based not on the time series itself, but on more general parameters, such as vegetation cover, soil moisture, day of the year and location. The unique collection of the TIMELINE vegetation and thermal products, all acquired under the same conditions and processed in the same unifying framework, resolution, extent and format, has great potential to support the development of these kinds of models. Besides LSTs, SSTs can be derived from AVHRR brightness temperatures using the method devised by [4], which is achieved with the new TIMELINE SST product. Missing the uncertainties due to emissivity and being more spatially and temporally invariant than LST, the validation of TIMELINE SST can reveal new insights and offer a clearer picture of this method.

Conclusions
Satellite-derived LST has become an indispensable tool for tracing climate change occurring over the last few decades. Being recognized as one of the Essential Climate Variables (ECVs) by the World Meteorological Organization, LST can be directly used to derive global warming trends and anomalies. It is a key parameter in the energy exchange on the Earth's surface. Further applications include droughts and plant stress, epidemiology, urban heat islands, land surface models and land cover classifications. However, climate-relevant statements can only be derived with daily and multi-decadal observations. This unique temporal coverage and resolution is only provided by one single sensor, the AVHRR.
Within TIMELINE, a new consistent LST product from 40 years of AVHRR observations over Europe and North Africa is generated with the method by [4]. The comprehensive preprocessing includes the correction of AVHRR characteristic detractions, such as geometric distortions, calibration drift, orbit drift and sensor degradation. Moreover, cloud, water, and snow masks are applied. The subsequent LST algorithm accounts for varying spectral responses of different AVHRR sensors and variances in LST, TCWV and VA, which have often been identified as interfering factors in the LST estimation.
To make climate-relevant and valid statements, the accuracy and consistency of LST is a crucial factor. The aim of this study was to assess the accuracy and consistency of this product. The validation approach consisted of the following aspects: (a) A comparison of in situ LST at 10 measurement sites in Europe, North America and southern Africa with different land cover types (grassland, cropland, shrubland, savannah). In total, 2402 data points between 2010 and 2013 were analyzed. The following results were obtained: (a) The comparison with in situ LST resulted in MADs between 1.19 K and 2.81 K, and RMSEs between 1.54 K and 3.68 K, which is in the accuracy range of other LST studies. Besides the site-dependent differences, a seasonal difference was observed with a MAD of 2.07 K in summer and a MAD of 1.54 K in winter. (b) The comparison with MODIS LST resulted in an MAD of 2.34 K, an RMSE of 2.67 K and an MD of 2.21 K, indicating a systematic positive bias of TIMELINE LST towards MODIS LST. Again, seasonal differences were observed with an MAD of 2.55 K in summer and 1.95 K in winter. (c) The LST levels and the emissivity difference between the TIMELINE and MODIS product had the highest impact on the error. Directional effects were observed over a few in situ sites, and can be explained by vegetation in the AVHRR field of view. (d) Normalizing LST to 14.30 h true solar time, a consistent TIMELINE LST time series at DN, EV and Algeria3 could be produced. The subsequent cross-sensor comparison for observations on the same day resulted in RMSEs between 1.72 K and 5.39 K. The magnitude of the error was highly related to the observation time difference.
Both validations showed a seasonal pattern with a higher error, MD and σ in summer, which could be explained by seasonal land cover changes at the in situ sites or by higher LST levels in summer. The differences between TIMELINE and in situ LST at some in situ sites were partly caused by the inhomogeneous land cover on the AVHRR pixel scale. However, using in situ measurements in the validation was important because comparisons between remote sensing systems are often biased due to their similar retrieving methods. Further studies are necessary to expand the in situ validation to all relevant land cover types, including forest and bare rock. The differences between TIMELINE and MODIS LST are probably a result of the difference between split-window algorithms and emissivity estimation methods. Further research is necessary to isolate these errors and their sources. In particular, a better agreement of the estimated emissivity is crucial to remove the uncertainties between remote sensing LST products. Suitable LST datasets for comparison are only available for years after 2000, which limits the validation of the earlier AVHRR sensors and the mono-window algorithm by [48] to the analyses of the time series itself. Accurate and stable methods for daytime normalization are therefore crucial, not only for the validation but also for future applications of the TIMELINE LST product.
As for comparable AVHRR LST products, there is still some way to go to reach the target accuracy of 1 K, defined as the LST product user requirement for climate-related studies [10]. However, this study presents a consistent LST product, which contributes to the understanding of the climate and environmental change taking place in Europe and North Africa in the last 40 years.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/rs13173473/s1, Figure S1: LST at Algeria3 aggregated per month by platform and the resulting DTCs, Figure S2: LST at EV aggregated per month by platform and the resulting DTCs, Figure S3: LST at DN aggregated per month by platform and the resulting DTCs, Figure S4 Figure S6: Difference between TIMELINE and MODIS LST classified by VA, absolute VA difference between AVHRR and MODIS, and the emissivity classes devised by [57], Figure S7: SURFRAD emissivity and TIMELINE emissivity at the vegetated sites BND, FP, GC and SF, Figure S8: TIMELINE LST against in situ LST at all 10 in situ stations (as Figure 4), but this time in situ LST at the SURDRAD stations was estimated with TIMELINE emissivity, Table S1: CCI land cover classes and their corresponding emissivity classes.