Bayesian cloud detection over land for climate data records

: Cloud detection is a necessary step in the generation of land surface temperature (LST) climate data records (CDRs) and affects data quality and uncertainty. We present here a sensor-independent Bayesian cloud detection algorithm and show that it is suitable for use in the production of LST CDRs. We evaluate the performance of the cloud detection with reference to two manually masked datasets for the Advanced Along-Track Scanning Radiometer (AATSR) and ﬁnd a 7.9% increase in the hit rate and 4.9% decrease in the false alarm rate when compared to the operational cloud mask. We then apply the algorithm to four instruments aboard polar-orbiting satellites, which together can produce a global, 25-year LST CDR: the second Along-Track Scanning Radiometer (ATSR-2), AATSR, the Moderate Resolution Spectroradiometer (MODIS Terra) and the Sea and Land Surface Temperature Radiometer (SLSTR-A). The Bayesian cloud detection hit rate is assessed with respect to in situ ceilometer measurements for periods of overlap between sensors. The consistency of the hit rate is assessed between sensors, with mean differences in the cloud hit rate of 4.5% for ATSR-2 vs. AATSR, 4.9% for AATSR vs. MODIS, and 2.5% for MODIS vs. SLSTR-A. This is important because consistent cloud detection performance is needed for the observational stability of a CDR. The application of a sensor-independent cloud detection scheme in the production of CDRs is thus shown to be a viable approach to achieving LST observational stability over time.


Introduction
Satellite data are used to produce climate data records (CDRs) for 21 essential climate variables (ECVs) [1] within the European Space Agency (ESA) Climate Change Initiative (CCI) Program [2,3].These data records of 15-40 years enable the assessment of long-term trends in geophysical variables that can be attributed to changes in the Earth's climate, e.g., [4][5][6].For many CDRs, cloud detection is an essential pre-processing step in data production [7][8][9][10].Clouds and aerosols modify the Earth's radiance as viewed from space at thermal infrared and reflectance wavelengths.As such, many CDRs require the identification of data that are adequately 'clear' for valid retrieval of surface or atmospheric properties [9,10].The desired endpoint of this process is retrieval-dependent: surface property retrievals will typically discard observations not classified as clear-sky, while the retrieval of cloud or aerosol properties will necessarily require identification of clouds."Clear-sky" in this paper is shorthand for "containing negligible cloud and aerosols for the purpose of valid LST retrieval".
Long data records are essential for assessing climate changes [3] and CDRs are typically constructed using data from a series of satellite instruments, with a typical lifetime of 3-8 years.In order to identify long-term signals, data stability is essential [11,12], which includes consistency in the cloud detection applied to different sensors within a CDR and consistency in the retrieval methods.
Here we focus on cloud detection to retrieve land surface temperature (LST).Bayesian methods have been successfully applied to the production of sea surface temperature (SST) CDRs [7,13] from many different sensors, with very good results compared to thresholdbased techniques [14].Bayesian methods are, therefore, an ideal candidate for cloud detection over land, where identification of appropriate cloud detection thresholds for threshold-based techniques is further complicated by underlying heterogeneous land cover types.
In this paper, we present the Bayesian cloud detection formulation used for SST retrieval [7] adapted for LST retrievals.Cloud detection algorithms typically work on the premise that clouds are both bright and cold [15,16], enabling them to be distinguished from the underlying Earth surface at thermal infrared and reflectance wavelengths.This premise does not always hold (even over the ocean, which is typically dark and relatively warm [17]) and over land is further complicated by heterogeneous land cover, potentially significant diurnal cycles in surface temperature, and surfaces that are themselves both cold and bright, for example, snow-and ice-covered ground [18].These particular challenges make a fully Bayesian approach to calculating a probability of clear-sky well-suited to the problem, because of the dynamic aspect of the cloud detection which exploits prior information that accounts for spatiotemporal variability [19].The Bayesian formulation is sensor-independent and directly accounts for the differences in instrumental spectral response functions through radiative transfer simulations.It can easily be applied to multiple sensors used to create a long-term CDR.
The remainder of this paper proceeds as follows: in Section 2, we describe the Bayesian cloud-detection algorithm and the algorithmic developments relevant to cloud screening over land.In Section 3, we evaluate the cloud detection performance with respect to manually masked satellite imagery using data from the Advanced Along-Track Scanning Radiometer.In Section 4, we discuss the application of Bayesian cloud detection to data from other instruments and the benefits of this approach to the generation of CDRs using data from multiple sensors.We conclude the paper in Section 5.

Bayesian Cloud Detection over Land
The Bayesian approach to cloud detection as applied to SST retrieval has been widely documented [7,13,14,17,19], so we present here only a brief overview.This overview covers the main points relevant to understanding the evolutions of the algorithm for use over land, which is the subject of this paper.
Bayes' theorem, as applied to the cloud detection problem, can be used to calculate the probability that an observation corresponds to clear-sky (P c y o , x b ), given knowledge of the observation vector (y o ) and the prior background state (x b ) as shown in Equation (1).
The observation vector contains the satellite observations for all channels used in cloud detection.At night, this includes data from infrared wavelengths only, typically centered on 3.7, 11, and 12 µm (where these channels are available for a given instrument).During the day, reflectance channels can also be used, for example, wavelengths of 0.6, 0.8 and 1.6 µm, in addition to the 11 and 12 µm infrared wavelengths.The background state vector contains information on the prior surface and atmospheric conditions specified by the ERA-5 numerical weather prediction (NWP) analysis fields [20].P(c) and P(c) are the prior probabilities of not-clear and clear-sky conditions respectively.The ERA-5 NWP cloud fraction informs the prior probabilities.However, the cloud fraction is only a proxy for probability, and the probability of clear-sky is constrained to the range of 0.5-0.95, to ensure that the Bayes' calculation is not fully pre-conditioned by the prior [14].To put it another way, the probability of there being a cloud in an observation when the NWP cloud fraction is zero may well be low, but it is not zero.P(y o x b , c) is the probability of the observations given the background state vector under clear-sky conditions.It is calculated from the difference between observations and the prior clear-sky simulations of the fast forward model RTTOV [21], in light of the expected uncertainties.For cloudy conditions, P(y o x b , c) is determined using an empirical look-up table, developed as a pragmatic alternative to more computationally expensive cloudy-sky simulations [22].
The remainder of this section details the evolutions and settings particular to optimizing performance over land.All modifications are made with a view to producing multi-sensor CDRs.Sensor-specific dependencies are limited to the spectral response functions (accounted for in the radiative transfer simulations) and the uncertainty assumptions.

Radiative Transfer and Numerical Weather Prediction (NWP) Data
Radiative transfer simulations of clear-sky conditions use version 12.3 of the fastforward model RTTOV [21], with the latest RTTOV coefficients (LBLRTM v12.8, v9 predictors) [23].RTTOV is used to simulate observations at both reflectance and infrared wavelengths, with prior surface and atmospheric conditions constrained by the closest NWP analysis data from the hourly ERA-5 products [20].For channels with solar reflectance (visible and near-infrared), a Rayleigh single-scattering approximation is used [24].
The radiative transfer model is run at atmospheric profile locations commensurate with the resolution of the NWP data (~30 km), with the outputs then bilinearly interpolated to the higher resolution of the satellite imagery (typically ~1 km at nadir).Higherresolution surface data are used to specify the local conditions for a given satellite observation (Section 2.4).

Uncertainty Specification
The radiative transfer model outputs are used in the following equation to calculate the probability of observations given the background state vector for clear-sky conditions.
Uncertainties in the background state are propagated through the model by the H T BH + R term.B contains the background error covariance matrix, calculated for a reduced state vector: surface temperature, total column water vapour (TCWV), and aerosol optical depth (AOD, discussed further in Section 2.6).The uncertainty in the surface temperature is given by the spread in the ERA5 skin temperature ensemble.TCWV uncertainty is described by an exponentially decaying curve from 45% to 5% over a range of 0-65 kg m −2. .Tropospheric dust aerosol uses an uncertainty of 10% and stratospheric sulphate an uncertainty of 32%.Off-diagonal terms of the B matrix are set to zero.The H matrix contains the forward model tangent linears for each satellite observation channel used in the cloud detection and each element of the reduced state vector.The R matrix contains two components: R m , which characterizes the error in the forward model and R o , which characterizes the error in the satellite observations.R o contains the noise equivalent differential temperature (NEdT) for each satellite observation used in the cloud detection.The model uncertainty, R m , is of order 0.15-0.17K for the infrared channels, 9% for the 1.6 and 0.8 µm reflectance wavelengths and 8% for the 0.6 µm wavelength.The reflectance wavelength uncertainties were calculated by assessing observation minus simulation differences for clear-sky MODIS scenes.

Gaussian Mixed Model
Over land, we found that the ERA5 NWP reanalysis data does not fully capture the extent of the diurnal warming over surfaces such as deserts.Our clear-sky PDFs as simulated using RTTOV have a Gaussian distribution and as such, cases where the observations were much warmer than the simulation (due to an underestimated prior surface temperature) could fall outside of the clear-sky PDF and be erroneously flagged as cloud.To address this, we use a Gaussian mixture model to extend the PDF on the warm side.This calculates a second Gaussian, centered on the sum of the prior surface temperature and the associated uncertainty.This second Gaussian has a sigma twice as large as the uncertainty on the original Gaussian.Adding the two Gaussians, as illustrated in Figure 1 (in a toy example), extends the clear-sky PDF on the warm side whilst having a small impact on the cold side of the original Gaussian distribution.This makes the Bayesian cloud detection more robust to the underestimation of the diurnal temperature cycle in the ERA5 data.

Gaussian Mixed Model
Over land, we found that the ERA5 NWP reanalysis data does not fully capture the extent of the diurnal warming over surfaces such as deserts.Our clear-sky PDFs as simulated using RTTOV have a Gaussian distribution and as such, cases where the observations were much warmer than the simulation (due to an underestimated prior surface temperature) could fall outside of the clear-sky PDF and be erroneously flagged as cloud.
To address this, we use a Gaussian mixture model to extend the PDF on the warm side.This calculates a second Gaussian, centered on the sum of the prior surface temperature and the associated uncertainty.This second Gaussian has a sigma twice as large as the uncertainty on the original Gaussian.Adding the two Gaussians, as illustrated in Figure 1 (in a toy example), extends the clear-sky PDF on the warm side whilst having a small impact on the cold side of the original Gaussian distribution.This makes the Bayesian cloud detection more robust to the underestimation of the diurnal temperature cycle in the ERA5 data.

Surface Characterisation
Surface emissivity at the satellite image pixel location is specified using the combined ASTER MODIS Emissivity over Land (CAMEL) v002 monthly climatology compiled from observations made between 2000-2016 [25], which is specifically designed to be integrated with RTTOVv12.3simulations.It has a spatial resolution of 0.05 × 0.05°, with an improved representation of fractional snow cover compared with v001 [25].Surface reflectance is defined using the bidirectional reflectance distribution function (BRDF) atlas developed for RTTOV v11 [26].The atlas is at 0.1 resolution, covering wavelengths of 0.4-2.5 μm and derived from the MODIS BRDF-kernel product [26].
Surface elevation is specified using the ASTER Global Digital Elevation Model (DEM) [27].The native resolution of the DEM is 1 arc-second in latitude and longitude (~30 m) and this has been regridded globally to a resolution of 1/125 degrees (~900 m), commensurate with the resolution of observations from many polar-orbiting satellites used for the generation of LST CDRs (e.g., ATSR-2, AATSR, MODIS and SLSTR).

Cloudy Probability Distribution Functions (PDFs)
Empirical probability density function (PDF) look-up tables (LUTs) are used to specify (  |  , ̅ ), the probability of cloud (not clear) given the observations and background state vector.For cloud detection over land, new sets of PDF LUTs were defined using 11 years of data from the MODIS Terra instrument aboard the Earth Observing System (EOS).MODIS Terra is a good choice for deriving LUTs due to the long data record, a

Surface Characterisation
Surface emissivity at the satellite image pixel location is specified using the combined ASTER MODIS Emissivity over Land (CAMEL) v002 monthly climatology compiled from observations made between 2000-2016 [25], which is specifically designed to be integrated with RTTOV v12.3 simulations.It has a spatial resolution of 0.05 × 0.05 • , with an improved representation of fractional snow cover compared with v001 [25].Surface reflectance is defined using the bidirectional reflectance distribution function (BRDF) atlas developed for RTTOV v11 [26].The atlas is at 0.1 • resolution, covering wavelengths of 0.4-2.5 µm and derived from the MODIS BRDF-kernel product [26].
Surface elevation is specified using the ASTER Global Digital Elevation Model (DEM) [27].The native resolution of the DEM is 1 arc-second in latitude and longitude (~30 m) and this has been regridded globally to a resolution of 1/125 degrees (~900 m), commensurate with the resolution of observations from many polar-orbiting satellites used for the generation of LST CDRs (e.g., ATSR-2, AATSR, MODIS and SLSTR).

Cloudy Probability Distribution Functions (PDFs)
Empirical probability density function (PDF) look-up tables (LUTs) are used to specify P(y o x b , c) , the probability of cloud (not clear) given the observations and background state vector.For cloud detection over land, new sets of PDF LUTs were defined using 11 years of data from the MODIS Terra instrument aboard the Earth Observing System (EOS).MODIS Terra is a good choice for deriving LUTs due to the long data record, a wide range of available channels, and an equator overpass time of 10:30, similar to other instrument series (e.g., ATSRs, SLSTRs).Thus, a single set of LUTs can be used to apply Bayesian cloud detection to a wide range of sensors.
Table 1 shows the three PDF LUTs used as applied to the ATSR-2, AATSR, MODIS and SLSTR sensors in this study.Infrared and reflectance channels are treated independently, ensuring an adequate population of the PDF observation space.There are two infrared channel PDFs, as the 3.7 µm channel is not used during the day due to solar contamination.The PDF LUTs are multidimensional, binned by observations, viewing geometry and surface characteristics.Over land, the radiative response of the Earth's surface will be dependent on the land cover and as such, the PDFs are binned by biome, using the land cover classification from the European Space Agency (ESA) LST Climate Change Initiative (CCI).Land cover is stratified using the ESA Land Cover CCI classification [28], coupled with extended differentiation of bare soil types using the ATSR Land Biome Classification [29].Table 2 summarizes which of the classes from each classification fall within the condensed classification used to generate the PDF LUTs. Figure 2 demonstrates the biome dependence of the empirical PDFs.It shows the difference in the 11.0 µm minus NWP surface temperature against the 11.0-12.0µm channels for daytime and near-nadir observations.The shape of the PDF is biome-dependent, with larger differences in the 11.0-12.0µm channels for urban areas, cropland, and flooded ground than over bare soil and snow-and-ice surfaces.The largest positive differences in the 11.0 µm minus NWP surface temperature dimension are seen for snow and ice, bare soil and shrubland.As viewed from this slice, the cropland, flooded ground and urban PDFs were the most compact, the highest densities occurred where the 11.0-12.0µm differences were close to zero and the 11.0 µm minus NWP surface temperatures were slightly negative.
Figure 2 demonstrates the biome dependence of the empirical PDFs.It show difference in the 11.0 μm minus NWP surface temperature against the 11.0-12.0μm nels for daytime and near-nadir observations.The shape of the PDF is biome-depen with larger differences in the 11.0-12.0μm channels for urban areas, cropland flooded ground than over bare soil and snow-and-ice surfaces.The largest positive d ences in the 11.0 μm minus NWP surface temperature dimension are seen for snow ice, bare soil and shrubland.As viewed from this slice, the cropland, flooded ground urban PDFs were the most compact, the highest densities occurred where the 11.0 μm differences were close to zero and the 11.0 μm minus NWP surface temperatures slightly negative.Empirical PDFs derived from MODIS data can be used when processing other sors by applying a spectral shift to the observations prior to the look-up in order to them 'look like' MODIS [14].The spectral shifts are calculated using RTTOV 137 NWP profile data, including two sets of profiles: 5000 diverse in temperature and diverse in specific humidity [30].Of the 10,000 profiles in these two datasets, app mately 2000 are located over land, and these are used to calculate the shifts.Each sh piecewise linear in total column water vapour, with a dependency on atmospheric length.Shifts are provided for each channel, with reflectance wavelength shifts havi additional dependence on the solar zenith angle.For infrared wavelengths, the sh provided in the form of an offset, and for reflectance wavelengths, a scale factor.T spectral shifts are stored in a look-up table and can be applied prior to indexing the pirical PDFs.

Aerosol Characterisation
Aerosol information comes from two sources: tropospheric aerosol from the C nicus Atmosphere Monitoring Service (CAMS) reanalysis [31] and the stratospheri phate aerosol dataset used in ESA SST CCI [7] for periods affected by major volcanic tions (1982-1984 and 1991-1993).The CAMS aerosol climatology [32,33] provides mo mean profiles of aerosol mass mixing ratios for sea salt (three size bins), mineral (three size bins), organic matter and black carbon.These are interpolated to the loc Empirical PDFs derived from MODIS data can be used when processing other sensors by applying a spectral shift to the observations prior to the look-up in order to make them 'look like' MODIS [14].The spectral shifts are calculated using RTTOV 137-level NWP profile data, including two sets of profiles: 5000 diverse in temperature and 5000 diverse in specific humidity [30].Of the 10,000 profiles in these two datasets, approximately 2000 are located over land, and these are used to calculate the shifts.Each shift is piecewise linear in total column water vapour, with a dependency on atmospheric path length.Shifts are provided for each channel, with reflectance wavelength shifts having an additional dependence on the solar zenith angle.For infrared wavelengths, the shift is provided in the form of an offset, and for reflectance wavelengths, a scale factor.These spectral shifts are stored in a look-up table and can be applied prior to indexing the empirical PDFs.

Aerosol Characterisation
Aerosol information comes from two sources: tropospheric aerosol from the Copernicus Atmosphere Monitoring Service (CAMS) reanalysis [31] and the stratospheric sulphate aerosol dataset used in ESA SST CCI [7] for periods affected by major volcanic eruptions (1982-1984 and 1991-1993).The CAMS aerosol climatology [32,33] provides monthly mean profiles of aerosol mass mixing ratios for sea salt (three size bins), mineral dust (three size bins), organic matter and black carbon.These are interpolated to the location of the NWP analysis fields and the full profile is included in the background state vector passed to RTTOV, which includes support for all CAMS aerosol species.Total column dust mass and stratospheric sulphate are included in the reduced state vector, so their uncertainties (10% and 32%, respectively) will affect the shape of the clear-sky PDF (see Section 2.2).
When processing data that overlap the full CAMS analysis (2003 onwards), it is also possible to download the daily column integrated masses and scale the climatological aerosol profiles to match the daily data.This is currently done for the three mineral dust components, while other components use the climatological values.

Results
The performance of the Bayesian cloud mask can be assessed with reference to a manual cloud mask generated by expert inspection of satellite imagery.Two manually masked datasets are currently available for Advanced Along-Track Scanning Radiometer (AATSR) imagery.The first was generated as part of the SYNERGY project, which developed a cloud mask scheme using AATSR data in conjunction with observations from the Medium Resolution Imaging Spectrometer (MERIS) aboard the same satellite platform [34,35].The dataset consists of 21 daytime scenes between 2002-2007 over five different regions: Abracos Hill (Brazil), Cart Site (North America), Ouagadougou (Burkino Faso), Mongu (Zambia) and Tomsk (Russia) [19].The second dataset was developed within the European Space Agency (ESA) Data User Element (DUE) GlobTemperature project as part of a cloud-detection round-robin exercise [36].This dataset consists of 10 scenes between 2007 and 2011, including both day and nighttime data, one scene over ice in Antarctica and a dust-affected case over Algeria.The details of the 31 scenes used to evaluate the cloud mask performance are given in Table 3, summarized from more detailed information provided in [19,36].Collectively these datasets are designed to cover a wide range of atmospheric and surface conditions, different solar illuminations and differing cloud types.Data from these 31 scenes were used to calculate performance metrics [20,32] for the Bayesian cloud detection in comparison with the operational Standard ATSR Cloud Detection (SADIST) cloud mask [37,38], which used a series of threshold tests to determine the presence of cloud-affected observations.The performance metrics are as follows: (1) the percentage of perfect classification (the percentage of pixels across the whole scene that are correctly classified), (2) the hit rate (the percentage of cloudy pixels that are correctly masked), (3) the false alarm rate (the percentage of clear-sky pixels falsely flagged as cloud) and ( 4) the true skill score (hit rate minus the false alarm rate).The results given in Table 4 show that the Bayesian cloud detection consistently improves on the operational cloud mask across all four metrics.The percentage of perfect classification across all scenes is 86.6% for the Bayesian mask compared with 81.0% for the operational mask.The hit rate is significantly higher for the Bayesian mask (93.9% compared with 86.0%), whilst the false alarm rate is lower (16.7% compared with 21.6%).As a result of the higher hit rate and lower false alarm rate, the Bayesian true skill score is 77.2%, compared with 64.7% for the operational mask.Table 5 provides the confusion matrix for the Bayesian and operational cloud masks across the 31 scenes.The performance metrics were calculated using 4,958,105 observations with 1,530,831 cloudy pixels and 3,427,274 clear-sky pixels.The numbers in bold are the observations correctly classified by each algorithm, whilst the off-diagonal terms are the number of misclassified observations.Despite the larger number of clear-sky observations compared to cloud, which in terms of performance metrics would typically benefit an algorithm with a tendency to under-flag (missing some cloud, but with a low false alarm rate), the strength of the Bayesian mask is seen in the larger numbers of correctly classified pixels in both classes-cloud and clear-sky (total 4,294,347)-compared with the operational mask (4,014,623).Any cloud detection algorithm may produce a poor cloud mask for particular scenes, but there are also characteristic differences between cloud detection algorithms, repeated across many images, that underly the overall performance metrics, particularly with respect to the false alarm rate.The examples in Figure 3 typify the performance of the Bayesian and operational cloud masks relative to the manually masked datasets.Here, we see the manual mask on the left of the image, the Bayesian mask in the centre and the operational mask on the right.Pixels masked as cloud are shown in white, clear-sky-over-land in green and water pixels (where the masks are not compared) are shown in blue.
The Bayesian cloud mask is often (but not exclusively) more conservative with respect to the edges of cloud features than the manual mask, extending the footprint of the cloud.This is true for both datasets, manually masked using different systems and by different experts.This will contribute to some false flagging of clear-sky pixels.In the operational cloud mask (Figure 3, right), there is a tendency to mistakenly flag surface features as clouds, as seen in the centre and on the left side of the image.This contributes in a large part to the higher false alarm rate in the operational cloud mask when compared to the Bayesian mask.The differences in hit rates arise from the operational mask missing some cloud features and underestimating the cloud extent for others.Any cloud detection algorithm may produce a poor cloud mask for particular sc but there are also characteristic differences between cloud detection algorithms, rep across many images, that underly the overall performance metrics, particularly wi spect to the false alarm rate.The examples in Figure 3 typify the performance of the B ian and operational cloud masks relative to the manually masked datasets.Here, w the manual mask on the left of the image, the Bayesian mask in the centre and the o tional mask on the right.Pixels masked as cloud are shown in white, clear-sky-over in green and water pixels (where the masks are not compared) are shown in blue.3) between the manually masked dataset the Bayesian cloud mask (center), and the operational cloud mask (right).The x and y axes are the column and row numbers of the cloud mask arrays, included as a reference for image parison.

Bayesian
The Bayesian cloud mask is often (but not exclusively) more conservative wi spect to the edges of cloud features than the manual mask, extending the footprint cloud.This is true for both datasets, manually masked using different systems an different experts.This will contribute to some false flagging of clear-sky pixels.In th erational cloud mask (Figure 3, right), there is a tendency to mistakenly flag surfac tures as clouds, as seen in the centre and on the left side of the image.This contribu a large part to the higher false alarm rate in the operational cloud mask when com to the Bayesian mask.The differences in hit rates arise from the operational mask m some cloud features and underestimating the cloud extent for others.

Application to CDR Generation
The applicability of the Bayesian cloud detection to many sensors is advantageo the generation of Climate Data Records (CDRs).Since sensor differences are repres by the forward model and a few uncertainty parameters, the Bayesian framework dependent on specific instrument characteristics, equator overpass time or channel ability at specific wavelengths, meaning that it can be readily applied to data from d ent sensors.This is often required in the generation of CDRs spanning 20-30+ years structed using observations from multiple sensors.
For CDR generation, a similar cloud mask performance over time and between sors is needed for observational stability, i.e., to avoid apparent changes in LST sta arising in reality from cloud masking discontinuities.Does consistency in the cloud m ing technique (Bayesian) across sensors give consistent cloud mask performance?3) between the manually masked dataset (left), the Bayesian cloud mask (center), and the operational cloud mask (right).The x and y axes labels are the column and row numbers of the cloud mask arrays, included as a reference for image comparison.

Application to CDR Generation
The applicability of the Bayesian cloud detection to many sensors is advantageous in the generation of Climate Data Records (CDRs).Since sensor differences are represented by the forward model and a few uncertainty parameters, the Bayesian framework is not dependent on specific instrument characteristics, equator overpass time or channel availability at specific wavelengths, meaning that it can be readily applied to data from different sensors.This is often required in the generation of CDRs spanning 20-30+ years, constructed using observations from multiple sensors.
For CDR generation, a similar cloud mask performance over time and between sensors is needed for observational stability, i.e., to avoid apparent changes in LST statistics arising in reality from cloud masking discontinuities.Does consistency in the cloud masking technique (Bayesian) across sensors give consistent cloud mask performance?
We investigated the consistency of the Bayesian cloud detection algorithm across multiple sensors by considering an LST CDR constructed using data from the second Along-Track Scanning Radiometer (ATSR-2), AATSR, the Moderate Resolution Infrared Spectroradiometer (MODIS Terra) and the Sea and Land Surface Temperature Radiometer (SLSTR-A).These sensors are aboard polar-orbiting satellites with equator overpass times of 10:00 (AATSR and SLSTR) or 10:30 (ATSR-2 and MODIS) local time.Each sensor pair in this CDR (ATSR-2 and AATSR, AATSR and MODIS, MODIS and SLSTR) has some overlap, and in this overlap period we compared the cloud detection hit rate for each sensor using the Bayesian algorithm.
The hit rate was calculated with reference to a series of in situ ground stations with ceilometers, which measure cloud base height.The ceilometer locations are globally distributed, as shown in Table 6.Each ceilometer record has a different length, with three sites (North Slope, Ny Alesund, and Southern Great Plains [39][40][41]) covering all three satellite-overlap periods and the remainder covering only a subset of the overlaps.The satellite data were matched to the ceilometer observations with a spatial separation of 1 km and within a time window of 5 min.The temporal difference in the match-up is related to the frequency of the in situ observations.The mean and median time differences were typically within 30 s for all ground stations, with the exception of Ny Alesund and Sabana, where the in situ observations are less frequent.The ceilometer observation is matched to the satellite pixel in which the cloud base is observed, taking into account the cloud base height and viewing geometry of the sensor.The satellite zenith angle in the possible matches is limited to +/−22 degrees to ensure consistency in comparison between the narrow-swath ATSR sensors and the MODIS and SLSTR instruments.These matches were then used to calculate the cloud hit rate.The hit rates for the sensor overlap periods are compared in Table 7, with the corresponding number of matches from which the hit rate was calculated shown in Table 8.For ATSR-2 and AATSR the overlap period is from 09/2002-02/2003, for AATSR and MODIS, 02/2011-03/2012 and for MODIS and SLSTR 06/2016-07/2017.The full overlap period is used for each location, where ceilometer data are available.The latter two comparison periods typically have more matches, as the overlap period between sensors is longer.The number of matches in Puerto Rico was small due to the relative infrequency of both the in situ observations and satellite overpasses (the location is closer to the equator than many of the other ground stations) in addition to the in situ data record finishing in 12/2016, partway through the overlap period.The hit rates were assessed across all sensor pairs, over a wide range of geographical locations and surface biomes, and at different points in the CDR time series.The mean difference in hit rate is 4.5% between ATSR-2 and AATSR, 4.9% between AATSR and MODIS, and 2.5% between MODIS and SLSTR, the latter comparison benefitting from a greater number of available satellite to in situ matches.In comparison, the mean difference in the hit rate for the operational cloud mask between ATSR-2 and AATSR was greater at 12.1% (not shown), suggesting a greater difficulty in obtaining threshold-based algorithm consistency across different satellite sensors (this is the only overlapping pair sharing the same operational cloud-mask formulation).Differences in the absolute hit rate for both cloud masks did occur between sites (ranging between 70-100% for the Bayesian and 33.3-100% for the operational), reflecting some remaining challenges in optimising cloud detection across all land surfaces.The largest inter-sensor differences for a single match-up location occurred for the North Slope of Alaska, as seen for the overlap between AATSR and MODIS.Here, we see a step-change in the performance of the cloud detection between the two ATSR instruments and the other sensors.From these data alone, it was difficult to determine the exact reason for the magnitude of the difference at this specific location, but some variation was expected as the hit rate was calculated from an independent set of matches for each sensor, and there was a 30 min time difference in the equator overpass time for consecutive sensors in the time series.
Figure 4 compares the hit rate difference across all sensor pairs and locations for the Bayesian and operational masks.All of the operational masks used threshold-based techniques: the SADIST mask [37,38] is used for ATSR-2 and AATSR, the MOD35_L2 products for MODIS Terra [46] and the summary cloud mask for SLSTR-A [47] (which is an evolution of the SADIST mask applied to the ATSR instruments).The median percentage difference was 2.7% for the Bayesian cloud detection compared with 10.3% for the operational masks.The interquartile range was smaller for the Bayesian mask (3.9% compared with 10.95% for the operational masks), with the 75th percentile of the Bayesian hit rate comparison (5.1%) falling below the 25th percentile for the operational masks (6.7%).The smaller percentage differences in the hit rate between overlapping sensors and the reduced data spread in the Bayesian mask comparison suggest that applying this cloud mask to all sensors in a CDR will give better observational stability than using the operational cloud masks for each sensor.
Remote Sens. 2022, 14, x FOR PEER REVIEW 12 of 15 for MODIS Terra [46] and the summary cloud mask for SLSTR-A [47] (which is an evolution of the SADIST mask applied to the ATSR instruments).The median percentage difference was 2.7% for the Bayesian cloud detection compared with 10.3% for the operational masks.The interquartile range was smaller for the Bayesian mask (3.9% compared with 10.95% for the operational masks), with the 75th percentile of the Bayesian hit rate comparison (5.1%) falling below the 25th percentile for the operational masks (6.7%).The smaller percentage differences in the hit rate between overlapping sensors and the reduced data spread in the Bayesian mask comparison suggest that applying this cloud mask to all sensors in a CDR will give better observational stability than using the operational cloud masks for each sensor.

Conclusions
In this paper we present a Bayesian cloud detection scheme as applied to satellite observations over land for the purpose of generating LST CDRs.We assessed the Bayesian algorithm using two manually masked datasets for the AATSR sensor in comparison with

Conclusions
In this paper we present a Bayesian cloud detection scheme as applied to satellite observations over land for the purpose of generating LST CDRs.We assessed the Bayesian algorithm using two manually masked datasets for the AATSR sensor in comparison with the operational threshold-based algorithm.The Bayesian cloud detection hit rate was 7.9% greater, and the false alarm rate was 4.9% smaller than the operational cloud mask.This mean improvement in cloud detection skill was found across a wide range of conditions present in the validation dataset.
We also considered the suitability of the Bayesian algorithm for application to multiple satellite sensors when generating climate data records.The cloud detection hit rates between the sensors in overlapping time periods were consistent on average to about 5% using the Bayesian algorithm, compared to 22.5% for the operational cloud masks.Observational stability between sensors is important to avoid apparent changes in any CDR variable (not just LST) that results from the discontinuities in data processing over time, including cloud-masking.The assessment of cloud mask consistency between sensors as presented in this paper would be recommended when producing CDRs dependent on cloud masking as a pre-processing step.This could be achieved by using in situ ceilometer data for validation as presented here, or by using other forms of data as a reference, e.g., manually masked scenes, or sets of pre-classified pixels.
Some further work remains to ensure uniformity of cloud mask performance across different geographical locations and land surface types, as the absolute hit rate is variable between the validation sites (this is true of both the Bayesian and operational cloud masks).The Bayesian cloud mask performance can be increased over cold and/or bright surfaces, including snow, ice and deserts.More work on cloud detection in these regions is required to provide global consistency in cloud-masking skill.

Figure 1 .
Figure 1.Gaussian mixture model illustration: the original PDF (blue) and additional PDF (red) are added to give a new PDF (black dashed line).The new PDF is centered on the sum of the mean and sigma of the original PDF and has a sigma twice that of the original PDF.

Figure 1 .
Figure 1.Gaussian mixture model illustration: the original PDF (blue) and additional PDF (red) are added to give a new PDF (black dashed line).The new PDF is centered on the sum of the mean and sigma of the original PDF and has a sigma twice that of the original PDF.

Figure 2 .
Figure 2. PDF 1 (Table 1) plots for different biome classifications.Plots show the 11.0 μmsurface temperature (d110-surft) as a function of the 11.0-12.0μm channel difference (d110 PDFs are for near-nadir observations (the first atmospheric path length bin) and summed ov NWP surface temperature bins.The colour bars use consistent shading between plots with dif maxima dependent on the data density.

Figure 2 .
Figure 2. PDF 1 (Table 1) plots for different biome classifications.Plots show the 11.0 µm-NWP surface temperature (d110-surft) as a function of the 11.0-12.0µm channel difference (d110-120).PDFs are for near-nadir observations (the first atmospheric path length bin) and summed over all NWP surface temperature bins.The colour bars use consistent shading between plots with different maxima dependent on the data density.

Figure 3 .
Figure 3. Cloud mask comparison for scene 15 (Table3) between the manually masked dataset the Bayesian cloud mask (center), and the operational cloud mask (right).The x and y axes are the column and row numbers of the cloud mask arrays, included as a reference for image parison.

Figure 3 .
Figure 3. Cloud mask comparison for scene 15 (Table3) between the manually masked dataset (left), the Bayesian cloud mask (center), and the operational cloud mask (right).The x and y axes labels are the column and row numbers of the cloud mask arrays, included as a reference for image comparison.

Figure 4 .
Figure 4. Box and whisker plot comparing the hit rate percentage difference between sensor pairs (ATSR2 vs. AATSR, AATSR vs. MODIS and MODIS vs. SLSTR) for all locations in Table 7 for the Bayesian cloud mask (bottom) and the various operational cloud masks (top).The red line shows the median difference and the edges of the boxes represent the 25th and 75th percentiles.

Figure 4 .
Figure 4. Box and whisker plot comparing the hit rate percentage difference between sensor pairs (ATSR2 vs. AATSR, AATSR vs. MODIS and MODIS vs. SLSTR) for all locations in Table 7 for the Bayesian cloud mask (bottom) and the various operational cloud masks (top).The red line shows the median difference and the edges of the boxes represent the 25th and 75th percentiles.

Table 1 .
PDF LUTs used to determine the probability of 'not clear' conditions given the observations and background state vector.

Table 2 .
[29]e definitions for PDF LUT binning over land surfaces as described by the Land Cover CCI Classes[28]and the ATSR LST Biome Classification (ALB2) classes[29].

Table 4 .
Cloud detection performance metrics for Bayesian and operational cloud masks compared to manually masked AATSR data from the SYNERGY and ESA DUE GlobTemperature projects.

Table 5 .
Confusion matrices for Bayesian and operational cloud masks compared to manually masked AATSR data from the SYNERGY and ESA DUE GlobTemperature projects.The numbers in bold are the observations correctly classified by each algorithm.

Table 7 .
Bayesian cloud detection hit rate percentage with reference to ceilometer observations for overlapping data between different sensors: ATSR-2 and AATSR, AATSR and MODIS, MODIS and SLSTR.

Table 8 .
Number of cloudy ceilometer observations compared when calculating the hit rate for overlapping data between different sensors: ATSR-2 and AATSR, AATSR and MODIS, MODIS and SLSTR.