3.2. Sensitivity to the Distribution of Relevant Variables
In order to study the sensitivity of the proposed database to some of the choices that were made, a set of experiments was performed. The baseline calibration dataset, which is based on a choice of profiles that is adequate to fill the TCWV/LST diagram, is referred to as WTS_−15_15 (TCWV is sometimes represented as W in the literature, and LST as TS). A different criterion could have been adopted to choose a few calibration profiles from the more than 15,000 profiles in the SeeBor database, such as ensuring a flat distribution of TCWV. This criterion was adopted, together with the wide geographical distribution criterion of WTS_−15_15, for experiments FLAT14_−15_15 and FLAT10_−15_15. The difference between the two is that for the former, 14 profiles per TCWV class were chosen (112 profiles, vs. 116 in WTS_−15_15), and for the latter only 10 (leading to a total of 80 profiles). The goal was to test the relevance of the number of profiles, and of the respective joint LST/TCWV distribution, for the robustness of the regression coefficients. The statistical and geographical distributions of these databases are illustrated in Figure 8 and Figure 9. Large parts of the TCWV/LST diagram are not covered, such as the most extreme LST classes. In the intermediate TCWV classes, a large number of the cases fall in the same LST range, as these combinations are globally more frequent under clear-sky conditions and therefore also more frequent in the SeeBor database. Note that a few of the profiles are common to FLAT14_−15_15 and FLAT10_−15_15; this is because the selection algorithm is initiated with the same random seed, which generates the same random number sequence in all the experiments. The geographical distributions show that relatively fewer profiles over land are selected, which might be explained by the fact that the inclusion of more extreme situations was not a requirement.
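The FLAT-type selection described above (a fixed number of profiles per TCWV class, drawn with a common random seed) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function and variable names are hypothetical, and NumPy is assumed:

```python
import numpy as np

def select_flat_tcwv(tcwv, n_per_class, bin_edges, seed=0):
    """Randomly pick up to n_per_class profiles per TCWV class.

    tcwv      : 1-D array of TCWV values, one per candidate profile.
    bin_edges : edges of the TCWV classes (cm).
    seed      : fixed across experiments, so reruns with a different
                n_per_class share part of the selection, as noted in
                the text for FLAT14 vs. FLAT10.
    """
    rng = np.random.default_rng(seed)          # same seed for every experiment
    classes = np.digitize(tcwv, bin_edges)     # class label per profile
    selected = []
    for c in np.unique(classes):
        idx = np.flatnonzero(classes == c)     # candidates in this class
        take = min(n_per_class, idx.size)      # class may hold too few profiles
        selected.extend(rng.choice(idx, size=take, replace=False))
    return np.sort(np.asarray(selected))       # indices of chosen profiles
```

With 8 TCWV classes, `n_per_class=14` yields the 112 profiles of FLAT14_−15_15 and `n_per_class=10` the 80 of FLAT10_−15_15.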
Another factor that greatly influences the robustness of the coefficients is the TS − Tair difference. Therefore, we tested a few variants of the WTS_−15_15 database, varying the lower and upper limits of the prescribed TS − Tair difference, always using steps of 5 K. These experiments are referred to as WTS_−10_10, WTS_−10_15, WTS_−10_20, WTS_−15_20, WTS_−20_15, WTS_−20_20, WTS_−20_25 and WTS_−25_25 (the numbers in the experiment names refer to the lower and upper limits of TS − Tair, in K). All these choices of calibration databases were tested in both the GSW and the MW formulations, and the same validation database was used to assess their statistical properties. The set of sensitivity experiments is described in Table 1.
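Since each "WTS" variant only changes the limits of the prescribed TS − Tair offsets, generated in 5 K steps, the set of LST values prescribed for one profile can be enumerated with a few lines. The sketch below is illustrative (names are hypothetical), assuming LST is built from each profile's near-surface air temperature plus the offsets:

```python
import numpy as np

def prescribed_lst_values(t_air, lower, upper, step=5.0):
    """LST values prescribed for one profile: Tair plus every
    TS - Tair offset between `lower` and `upper`, in `step` K steps.

    For WTS_-15_15 the offsets are -15, -10, ..., 15 K (7 values).
    """
    offsets = np.arange(lower, upper + step / 2, step)  # inclusive upper limit
    return t_air + offsets
```

Widening the limits (e.g., WTS_−25_25) adds more extreme LST/air-temperature combinations per profile to the calibration set.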
The results of the sensitivity experiments are summarized in Table 2 and Table 3 for the GSW and MW algorithms, respectively. Both algorithms were adjusted using the different calibration databases described above and assessed using a common and independent validation database. In Table 2 and Table 3, values of the overall bias and RMSE are indicated, as well as their variability among the TCWV/ZVA classes (via the standard deviation of the bias and RMSE, respectively, obtained per TCWV/ZVA class). The GSW model shows a slightly higher bias and RMSE with the FLAT approach than with WTS. The variabilities are also larger for the FLAT-type databases, which means that some classes are not as well represented when using this approach.
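The statistics reported in the tables (overall bias and RMSE, plus their spread across TCWV/ZVA classes) can be computed as sketched below. This is a minimal illustration with hypothetical names, where `error` denotes retrieved minus reference LST and `class_id` labels each validation case's TCWV/ZVA class:

```python
import numpy as np

def class_statistics(error, class_id):
    """Overall bias/RMSE, plus their standard deviation across classes.

    The per-class standard deviations measure how unevenly the
    algorithm performs over the TCWV/ZVA classes.
    """
    overall_bias = error.mean()
    overall_rmse = np.sqrt(np.mean(error ** 2))
    biases, rmses = [], []
    for c in np.unique(class_id):
        e = error[class_id == c]
        biases.append(e.mean())
        rmses.append(np.sqrt(np.mean(e ** 2)))
    return overall_bias, overall_rmse, np.std(biases), np.std(rmses)
```

A low overall RMSE with a high RMSE spread would flag a calibration database that works well on average but poorly for specific classes, which is exactly the failure mode of the FLAT-type databases.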
The set of experiments summarized in Table 1 also suggests high sensitivity to the lower and upper limits of the TS − Tair difference prescribed in the calibration databases, as this range is the only condition that changes among the experiments denoted "WTS". The results presented in Table 1 suggest that it is hard to tell which combination is the best. In general, widening the range of possible TS − Tair values seems to worsen the overall RMSE, although there are a few exceptions. Another discernible pattern regards the sign and magnitude of the overall bias: increasing the upper limit increases the bias (i.e., it becomes "more positive"); conversely, decreasing the lower limit seems to make the bias more negative. Well-balanced ranges (with the absolute values of the lower and upper limits close to each other) seem to lower the variability of the statistics.
In the case of the MW model, the experiments show even less clear-cut results. In fact, the case with the most favorable error statistics is arguably FLAT10_−15_15, with a lower absolute value of the bias and of the bias variability, an overall RMSE that is comparable to that of the baseline experiment, and less variability among classes. For the MW model, the experiment with the smallest RMSE is WTS_−10_10 (about 1.97 K); however, it also has the worst absolute value of the bias, 0.55 K. As in the case of the GSW model, there is also a tendency towards worse RMSE values for wider TS − Tair ranges.
These results suggest that the configuration of an appropriate calibration database may vary with the algorithm to be used and with the area coverage, as the distribution of the variables analyzed above (most notably the TS − Tair difference) over the area of interest may support the exclusion of more extreme, non-relevant cases. The choice of profiles from a SeeBor-like database is non-trivial, but basing the choice on fully covering the bivariate TCWV/LST distribution over the respective region of interest seems to have some advantages. It is worth noting that covering the most frequent classes in the TCWV/LST diagram leads, as expected, to better overall statistics, as those classes will be the most frequent in the validation database (and also in real applications). In Figure 10, the overall statistics are analyzed for the FLAT14_−15_15 calibration database, which, despite having a number of profiles comparable to WTS_−15_15 and much larger than FLAT10_−15_15, shows overall worse performance than both. The analysis of the bias as a function of TCWV (Figure 10c) clearly shows that some classes are affected by large negative biases (between 2 and 3 cm, and around 5 cm), while between 3 and 4 cm the bias is positive; the ZVA dependency seems less important in the analyzed case. This shows that even with a flat distribution of TCWV, the performance of the model will depend on the TCWV, suggesting that combined distributions of the variables relevant to the problem need to be taken into account. In practice, this would translate into a roughly latitude-dependent bias (following the latitude dependence of TCWV), which is something that should be avoided in global datasets.
In order to explore the effect of the prescribed TS − Tair differences on the representation of the most extreme cases, boxplots of the error distribution (retrieved minus reference LST) were calculated by class of TS − Tair in the validation database, and also as a function of the TCWV class, for two of the proposed experiments: MW calibrated using WTS_−15_15 and WTS_−25_25, respectively, as shown in Figure 11 and Figure 12. Some classes contained only a few cases, reflecting the fact that largely negative differences rarely occur, and do so only in very dry atmospheres; these classes were therefore merged into a single class to improve the readability of the figures. Large positive differences are more frequent and may occur in all types of atmospheres. The comparison of the error distributions shown in Figure 11 and Figure 12
indicates that only a few classes seem to be statistically affected by the applied temperature difference range. In drier atmospheres (TCWV < 3 cm) the effect is in fact negligible, since under these conditions the TOA brightness temperatures are highly dominated by the surface-emitted signal (i.e., by LST and surface emissivity). In most cases, the only noticeable effect is an increase in the range of the error when the temperature difference range widens, even in classes that are covered by both calibration databases. This is what causes the overall loss of performance of the database with the wider temperature ranges, since those classes are more populated than the classes with more extreme temperature differences. It is also worth noticing that extending the temperature difference range does not necessarily lead to a better representation of the extreme cases. When TS − Tair is large and positive, the surface sensible heat flux is likely to generate a convective boundary layer, which is often topped by a temperature inversion [33]. It is well known that large LST retrieval errors occur under very moist atmospheres (e.g., [20]). If, on top of such conditions, a convective boundary layer develops, the height of the largest thermal and moisture gradients may be shifted upwards, and therefore the peak of the thermal weighting function of (split-)window channels may also be shifted upwards [34], which makes it harder to disentangle the surface emission (LST and emissivity) from the signal emitted by the lower atmosphere. Some currently used schemes address this issue by using different coefficients for day and night retrievals (e.g., [37]), which somewhat tunes the LST algorithms to different structures of the atmospheric boundary layer but introduces an additional discontinuity in the algorithm coefficients; other schemes use additional information from numerical weather prediction models regarding the near-surface air temperature (which may also bring additional model forecast errors into the retrieval). Although not shown, the GSW model seems much less sensitive to these effects, as the boxplot diagrams corresponding to those of Figure 11 and Figure 12 (shown for the MW algorithm) are much closer to each other in the GSW case. In summary, extending the range of TS − Tair values to include the most extreme cases may not be beneficial for the overall performance of the retrievals, because it can lead to higher errors in the more frequent classes, without significant compensation from the classes with more extreme situations.
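The per-class error distributions behind such boxplots can be summarized numerically as sketched below. This is an illustrative sketch with hypothetical names (not the authors' code); it groups errors by TS − Tair class, with all values below the first bin edge falling into one class, mirroring the merging of the rare, strongly negative differences described above:

```python
import numpy as np

def error_quartiles_by_class(error, ts_minus_tair, bin_edges):
    """Quartiles of the LST error per TS - Tair class, as a boxplot shows.

    error         : retrieved minus reference LST per validation case.
    ts_minus_tair : prescribed TS - Tair difference per case (K).
    bin_edges     : class boundaries; values below the first edge are
                    all assigned to class 0 (the merged class).
    """
    classes = np.digitize(ts_minus_tair, bin_edges)
    summary = {}
    for c in np.unique(classes):
        e = error[classes == c]
        summary[int(c)] = (np.percentile(e, 25),   # lower quartile
                           np.percentile(e, 50),   # median
                           np.percentile(e, 75))   # upper quartile
    return summary
```

Comparing such summaries for two calibration variants (e.g., WTS_−15_15 vs. WTS_−25_25) reproduces the kind of class-by-class comparison made between Figure 11 and Figure 12.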